
[Deep Learning Paper Notes][Weight Initialization] A Reading Guide to Selected Papers on Parameter Initialization

Training a CNN is hard because

• A large number of parameters requires heavy computation.

• The learning objective is non-convex and has many poor local minima.

• Deep networks suffer from the vanishing/exploding gradient problem.

• A large amount of training data is required.

To handle vanishing/exploding gradients, the main approaches include

• Carefully setting the learning rate.

• Designing better CNN architectures, activation functions, etc.

• Carefully initializing the weights.

• Tuning the data distribution.

In this section we will focus on the last two approaches to handling the vanishing/exploding gradients problem.

Weight initialization is very important in deep learning. I think one of the reasons that early networks did not work as well is that people did not pay much attention to it.

Initializing all the weights to 0 is a bad idea, since all the neurons then learn the same thing. In practice, a popular choice is to initialize the weights from N(0, 0.01^2) or a uniform distribution and the biases to the constant 0. But this does not work when training a very deep network from scratch, as it leads to extremely large or diminishing outputs/gradients. Large weights lead to divergence, while small weights do not allow the network to learn.
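To make this failure mode concrete, here is a small NumPy sketch (not from the original note; the depth, batch size, and layer widths are made-up values) showing how N(0, 0.01^2) weights shrink the activations of a deep linear stack:

```python
import numpy as np

# Toy demonstration of why N(0, 0.01^2) weights fail in a very deep network:
# the activations shrink layer after layer. All sizes below are hypothetical.
rng = np.random.default_rng(0)
h = rng.standard_normal((128, 512))              # a hypothetical input batch

for layer in range(10):
    W = rng.normal(0.0, 0.01, size=(512, 512))   # popular but naive choice
    b = np.zeros(512)                            # biases set to constant 0
    h = h @ W + b                                # linear layers only, for clarity
    print(f"layer {layer + 1}: std of outputs = {h.std():.2e}")

# The printed std collapses toward 0 -- the diminishing-output problem above.
```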

[Glorot and Bengio 2010] proposed Xavier initialization to keep the variance of each neuron the same across layers, under the assumption that no non-linearity exists between layers. A layer with many inputs gets smaller weights, and a layer with fewer inputs gets larger weights. But Xavier initialization breaks down when using the ReLU non-linearity: ReLU basically kills half of the distribution, so the output variance is halved. [He et al. 2015] extended Xavier initialization to the ReLU non-linearity by doubling the variance of the weights. [Sussillo and Abbott 2014] kept the norm of the backpropagated errors constant. [Saxe et al. 2013] showed that orthonormal matrix initialization works better than Gaussian noise for linear networks, and that it also works for networks with non-linearities.
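As a rough illustration, here is a NumPy sketch (not taken from these papers; the function names are mine) of the three formula-based schemes just mentioned, namely Xavier, He, and orthonormal initialization:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Glorot & Bengio 2010 (Gaussian form): Var(W) = 2 / (fan_in + fan_out)
    # keeps the activation variance roughly constant across linear layers.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # He et al. 2015: ReLU zeroes half of the distribution, so double the
    # variance relative to Xavier: Var(W) = 2 / fan_in.
    std = np.sqrt(2.0 / fan_in)
    return np.random.normal(0.0, std, size=(fan_in, fan_out))

def orthonormal_init(fan_in, fan_out):
    # Saxe et al. 2013: initialize with an orthonormal matrix, obtained here
    # from the QR decomposition of a Gaussian random matrix.
    # (Assumes fan_in >= fan_out so that q has shape (fan_in, fan_out).)
    a = np.random.standard_normal((fan_in, fan_out))
    q, _ = np.linalg.qr(a)
    return q
```

In practice, the He variant is the usual choice for ReLU networks, while Xavier remains common for tanh/sigmoid layers.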
[Krähenbühl et al. 2015] and [Mishkin and Matas 2015] did not give a closed-form formula for initialization; instead, they proposed data-driven initialization schemes that iteratively rescale the weights so that the neurons have roughly unit variance.
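A minimal sketch of the data-driven idea for a single fully connected layer, in the spirit of these papers rather than their exact algorithms (the function and parameter names are hypothetical):

```python
import numpy as np

def rescale_to_unit_variance(W, x_batch, tol=0.05, max_iters=10):
    # Data-driven rescaling: using a sample batch, iteratively rescale W
    # until the layer's outputs have roughly unit variance.
    for _ in range(max_iters):
        var = (x_batch @ W).var()
        if abs(var - 1.0) < tol:
            break
        W = W / np.sqrt(var)   # divide by the output std to push variance to 1
    return W
```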

[Ioffe and Szegedy 2015] inserted batch normalization layers to make the output neurons have roughly unit Gaussian distributions, which reduces the strong dependence on initialization. They also added scale and shift operations to preserve the capacity of the network.
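For reference, a simplified sketch of the batch normalization forward pass (training-time only; the running statistics used at inference are left out):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch to zero mean / unit variance,
    # then apply the learned scale (gamma) and shift (beta) so that the
    # layer can still represent the identity transform (capacity preserved).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```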

Notes on the papers mentioned above can be found at the following links (in order):

[Deep Learning Paper Notes][Weight Initialization] Understanding the difficulty of training deep feedforward neural networks

[Deep Learning Paper Notes][Weight Initialization] Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification

[Deep Learning Paper Notes][Weight Initialization] Random walk initialization for training very deep feedforward networks

[Deep Learning Paper Notes][Weight Initialization] Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

[Deep Learning Paper Notes][Weight Initialization] Data-dependent Initializations of Convolutional Neural Networks

[Deep Learning Paper Notes][Weight Initialization] All you need is a good init

[Deep Learning Paper Notes][Weight Initialization] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift