[Deep Learning Paper Notes][Weight Initialization] A Guided Reading of Papers on Parameter Initialization
2016-09-22 14:33
Training a CNN is hard because
• The large number of parameters requires heavy computation.
• The learning objective is non-convex and has many poor local minima.
• Deep networks suffer from the vanishing/exploding gradients problem.
• A large amount of training data is needed.
To handle vanishing/exploding gradients, the main methods include
• Carefully setting the learning rate.
• Designing better CNN architectures, activation functions, etc.
• Carefully initializing the weights.
• Tuning the data distribution.
In this section we will focus on the last two topics.
Weight initialization is very important in deep learning. I think one of the reasons that early networks did not work as well is that people did not pay enough attention to it.
Initializing all the weights to 0 is a bad idea, since all the neurons then learn the same thing. In practice, a popular choice is to initialize the weights from N(0, 0.01^2) or a uniform distribution and the biases with the constant 0. But this does not work when training a very deep network from scratch: it leads to extremely large or diminishing outputs/gradients. Large weights cause divergence, while small weights do not allow the network to learn.
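The failure mode with small Gaussian weights can be seen in a few lines. This is a minimal sketch (not from the original post, layer sizes are arbitrary) that pushes a batch through a stack of tanh layers initialized from N(0, 0.01^2) and tracks the activation scale:

```python
import numpy as np

# Sketch: with N(0, 0.01^2) weights, activations shrink layer after layer,
# so the gradients flowing back through these layers vanish as well.
rng = np.random.default_rng(0)
h = rng.standard_normal((512, 256))    # a batch of 512 inputs, 256 features

stds = []
for _ in range(10):                    # 10 tanh layers, all 256 -> 256
    W = rng.normal(0.0, 0.01, size=(256, 256))
    h = np.tanh(h @ W)
    stds.append(float(h.std()))

print(stds[0], stds[-1])               # the scale collapses toward zero
```

Each layer multiplies the activation standard deviation by roughly 0.16 here, so after ten layers the signal (and hence the gradient) is effectively gone.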
[Glorot and Bengio. 2010] proposed Xavier initialization, which keeps the variance of each neuron the same across layers under the assumption that there is no non-linearity between layers. Many inputs correspond to smaller weights, and fewer inputs correspond to larger weights. But Xavier initialization breaks down with the ReLU non-linearity: ReLU basically kills half the distribution, so the output variance is halved. [He et al. 2015] extended Xavier initialization to the ReLU non-linearity by doubling the variance of the weights. [Sussillo and Abbott. 2014] kept the norm of the backpropagated errors constant. [Saxe et al. 2013] showed that orthonormal matrix initialization works better than Gaussian noise for linear networks; it also works for networks with non-linearities.
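The two closed-form schemes above can be sketched as follows (function names are my own; Xavier uses the averaged fan variant Var(W) = 2/(fan_in + fan_out), and He doubles the fan-in term to 2/fan_in to compensate for ReLU zeroing half the inputs):

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=None):
    # Glorot & Bengio 2010: keep activation variance constant across
    # (assumed linear) layers by averaging fan_in and fan_out.
    if rng is None:
        rng = np.random.default_rng()
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng=None):
    # He et al. 2015: double the variance (2/fan_in) because ReLU
    # kills half the distribution and halves the output variance.
    if rng is None:
        rng = np.random.default_rng()
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```

With `he_init`, the variance of pre-activations stays roughly constant through a deep ReLU stack, which is exactly the property the Gaussian N(0, 0.01^2) recipe lacks.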
[Krähenbühl et al. 2015] and [Mishkin and Matas. 2015] did not give a closed-form formula for initialization; instead, they proposed data-driven approaches that iteratively rescale the weights until the neurons have roughly unit variance.
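The data-driven idea can be sketched roughly as follows (a simplified linear-layer version in the spirit of those papers, not their exact procedure; the function name and tolerances are my own): run a batch through each layer in turn and rescale that layer's weights until its output has roughly unit standard deviation.

```python
import numpy as np

def rescale_to_unit_variance(weights, x, tol=0.05, max_iter=20):
    # Data-driven rescaling sketch: walk through the layers with a real
    # batch and divide each weight matrix by its output std until the
    # pre-activations have roughly unit variance.
    h = x
    for W in weights:
        for _ in range(max_iter):
            std = float((h @ W).std())
            if abs(std - 1.0) < tol:
                break
            W /= std               # shrink or grow W toward unit output std
        h = h @ W                  # feed the rescaled output to the next layer
    return weights

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64))
weights = [rng.normal(0.0, 0.01, size=(64, 64)) for _ in range(5)]
weights = rescale_to_unit_variance(weights, x)
```

Because the rescaling uses actual data statistics, it adapts to whatever non-linearities and layer shapes the network has, which is why these methods need no closed-form formula.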
[Ioffe and Szegedy. 2015] inserted batch normalization layers to make the output neurons roughly unit Gaussian, which reduces the strong dependence on initialization. They also added scale and shift operations to preserve the network's capacity.
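The batch-normalization forward pass (training mode, fully-connected case) is short enough to sketch directly. The learnable scale gamma and shift beta are what preserve the capacity mentioned above: the network can undo the normalization if that is what the task needs.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension, then apply the
    # learnable scale (gamma) and shift (beta) from Ioffe & Szegedy 2015.
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly unit Gaussian
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(128, 16))     # badly scaled activations
y = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
```

At test time the batch statistics are replaced by running averages collected during training; that bookkeeping is omitted here for brevity.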
The notes on the papers mentioned above can be found at the following links (in order):
- [Deep Learning Paper Notes][Weight Initialization] Understanding the difficulty of training deep feedforward neural
- [Deep Learning Paper Notes][Weight Initialization] Delving deep into rectifiers: Surpassing human-level performance
- [Deep Learning Paper Notes][Weight Initialization] Random walk initialization for training very deep feedforward netw
- [Deep Learning Paper Notes][Weight Initialization] Exact solutions to the nonlinear dynamics of learning in deep lin
- [Deep Learning Paper Notes][Weight Initialization] Data-dependent Initializations of Convolutional Neural Networks
- [Deep Learning Paper Notes][Weight Initialization] All you need is a good init
- [Deep Learning Paper Notes][Weight Initialization] Batch Normalization: Accelerating Deep Network Training by Reducin