[Deep Learning Paper Notes][Weight Initialization] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
2016-09-20 13:54
Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015). [Citations: 532].
1 Motivations
[Internal Covariate Shift] The distribution of each layer's inputs changes during training, as the parameters of all preceding layers change.
• Small changes to the network parameters amplify as the network gets deeper.
• Layers must continuously adapt to new input distributions.
• The optimizer is more likely to get stuck in the saturated regime of the nonlinearity, which slows convergence.
[Goal] Keep each layer's input distribution fixed over time to make training easier.
2 Methods
[Idea] Normalize the network's inputs to have mean 0 and covariance I (whitening).
• This fixes the input distributions and removes the ill effects of internal covariate shift.
• But full whitening is expensive, so instead normalize each feature dimension independently, i.e., make each x_j have mean 0 and variance 1.
• The μ_j's and σ_j's are estimated from each mini-batch.
[Issue] Normalization may change what the network can represent.
• E.g., normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.
[Solution] Scale and shift the normalized value: a_j = γ_j x̂_j + β_j.
• The γ_j's and β_j's are learned from data.
• If γ_j = σ_j and β_j = μ_j, then a_j = x_j, so the identity transform can be recovered.
[Batch Normalizing Transform Algorithm] See Alg. 1.
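A minimal NumPy sketch of the Batch Normalizing Transform (the per-dimension statistics, normalization, and learned scale/shift described above); function and variable names are illustrative, not from the paper:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform for an (m, d) mini-batch x."""
    mu = x.mean(axis=0)                    # per-dimension mini-batch mean
    var = x.var(axis=0)                    # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to mean 0, variance 1
    return gamma * x_hat + beta            # learned scale and shift

# Identity check: with gamma = sigma and beta = mu, BN(x) recovers x.
x = np.random.randn(64, 3) * 2.0 + 5.0
mu, sigma = x.mean(axis=0), x.std(axis=0)
assert np.allclose(batch_norm_forward(x, sigma, mu), x, atol=1e-3)
```

The final assertion mirrors the bullet above: choosing γ_j = σ_j and β_j = μ_j makes the transform an identity.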
[Backpropagation] Gradients are backpropagated through the normalization itself, since μ_B and σ_B^2 depend on the mini-batch.
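The chain rule through the normalization gives the gradients below (as in the paper), where $\hat{x}_i = (x_i - \mu_B)/\sqrt{\sigma_B^2 + \epsilon}$ and $y_i = \gamma \hat{x}_i + \beta$ over a mini-batch of size $m$:

```latex
\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \cdot \gamma
\qquad
\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}\,\hat{x}_i
\qquad
\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}
```

```latex
\frac{\partial \ell}{\partial \sigma_B^2}
  = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\,(x_i - \mu_B)
    \cdot \left(-\tfrac{1}{2}\right)(\sigma_B^2 + \epsilon)^{-3/2}
```

```latex
\frac{\partial \ell}{\partial \mu_B}
  = \left( \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}
    \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} \right)
  + \frac{\partial \ell}{\partial \sigma_B^2}
    \cdot \frac{\sum_{i=1}^{m} -2(x_i - \mu_B)}{m}
```

```latex
\frac{\partial \ell}{\partial x_i}
  = \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}}
  + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{2(x_i - \mu_B)}{m}
  + \frac{\partial \ell}{\partial \mu_B} \cdot \frac{1}{m}
```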
[Testing] At test time, use moving averages of μ_j and σ_j^2 computed over training mini-batches instead of batch statistics.
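A sketch of the train/test split just described, keeping exponential moving averages during training and using them at test time; the class name and `momentum` hyperparameter are illustrative assumptions, and the paper's unbiased m/(m−1) variance correction is omitted for brevity:

```python
import numpy as np

class BatchNorm1d:
    """Per-dimension BN with moving averages for test-time inference."""
    def __init__(self, d, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(d), np.zeros(d)
        self.running_mean, self.running_var = np.zeros(d), np.ones(d)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Track moving averages over training mini-batches.
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mu
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Test time: use the fixed population estimates instead.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

At test time the transform becomes a fixed affine map per dimension, so the output for a given example no longer depends on what else is in the batch.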
[BN in Convolutional Networks] Insert BN between the conv layer and the ReLU.
• A conv layer's output is more likely to be Gaussian.
• The output distribution of other nonlinearities is likely to change shape during training.
We want different elements of the same feature map, at different spatial locations, to be normalized in the same way.
• Hence we learn one pair of parameters γ_j and β_j per feature map.
• The effective mini-batch size is m·H·W.
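The per-feature-map variant above can be sketched as follows: statistics are pooled over the batch and both spatial axes, so each channel's m·H·W values share one (γ, β) pair (the function name is illustrative):

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """Spatial BN for an (m, C, H, W) batch: one (gamma, beta) pair per
    feature map, with statistics over the m*H*W values of each channel."""
    m, C, H, W = x.shape
    # Effective mini-batch size per channel is m*H*W.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, C, 1, 1) * x_hat + beta.reshape(1, C, 1, 1)
```

Reducing over axes (0, 2, 3) rather than just the batch axis is what makes all spatial locations of a feature map normalize identically.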
3 Advantages
[BN Enables Higher Learning Rates] BN prevents small changes to the parameters from amplifying into large, suboptimal changes in activations and gradients.
[BN Reduces the Strong Dependence on Initialization]
[BN Regularizes the Model] Each training example is seen in conjunction with the other examples in its mini-batch, which acts as a form of regularization and may slightly reduce the need for dropout.
[BN Helps the Network Train Faster and Achieve Higher Accuracy]
[Experiment] GoogLeNet + BN + ensemble of 6 networks
• 4.9% top-5 validation error.
• 3.8% test error.
• Exceeding the accuracy of human raters.
• Cf. the original GoogLeNet ensemble: 6.67%.
4 Notes
[Internal Covariate Shift Revisited] Having the same mean and variance for each input dimension does not mean the full data distributions are the same.
• In practice, BN mainly helps prevent vanishing gradients.
• BN rescales activations that used to be small up to a larger, healthier scale.
[When to Use BN] When learning is slow, or when you encounter exploding gradients.