[Deep Learning Paper Notes][Weight Initialization] Random Walk Initialization for Training Very Deep Feedforward Networks
2016-09-20 10:08
Sussillo, David, and L. F. Abbott. “Random walk initialization for training very deep feedforward networks.” arXiv preprint arXiv:1412.6558 (2014). [Citations: 3].
1 Motivation
[Motivation] The vanishing gradient problem: in very deep networks, the backpropagated gradient magnitude shrinks (or explodes) exponentially with depth.
[Idea] Choose the initial weight scale so that the gradient norm stays, on average, the same across layers during backprop.
2 Linear Random Walk Initialization
[Network Form] A depth-D linear network: h^(l) = W^(l) h^(l-1) for l = 1, ..., D.
[Backprop] The error signal propagates as δ^(l-1) = (W^(l))^T δ^(l), so the per-layer gradient-norm ratio is z_l = ||δ^(l-1)||^2 / ||δ^(l)||^2 = ||(W^(l))^T u^(l)||^2, where u^(l) = δ^(l) / ||δ^(l)|| is a unit vector.
[Simplifications]
• All layers have the same width n.
• Initialize each entry of W^(l) i.i.d. from N(0, g^2/n), where g is a scale factor to be determined (g = 1 gives the baseline variance 1/n).
• (W^(l))^T u^(l) is Gaussian, since the product of a Gaussian matrix and a unit vector is a Gaussian vector (here with i.i.d. N(0, g^2/n) entries).
• z_l = ||(W^(l))^T u^(l)||^2 is the squared magnitude of that Gaussian vector, so z_l ~ (g^2/n) χ^2_n.
[Goal] Solving the vanishing gradient problem amounts to keeping the ratio z_l of order 1, so that the total log gradient norm, log ||δ^(0)||^2 = log ||δ^(D)||^2 + Σ_l log z_l, does not drift with depth. Because the W^(l)'s are random, each log z_l is a random step and the sum is a random walk; so we require the average step to vanish: E[log z_l] = 0. Using E[log(χ^2_n / n)] ≈ -1/n for large n (from E[log χ^2_n] = log 2 + ψ(n/2) and ψ(n/2) ≈ log(n/2) - 1/n), this gives 2 log g = 1/n, i.e. g = e^(1/2n). This is equivalent to initializing the weights with variance g^2/n = e^(1/n)/n, slightly larger than the naive 1/n.
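As a sanity check on this claim, here is a minimal Monte Carlo sketch (mine, not from the paper; the width n and sample count are arbitrary choices) showing that with variance e^(1/n)/n the average log step E[log z] is close to 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                     # layer width (arbitrary choice for the demo)
g = np.exp(1 / (2 * n))     # random-walk scale factor for a linear net
sigma = g / np.sqrt(n)      # entry std dev, i.e. variance g^2/n = e^(1/n)/n

# Monte Carlo estimate of E[log z] with z = ||W^T u||^2, u a random unit vector
log_z = []
for _ in range(5000):
    W = rng.normal(0.0, sigma, size=(n, n))
    u = rng.normal(size=n)
    u /= np.linalg.norm(u)              # unit backprop error direction
    log_z.append(np.log(np.sum((W.T @ u) ** 2)))

print(np.mean(log_z))  # ≈ 0: the log gradient norm is an unbiased random walk
```

With g = 1 instead, the mean comes out near -1/n per layer, which compounds to a factor of about e^(-D/n) over D layers.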
3 ReLU Random Walk Initialization
[Equivalent Form of ReLU Activations] On the backward pass, a ReLU layer is equivalent to zeroing out the rows of W corresponding to inactive units (roughly half of them) and leaving the other rows unchanged.
• I.e., set n - β rows of W to 0 and leave β rows with Gaussian entries, where β is the number of active units.
• β ~ Bin(n, 1/2).
• Then z_l ~ (g^2/n) χ^2_β: the mask keeps β of the n i.i.d. Gaussian components, so the squared magnitude is a χ^2 with a random, binomially distributed number of degrees of freedom.
[Optimal g] E[log z] = 2 log g - log n + E[log χ^2_β] has no closed form (β is random), so compute E[log χ^2_β] numerically and solve E[log z] = 0 for g. The paper's fit is g = sqrt(2) · exp(1.2 / (max(n, 6) - 2.4)): essentially the factor sqrt(2) that compensates for ReLU silencing half the units, plus a small finite-width correction.
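A minimal sketch of the resulting initializer (the function name and shapes are my own; the fitted constants are the ones quoted above):

```python
import numpy as np

def rwi_relu(n_out: int, n_in: int, rng=None) -> np.ndarray:
    """Random Walk Initialization for a ReLU layer (Sussillo & Abbott, 2014).

    Entries are N(0, g^2/n_in) with the paper's fitted scale
    g = sqrt(2) * exp(1.2 / (max(n_in, 6) - 2.4)).
    """
    rng = rng or np.random.default_rng()
    g = np.sqrt(2.0) * np.exp(1.2 / (max(n_in, 6) - 2.4))
    return rng.normal(0.0, g / np.sqrt(n_in), size=(n_out, n_in))

# Example: weights for a 200-layer, 100-unit-wide ReLU stack
weights = [rwi_relu(100, 100) for _ in range(200)]
```

For large n the correction term exp(1.2 / (n - 2.4)) tends to 1, recovering the familiar variance 2/n; the correction mainly matters for narrow layers.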