[Deep Learning Paper Notes][Weight Initialization] Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
2016-09-20 09:58
Saxe, Andrew M., James L. McClelland, and Surya Ganguli. “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.” arXiv preprint arXiv:1312.6120 (2013). [Citations: 97].
1 General Learning Dynamics of Gradient Descent
[Timescale of Learning]
• The learning time of a deep network depends on the optimal (largest stable) learning rate.
• The optimal learning rate can be estimated as the inverse of the maximal eigenvalue of the Hessian over the region of interest (see the sketch after this list).
• The optimal learning rate scales as O(1/L), where L is the number of layers.
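A minimal sketch of the second bullet, not code from the paper: the toy quadratic loss, the matrix A, and the iteration count are assumptions made for illustration. For a real network the Hessian-vector product would be computed with automatic differentiation rather than an explicit matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
M = rng.standard_normal((d, d))
A = M @ M.T / d  # PSD Hessian of the toy quadratic loss L(w) = 0.5 * w^T A w

def hessian_vector_product(v):
    # For this toy quadratic the Hessian-vector product is simply A @ v;
    # a real network would use autodiff (double backprop) instead.
    return A @ v

# Power iteration: repeated Hessian-vector products converge to the top eigenvector.
v = rng.standard_normal(d)
for _ in range(200):
    v = hessian_vector_product(v)
    v /= np.linalg.norm(v)
lambda_max = v @ hessian_vector_product(v)

eta = 1.0 / lambda_max  # estimate of the largest stable learning rate
print(f"lambda_max ~ {lambda_max:.3f}  =>  learning rate ~ {eta:.3f}")
```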
2 Finding Good Weight Initializations
[Motivations] Unsupervised pretraining speeds up optimization and acts as a special regularizer towards solutions with better generalization performance.
• Unsupervised pretraining finds a special class of orthogonalized, decoupled initial conditions.
• These initial conditions allow rapid supervised learning, since the network does not need to adapt its principal directions, only the strength of each layer.
[Idea] Use random orthogonal weight matrices (W^T W = I).
• Preserving the statistics of activations across layers implies faster learning.
• Gaussian matrices are almost guaranteed to have many small singular values, so many vectors propagating forward or backward through the network are severely attenuated, which hinders learning.
• Xavier initialization preserves the norm of a random vector only on average.
• Orthogonal initialization preserves the norm of every vector exactly (a minimal sketch follows below).
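A minimal sketch of the contrast in the last two bullets, assuming an n×n layer; the size and test vector are illustrative, not from the paper. A random orthogonal matrix is drawn here as the Q factor of the QR decomposition of a Gaussian matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Random orthogonal matrix: the Q factor of a Gaussian matrix satisfies Q^T Q = I.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Xavier-style Gaussian matrix: variance 1/n, so norms are preserved only on average.
W = rng.standard_normal((n, n)) / np.sqrt(n)

x = rng.standard_normal(n)
print("||x||   :", np.linalg.norm(x))
print("||Q x|| :", np.linalg.norm(Q @ x))  # exactly ||x||
print("||W x|| :", np.linalg.norm(W @ x))  # only approximately ||x||
```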
[Nonlinear Case] A good initialization is one for which the singular values of the input-output Jacobian J = ∂a/∂x are concentrated around 1 (see the sketch below).
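For the linear case this criterion can be checked directly, since the end-to-end Jacobian of a deep linear network is simply the product of its weight matrices. The sketch below uses an illustrative width and depth (not from the paper) to compare orthogonal and Gaussian initializations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 100, 20  # illustrative width and depth

def orthogonal(n):
    # Q factor of a Gaussian matrix is orthogonal.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def end_to_end_jacobian(sample_layer):
    # For a deep linear network the input-output Jacobian is the product of weights.
    J = np.eye(n)
    for _ in range(L):
        J = sample_layer() @ J
    return J

J_orth = end_to_end_jacobian(lambda: orthogonal(n))
J_gauss = end_to_end_jacobian(lambda: rng.standard_normal((n, n)) / np.sqrt(n))

sv_orth = np.linalg.svd(J_orth, compute_uv=False)
sv_gauss = np.linalg.svd(J_gauss, compute_uv=False)
print("orthogonal init: min/max singular value =", sv_orth.min(), sv_orth.max())
print("Gaussian init  : min/max singular value =", sv_gauss.min(), sv_gauss.max())
```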
[Deep networks + large weights] Such networks train exceptionally quickly.
• However, large weights incur a heavy cost in generalization performance.
• Small initial weights regularize towards smoother functions.
• The difficulty of training arises from saddle points, not from local minima.
3 References
[1]. Pillow Lab Blog. https://pillowlab.wordpress.com/2015/10/04/exact-solutions-to-the-nonlinear-dynamics-of-learning-in-deep-linear-neural-netw
[2]. ICLR 2014 Talk. https://www.youtube.com/watch?v=Ap7atx-Ki3Q.