[Deep Learning Paper Notes][Image Classification] Very Deep Convolutional Networks for Large-Scale Image Recognition
2016-09-25 16:07
Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014). [Citations: 1986].
1 Motivation
[Ways to Improve Accuracy]
• Use small filter sizes and small strides in the first conv layers.
• Train and test the network densely over the whole image and over multiple scales.
• Increase the depth of the network.
2 Architecture
[In a Nutshell (138M Parameters)]
• Input (3 × 224 × 224).
• conv1-1 (64@3 × 3, s1, p1), relu1-1.
• conv1-2 (64@3 × 3, s1, p1), relu1-2.
• pool1 (2 × 2, s2), output 64 × 112 × 112.
• conv2-1 (128@3 × 3, s1, p1), relu2-1.
• conv2-2 (128@3 × 3, s1, p1), relu2-2.
• pool2 (2 × 2, s2), output 128 × 56 × 56.
• conv3-1 (256@3 × 3, s1, p1), relu3-1.
• conv3-2 (256@3 × 3, s1, p1), relu3-2.
• conv3-3 (256@3 × 3, s1, p1), relu3-3.
• pool3 (2 × 2, s2), output 256 × 28 × 28.
• conv4-1 (512@3 × 3, s1, p1), relu4-1.
• conv4-2 (512@3 × 3, s1, p1), relu4-2.
• conv4-3 (512@3 × 3, s1, p1), relu4-3.
• pool4 (2 × 2, s2), output 512 × 14 × 14.
• conv5-1 (512@3 × 3, s1, p1), relu5-1.
• conv5-2 (512@3 × 3, s1, p1), relu5-2.
• conv5-3 (512@3 × 3, s1, p1), relu5-3.
• pool5 (2 × 2, s2), output 512 × 7 × 7 = 25088.
• fc6 (4096), relu6, drop6.
• fc7 (4096), relu7, drop7.
• fc8 (1000).
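The 138M figure can be verified by walking the layer list above. A minimal Python sketch (the helper name `vgg16_param_count` is my own; the configuration is exactly the one listed):

```python
# Count VGG-16 parameters (weights + biases), tracking the spatial
# size, which is halved by each 2x2/s2 max pool: 224 -> 112 -> ... -> 7.
CONV_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_WIDTHS = [4096, 4096, 1000]

def vgg16_param_count(in_ch=3, spatial=224):
    total = 0
    for n_convs, out_ch in CONV_BLOCKS:
        for _ in range(n_convs):
            total += in_ch * out_ch * 3 * 3 + out_ch  # 3x3 kernels + biases
            in_ch = out_ch
        spatial //= 2  # max pool after each block
    features = in_ch * spatial * spatial  # 512 * 7 * 7 = 25088
    for width in FC_WIDTHS:
        total += features * width + width
        features = width
    return total

print(vgg16_param_count())  # 138357544, i.e. ~138M
```

Roughly 124M of the 138M parameters sit in the three fc layers.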
[Data Preparation (Training)] Fixed-scale policy.
• Rescale each image so that the shorter side has length 256 or 384.
• Subtract the mean RGB value, computed over the training set, from each pixel.
Multi-scale policy.
• Rescale each image so that the shorter side has length S sampled uniformly from [256, 512].
• Subtract the mean RGB value, computed over the training set, from each pixel.
• Since objects in images can be of different sizes, it is beneficial to take this into account.
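The multi-scale rescaling step can be sketched as follows (a minimal illustration; `jittered_resize` is a hypothetical helper, and mean subtraction is omitted):

```python
import random

def jittered_resize(width, height, s_min=256, s_max=512, rng=random):
    """Sample S uniformly from [s_min, s_max], then rescale so that the
    shorter side equals S while (approximately) keeping the aspect ratio."""
    s = rng.randint(s_min, s_max)
    if width <= height:
        return s, round(height * s / width)
    return round(width * s / height), s

print(jittered_resize(640, 480, rng=random.Random(0)))
```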
[Data Preparation (Testing)] Fixed-scale evaluation: rescale each image so that the shorter side has length
• 256 or 384 for the fixed-scale policy.
• ½(256 + 512) = 384 for the multi-scale policy.
Multi-scale evaluation: rescale each image so that the shorter side has length
• {S − 32, S, S + 32} for the fixed-scale policy, and average the resulting scores.
• {256, ½(256 + 512) = 384, 512} for the multi-scale policy, and average the resulting scores.
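The two evaluation-scale sets can be written down explicitly (a small sketch; `eval_scales` is an illustrative helper, with S denoting the fixed training scale):

```python
def eval_scales(policy, s=None):
    """Shorter-side lengths used at multi-scale test time."""
    if policy == "fixed":        # trained at a single scale S
        return [s - 32, s, s + 32]
    if policy == "multi":        # trained with S ~ U[256, 512]
        return [256, (256 + 512) // 2, 512]
    raise ValueError(policy)

print(eval_scales("fixed", 384))  # [352, 384, 416]
print(eval_scales("multi"))       # [256, 384, 512]
```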
[Data Augmentation (Training)]
• Random crop.
• Horizontal flips.
• Color jittering.
[Data Augmentation (Testing)]
• Convert the fc layers to fully convolutional layers, and spatially average (sum-pool) the final class score maps.
• Horizontally flip the images and average the resulting scores.
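The fc-to-conv conversion relies on the fact that a fully-connected layer applied to a C × H × W map is a dot product with the flattened map, i.e. an H × W convolution evaluated at a single position (this is how fc6, 4096 × 25088, becomes 4096 filters of size 512 × 7 × 7). A toy pure-Python check with shapes shrunk to 1 × 2 × 2 (both helpers are illustrative):

```python
def fc_forward(weights, flat_input):    # weights: [out][C*H*W]
    return [sum(w * x for w, x in zip(row, flat_input)) for row in weights]

def conv_at_origin(kernels, fmap):      # kernels: [out][C][H][W]
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [sum(k[c][i][j] * fmap[c][i][j]
                for c in range(C) for i in range(H) for j in range(W))
            for k in kernels]

fmap = [[[1.0, 2.0], [3.0, 4.0]]]        # a 1 x 2 x 2 feature map
flat = [1.0, 2.0, 3.0, 4.0]              # the same map, flattened
w_fc = [[0.1, 0.2, 0.3, 0.4]]            # fc layer with one output unit
w_conv = [[[[0.1, 0.2], [0.3, 0.4]]]]    # the same weights, reshaped
print(fc_forward(w_fc, flat), conv_at_origin(w_conv, fmap))
```

Both calls produce the same value, since they sum the same products in the same order.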
[Why 3 × 3 conv?] Stacked conv layers have a large receptive field.
• Two 3 × 3 layers — 5 × 5 receptive field.
• Three 3 × 3 layers — 7 × 7 receptive field.
• Moreover, stacked 3 × 3 layers incorporate more non-linearities, which makes the decision function more discriminative.
Fewer parameters.
• E.g., suppose both the input and output have size D × H × W.
• A single 7 × 7 layer has D² × 7 × 7 = 49D² parameters.
• Three 3 × 3 layers have 3 × (D² × 3 × 3) = 27D² parameters.
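The parameter arithmetic above checks out numerically, e.g. for D = 512 channels (weights only, biases ignored; `conv_params` is an illustrative helper):

```python
def conv_params(depth, kernel):
    # square kernel, input channels == output channels == depth
    return depth * depth * kernel * kernel

D = 512
print(conv_params(D, 7))      # 49 * D^2 = 12845056
print(3 * conv_params(D, 3))  # 27 * D^2 = 7077888
```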
3 Training Details
SGD with momentum of 0.9.
• Batch size 256.
• Weight decay 0.0005.
• Initialize the first 4 conv layers and the 3 fc layers from the pre-trained 11-layer model (configuration A).
• Other weights are initialized from N(0, 0.1²); biases are initialized to 0.
• Base learning rate is 0.01.
• Train for 74 epochs.
• Divide the learning rate by 10 whenever the validation error plateaus (3 times in total).
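The plateau-driven schedule can be sketched as a simple rule (a rough approximation for illustration only; `plateau_schedule` and the error sequence are made up, and the paper's exact criterion may differ):

```python
def plateau_schedule(val_errors, base_lr=0.01, factor=0.1, max_drops=3):
    """Divide the learning rate by 10 whenever the validation error
    fails to improve on the best value seen so far (at most 3 drops)."""
    lr, best, drops, history = base_lr, float("inf"), 0, []
    for err in val_errors:
        if err >= best and drops < max_drops:
            lr *= factor
            drops += 1
        best = min(best, err)
        history.append(lr)
    return history

print(plateau_schedule([0.40, 0.30, 0.30, 0.25, 0.25, 0.24]))
```

With the sequence above the rate falls 0.01 → 0.001 → 1e-4 at the two plateaus.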
4 Results
Second place in the ILSVRC-2014 classification task, by top-5 error:
• 1 CNN: 7.0%.
• 7 CNNs: 7.3%.
• 2 best CNNs: 6.8%.
5 Analysis
[Multi-Crop Evaluation is Complementary to Fully Convolutional Evaluation]
• When applying a CNN to a crop, the convolved feature maps are padded with zeros.
• In the case of fully convolutional evaluation, the padding for the same crop naturally comes from the neighbouring parts of the image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.
[LRN Does Not Improve Accuracy]
[Deeper is Better]
Also, a 3 × 3 conv layer is better than the corresponding 1 × 1 conv layer.
[Multi-Scale]
• The multi-scale training policy is better than the fixed-scale policy.
• Multi-scale evaluation is better than fixed-scale evaluation.
[Ensemble of Two Best-Performing Models is Better than Ensemble of All Models]
6 References
[1]. ILSVRC2014 Talk. https://www.youtube.com/watch?v=j1jIoHN3m0s.