
[Deep Learning Paper Notes][Image Classification] Very Deep Convolutional Networks for Large-Scale Image Recognition

Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014). [Citations: 1986].

1 Motivation

[Ways to Improve Accuracy]

• Use a small filter size and a small stride in the first conv layer.

• Train and test the network densely over the whole image and over multiple scales.

• Increase the depth of the network.

2 Architecture

[In a Nutshell (138M Parameters)] A code sketch of this stack follows the list.

• Input (3 × 224 × 224).

• conv1-1 (64@3 × 3, s1, p1), relu1-1.

• conv1-2 (64@3 × 3, s1, p1), relu1-2.

• pool1 (2 × 2, s2), output 64 × 112 × 112.

• conv2-1 (128@3 × 3, s1, p1), relu2-1.

• conv2-2 (128@3 × 3, s1, p1), relu2-2.

• pool2 (2 × 2, s2), output 128 × 56 × 56.

• conv3-1 (256@3 × 3, s1, p1), relu3-1.

• conv3-2 (256@3 × 3, s1, p1), relu3-2.

• conv3-3 (256@3 × 3, s1, p1), relu3-3.

• pool3 (2 × 2, s2), output 256 × 28 × 28.

• conv4-1 (512@3 × 3, s1, p1), relu4-1.

• conv4-2 (512@3 × 3, s1, p1), relu4-2.

• conv4-3 (512@3 × 3, s1, p1), relu4-3.

• pool4 (2 × 2, s2), output 512 × 14 × 14.

• conv5-1 (512@3 × 3, s1, p1), relu5-1.

• conv5-2 (512@3 × 3, s1, p1), relu5-2.

• conv5-3 (512@3 × 3, s1, p1), relu5-3.

• pool5 (2 × 2, s2), output 512 × 7 × 7 = 25088.

• fc6 (4096), relu6, drop6.

• fc7 (4096), relu7, drop7.

• fc8 (1000).
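
The layer list above maps almost line-by-line onto code. Below is a minimal sketch of this 16-layer configuration, assuming PyTorch is available; it is an illustration of the listed stack, not the authors' original implementation.

import torch
import torch.nn as nn

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()

        def block(in_c, out_c, n_convs):
            # n_convs (3 x 3, s1, p1) conv + ReLU pairs, followed by 2 x 2 max pooling, s2
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv2d(in_c if i == 0 else out_c, out_c, 3, stride=1, padding=1),
                           nn.ReLU(inplace=True)]
            layers.append(nn.MaxPool2d(2, stride=2))
            return layers

        self.features = nn.Sequential(
            *block(3, 64, 2),       # conv1-1, conv1-2, pool1 -> 64 x 112 x 112
            *block(64, 128, 2),     # conv2-1, conv2-2, pool2 -> 128 x 56 x 56
            *block(128, 256, 3),    # conv3-1..3, pool3       -> 256 x 28 x 28
            *block(256, 512, 3),    # conv4-1..3, pool4       -> 512 x 14 x 14
            *block(512, 512, 3),    # conv5-1..3, pool5       -> 512 x 7 x 7
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),  # fc6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),          # fc7
            nn.Linear(4096, num_classes),                                            # fc8
        )

    def forward(self, x):
        x = self.features(x)        # (N, 512, 7, 7)
        x = torch.flatten(x, 1)     # (N, 25088)
        return self.classifier(x)

# Sanity check: total parameter count is 138,357,544 (~138M).
model = VGG16()
print(sum(p.numel() for p in model.parameters()))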

[Data Preparation (Training)] Fixed scale policy.

• Rescale each image such that the shorter side has length 256 or 384.

• Subtract the mean RGB value, computed over the training set, from each pixel.

Multi-scale policy.

• Rescale each image such that the shorter side has length S, with S sampled uniformly from [256, 512].

• Subtract the mean RGB value, computed over the training set, from each pixel.

• Since objects in images can appear at different sizes, training over a range of scales takes this into account (a sketch of the rescaling step follows this list).
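
A minimal sketch of the multi-scale rescaling step, assuming PIL is available; the subsequent mean subtraction would use an RGB mean precomputed over the training set (not shown here).

import random
from PIL import Image

def rescale_shorter_side(img, s_min=256, s_max=512):
    # Multi-scale policy: sample the training scale S uniformly from [s_min, s_max]
    # and isotropically rescale so that the shorter side has length S.
    s = random.randint(s_min, s_max)
    w, h = img.size                            # PIL uses (width, height)
    if w <= h:
        new_size = (s, round(h * s / w))
    else:
        new_size = (round(w * s / h), s)
    return img.resize(new_size, Image.BILINEAR)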

[Data Preparation (Testing)] Fixed scale evaluation: Rescale each image such that the shorter side has length

• 256 or 384 for fixed scale policy.

• (256 + 512) / 2 = 384 for the multi-scale policy.

Multi-Scale evaluation. Rescale each image such that the shorter side has length

• {S − 32, S, S + 32} for the fixed scale policy (where S is the training scale), and average the resulting scores.

• {256, (256 + 512) / 2, 512} = {256, 384, 512} for the multi-scale policy, and average the resulting scores (a sketch of this averaging follows this list).
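
A sketch of multi-scale evaluation for the multi-scale policy, assuming a PyTorch model that already performs the fully convolutional forward pass; preprocess is a hypothetical helper that rescales the shorter side to Q and subtracts the mean RGB value.

import torch

@torch.no_grad()
def multiscale_scores(model, img, scales=(256, 384, 512)):
    # Rescale the image to each test scale Q, run the network, and average
    # the class posteriors over the scales.
    scores = []
    for q in scales:
        x = preprocess(img, shorter_side=q)                 # (3, H, W) tensor
        scores.append(torch.softmax(model(x.unsqueeze(0)), dim=1))
    return torch.stack(scores).mean(dim=0)                  # (1, 1000) averaged scores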

[Data Augmentation (Training)]

• Random crop.

• Horizontal flips.

• Color jittering.

[Data Augmentation (Testing)]

• Convert the fc layers to convolutional layers (making the network fully convolutional), and spatially average the resulting class score maps (see the conversion sketch after this list).
• Horizontally flip the images and average the final scores.
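
A sketch of the fc-to-conv conversion, assuming the VGG16 sketch given earlier; this illustrates the idea, not the authors' implementation.

import torch.nn as nn

def fc_to_conv(vgg):
    # Reinterpret fc6/fc7/fc8 as conv layers so that the network accepts inputs
    # larger than 224 x 224 and produces a spatial class score map.
    fc6, fc7, fc8 = vgg.classifier[0], vgg.classifier[3], vgg.classifier[6]
    conv6 = nn.Conv2d(512, 4096, kernel_size=7)     # fc6 -> 7 x 7 conv
    conv7 = nn.Conv2d(4096, 4096, kernel_size=1)    # fc7 -> 1 x 1 conv
    conv8 = nn.Conv2d(4096, 1000, kernel_size=1)    # fc8 -> 1 x 1 conv
    for conv, fc in ((conv6, fc6), (conv7, fc7), (conv8, fc8)):
        conv.weight.data.copy_(fc.weight.data.view(conv.weight.shape))
        conv.bias.data.copy_(fc.bias.data)
    return nn.Sequential(vgg.features,
                         conv6, nn.ReLU(inplace=True),
                         conv7, nn.ReLU(inplace=True),
                         conv8,
                         nn.AdaptiveAvgPool2d(1),   # spatially average the score map
                         nn.Flatten())              # output shape (N, 1000)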

[Why 3 × 3 Conv?] A stack of 3 × 3 conv layers has the same effective receptive field as a single larger filter.

• Two 3 × 3 layers — 5 × 5 receptive field.

• Three 3 × 3 layers — 7 × 7 receptive field.

• But stacked 3 × 3 layers contain more non-linearities, which makes the decision function more discriminative.

Fewer parameters.

• E.g., suppose both the input and the output have size D × H × W.

• A single 7 × 7 layer has D² × 7 × 7 = 49D² parameters.

• Three 3 × 3 layers have 3 × (D² × 3 × 3) = 27D² parameters (a quick numerical check follows this list).
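
For concreteness, a quick check of the counts above (weights only, biases ignored), using an example channel depth D = 256:

D = 256
single_7x7 = D * D * 7 * 7          # 49 * D^2 = 3,211,264 weights
three_3x3 = 3 * (D * D * 3 * 3)     # 27 * D^2 = 1,769,472 weights, roughly 45% fewer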

3 Training Details

SGD with momentum of 0.9.

• Batch size 256.

• Weight decay 0.0005.

• Initialize the first 4 conv layers and the 3 fc layers from the pre-trained 11-layer model (configuration A).

• Other weights are initialized from N(0, 0.1²); biases are initialized to 0.

• Base learning rate is 0.01.

• Training for 74 epochs.

• Divide the learning rate by 10 when the validation error plateaus (this happened 3 times); a sketch of this setup follows the list.
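
A sketch of the optimisation setup above, assuming PyTorch and that model, train_loader (batch size 256), and validation_error() are defined elsewhere (hypothetical names); the plateau schedule is approximated with ReduceLROnPlateau.

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

for epoch in range(74):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step(validation_error())   # divide the learning rate by 10 on plateau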

4 Results

Second place in the ILSVRC-2014 classification task, with top-5 test error:

• 1 CNN: 7.0%.

• 7 CNNs: 7.3%.

• 2 best CNNs: 6.8%.

5 Analysis

[Multi-Crop Evaluation is Complementary to Fully Convolutional Evaluation]

• When applying a CNN to a crop, the convolved feature maps are padded with zeros.

• In the case of fully convolutional evaluation, the padding for the same crop naturally comes from the neighbouring parts of the image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.

[LRN Does Not Improve Accuracy]

[Deeper is Better]

A 3 × 3 conv layer is also better than the corresponding 1 × 1 conv layer.

[Multi-Scale]

• The multi-scale policy in training is better than the fixed scale policy.
• Multi-scale evaluation is better than fixed scale evaluation.

[Ensemble of Two Best-Performing Models is Better than Ensemble of All Models]
