
Paper Notes | Going deeper with convolutions

2016-06-27 16:12

Authors

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich




3 Motivation and High Level Considerations

Bigger size (depth: the number of levels; width: the number of units at each level) has two main drawbacks:

1. A larger number of parameters: the enlarged network is more prone to overfitting, especially when labeled training data is limited (a major bottleneck).

2. Dramatically increased use of computational resources.

The fundamental way of solving both issues would be to ultimately move from fully connected to sparsely connected architectures, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al.:

Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013.


Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. This statement resonates with the well-known Hebbian principle (neurons that fire together, wire together) and suggests that the underlying idea is applicable in practice even under less strict conditions.

http://wenku.baidu.com/link?url=I2V5PaiYh5pziD8kwE6AYMnqQOenj08SwJx0_A1udOh9Vlsv6yGfR8otU3-Nw-oF0EMNG3MoOueaP8hOBFRxZKhpG0lFMKiBhC3afmU7uPC


4 Architectural Details

The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that the network will be built from convolutional building blocks. Each unit from an earlier layer can be assumed to correspond to some region of the input image, with these units grouped into filter banks. We can also expect a smaller number of more spatially spread-out clusters that can be covered by convolutions over larger patches, and a decreasing number of patches over larger and larger regions. For convenience, the Inception architecture is restricted to filter sizes 1x1, 3x3 and 5x5. Additionally, since pooling operations have been essential for the success of current state-of-the-art convolutional networks, a parallel pooling path is added in each stage.



One big problem with the above modules is that even a modest number of 5x5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters.

This leads to the second idea of the proposed architecture: judiciously applying dimension reductions, i.e. 1x1 convolutions that shrink the number of channels before the expensive 3x3 and 5x5 convolutions.
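To make the saving concrete, here is a back-of-the-envelope calculation (my own illustration; the 28x28 spatial size and the 192/16/32 channel counts are example values in the spirit of the paper's early Inception stages, not figures quoted from the paper):

```python
# Multiply-accumulate (MAC) counts for one 28x28 stage with 192 input channels.
H = W = 28
naive_5x5   = 5 * 5 * 192 * 32 * H * W   # direct 5x5 conv to 32 maps: ~120.4M MACs
reduce_1x1  = 1 * 1 * 192 * 16 * H * W   # 1x1 reduction to 16 maps:     ~2.4M MACs
reduced_5x5 = 5 * 5 * 16 * 32 * H * W    # 5x5 conv on the reduced maps: ~10.0M MACs
print(naive_5x5, reduce_1x1 + reduced_5x5)  # roughly 10x fewer operations
```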



Besides being used as reductions, the 1x1 convolutions also include the use of rectified linear activation, which makes them dual-purpose.
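As an illustration, below is a minimal PyTorch sketch of such an Inception module with dimension reductions (my own reconstruction, not the authors' code; the paper itself used DistBelief). The channel counts in the usage line follow the inception (3a) configuration from the paper:

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Inception module with dimension reductions: four parallel branches
    (1x1 conv; 1x1 reduce -> 3x3 conv; 1x1 reduce -> 5x5 conv;
    3x3 max-pool -> 1x1 projection) concatenated along the channel axis.
    Every convolution is followed by a rectified linear activation."""
    def __init__(self, in_ch, n1x1, n3x3red, n3x3, n5x5red, n5x5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, n1x1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, n3x3red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(n3x3red, n3x3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, n5x5red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(n5x5red, n5x5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # All branches preserve the spatial size, so their outputs can be
        # stacked along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# inception (3a): 192 channels in, 64 + 128 + 32 + 32 = 256 channels out.
m = Inception(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```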

For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers.

Beneficial aspects:

1. It allows for increasing the number of units at each stage without an uncontrolled blow-up in computational complexity;

2. It aligns with the intuition that visual information should be processed at various scales and then aggregated.



It was found that moving from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%; however, the use of dropout remained essential even after removing the fully connected layers.
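A minimal sketch of what such a pooling-based classifier head looks like (my own PyTorch illustration; the 1024 input channels match GoogLeNet's final stage, and 40% is the dropout ratio the paper uses before the main classifier):

```python
import torch.nn as nn

# Classifier head with global average pooling instead of fully connected
# layers; dropout is kept, since the notes say it remained essential.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # e.g. 7x7x1024 -> 1x1x1024 at GoogLeNet's top
    nn.Flatten(),
    nn.Dropout(p=0.4),
    nn.Linear(1024, 1000),    # linear layer; softmax is folded into the loss
)
```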

How to keep the ability to propagate gradients back through all the layers in an effective manner?

By adding auxiliary classifiers connected to these intermediate layers, we would expect to 1) encourage discrimination in the lower stages of the classifier, 2) increase the gradient signal that gets propagated back, and 3) provide additional regularization.

These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.

Structure of the auxiliary classifier networks (a code sketch follows the list):



1. An average pooling layer with 5x5 filter size and stride 3, resulting in a 4x4x512 output for the (4a) stage and 4x4x528 for the (4d) stage.

2. A 1x1 convolution with 128 filters for dimension reduction and rectified linear activation.

3. A fully connected layer with 1024 units and rectified linear activation.

4. A dropout layer with 70% ratio of dropped outputs.

5. A linear layer with softmax loss as the classifier.
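Putting the five steps together, a minimal PyTorch sketch of such an auxiliary head (my own reconstruction; the class name AuxHead is mine):

```python
import torch
import torch.nn as nn

class AuxHead(nn.Module):
    """Auxiliary classifier following steps 1-5 above.
    in_ch is 512 for the (4a) stage and 528 for the (4d) stage."""
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)  # step 1: 14x14 -> 4x4
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)   # step 2: reduction
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)            # step 3
        self.drop = nn.Dropout(p=0.7)                      # step 4: 70% dropout
        self.fc2 = nn.Linear(1024, num_classes)            # step 5: linear layer

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.relu(self.fc1(torch.flatten(x, 1)))
        return self.fc2(self.drop(x))

# During training the auxiliary losses are discounted by 0.3 (see above):
#   total_loss = main_loss + 0.3 * (aux_4a_loss + aux_4d_loss)
# At inference time the auxiliary heads are simply not evaluated.
```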

6 Training Methodology

Our networks were trained using the DistBelief distributed machine learning system.

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25, pages 1232–1240, 2012.


We found that the photometric distortions of Andrew Howard were useful to combat overfitting.

Andrew G. Howard. Some improvements on deep convolutional neural network based image classification. CoRR, abs/1312.5402, 2013.


7 ILSVRC 2014 Classification Challenge

An ensemble of 7 versions of the same GoogLeNet model was used; they differ only in sampling methodologies and the random order in which they see the input images.

A more aggressive cropping approach: each image is resized to 4 scales (shorter side 256, 288, 320, 352); 3 squares (left, center, right, or top, center, bottom) are taken at each scale; each square yields 6 crops (the 4 corners and the center at 224x224, plus the square itself resized to 224x224) together with their mirrored versions, for 4 x 3 x 6 x 2 = 144 crops per image.

The softmax probabilities are then simply averaged over all crops and over the individual classifiers to obtain the final prediction.
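A sketch of how the 144 crops can be enumerated (my own illustration using PIL, which the paper does not mention; the helper name make_144_crops is mine):

```python
from PIL import Image, ImageOps

def make_144_crops(img, out=224):
    """Enumerate the crops described above: 4 scales x 3 squares x 6 crops
    x 2 mirrors = 144. img is a PIL image; returns a list of out x out crops."""
    crops = []
    for scale in (256, 288, 320, 352):
        w, h = img.size
        # Resize so the shorter side equals the current scale.
        if w <= h:
            r = img.resize((scale, round(h * scale / w)))
        else:
            r = img.resize((round(w * scale / h), scale))
        W, H = r.size
        side, excess = min(W, H), max(W, H) - min(W, H)
        for off in (0, excess // 2, excess):  # left/center/right (or top/center/bottom)
            square = (r.crop((off, 0, off + side, side)) if W >= H
                      else r.crop((0, off, side, off + side)))
            s = side - out
            boxes = [(0, 0), (s, 0), (0, s), (s, s), (s // 2, s // 2)]
            subs = [square.crop((x, y, x + out, y + out)) for x, y in boxes]
            subs.append(square.resize((out, out)))  # the square itself, resized
            for c in subs:
                crops.extend([c, ImageOps.mirror(c)])
    return crops  # softmax outputs are then averaged over all of these
```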

8 ILSVRC 2014 Detection Challenge

GoogLeNet's improvement through ensembling is obvious, but a single Deep Insight model is more powerful than a single GoogLeNet.



9 Conclusion

Our results seem to yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.
