
Must-Know Tricks in Deep Learning (Part 1)

Deep Neural Networks, especially Convolutional Neural Networks (CNNs), allow computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in visual object recognition, object detection, text recognition and many other domains such as drug discovery and genomics.

In addition, many solid papers have been published on this topic, and some high-quality open-source CNN software packages have been made available. There are also well-written CNN tutorials and CNN software manuals. However, there might be no recent and comprehensive summary of the details of how to implement an excellent deep convolutional neural network from scratch. Thus, we collected and summarized many implementation details for DCNNs. Here we will introduce these implementation details, i.e., tricks or tips, for building and training your own deep networks.

Introduction

We assume you already have basic knowledge of deep learning; here we will present the implementation details (tricks or tips) of Deep Neural Networks, especially CNNs for image-related tasks, mainly in eight aspects: 1) data augmentation; 2) pre-processing on images; 3) initializations of networks; 4) some tips during training; 5) selections of activation functions; 6) diverse regularizations; 7) some insights found from figures; and finally 8) methods of ensembling multiple deep networks.

Additionally, the corresponding slides are available at http://lamda.nju.edu.cn/weixs/slide/CNNTricks_slide.pdf. If there are any problems/mistakes in these materials and slides, or there is something important/interesting you think should be added, feel free to contact me (http://lamda.nju.edu.cn/weixs/?AspxAutoDetectCookieSupport=1).

Sec. 1: Data Augmentation

Since deep networks need to be trained on a huge number of training images to achieve satisfactory performance, if the original image data set contains only a limited number of training images, it is better to do data augmentation to boost performance. Indeed, data augmentation has become a must when training a deep network.

There are many ways to do data augmentation, such as the popular horizontal flipping, random crops and color jittering. Moreover, you could try combinations of multiple different processing methods, e.g., doing rotation and random scaling at the same time. In addition, you can try to raise the saturation and value (the S and V components of the HSV color space) of all pixels to a power between 0.25 and 4 (the same power for all pixels within a patch), multiply these values by a factor between 0.7 and 1.4, and add to them a value between -0.1 and 0.1. Also, you could add a value between -0.1 and 0.1 to the hue (the H component of HSV) of all pixels in the image/patch.
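As an illustration, a possible NumPy/OpenCV implementation of this HSV jittering could look like the following sketch (the helper name and the choice of OpenCV are our own, and drawing one exponent per channel is just one reading of the description above):

```python
import cv2
import numpy as np

def hsv_jitter(img_bgr):
    """Randomly jitter the S, V and H channels of a BGR uint8 image."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)

    def jitter(channel):  # channel scaled to [0, 1]
        power = np.random.uniform(0.25, 4.0)   # same exponent for every pixel
        scale = np.random.uniform(0.7, 1.4)
        shift = np.random.uniform(-0.1, 0.1)
        return np.clip(channel ** power * scale + shift, 0.0, 1.0)

    hsv[..., 1] = jitter(hsv[..., 1] / 255.0) * 255.0   # saturation
    hsv[..., 2] = jitter(hsv[..., 2] / 255.0) * 255.0   # value
    # shift the hue by a value in [-0.1, 0.1] (OpenCV stores H in [0, 180))
    hsv[..., 0] = (hsv[..., 0] + np.random.uniform(-0.1, 0.1) * 180.0) % 180.0

    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```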
Krizhevsky et al. [1] proposed fancy PCA when training the famous AlexNet in 2012. Fancy PCA alters the intensities of the RGB channels in training images. In practice, you first perform PCA on the set of RGB pixel values throughout your training images. Then, for each training image, you add the following quantity to each RGB image pixel $I_{xy} = [I_{xy}^R, I_{xy}^G, I_{xy}^B]^T$:

$$[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3]\,[\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$$

where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Please note that each $\alpha_i$ is drawn only once for all the pixels of a particular training image until that image is used for training again; that is, when the model meets the same training image again, it will randomly draw another $\alpha_i$ for data augmentation. In [1], they claimed that fancy PCA "could approximately capture an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination". In the ImageNet 2012 competition, this scheme reduced the top-1 error rate by over 1%.
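A minimal NumPy sketch under these definitions (the array shapes and names are illustrative placeholders, not code from [1]):

```python
import numpy as np

# `images`: an (N, H, W, 3) float array of RGB training images (placeholder here)
images = np.random.rand(100, 32, 32, 3)

pixels = images.reshape(-1, 3)
pixels = pixels - pixels.mean(axis=0)                            # zero-center RGB
eigvals, eigvecs = np.linalg.eigh(np.cov(pixels, rowvar=False))  # lambda_i and p_i

def fancy_pca(img, sigma=0.1):
    """Add [p1, p2, p3][a1*l1, a2*l2, a3*l3]^T to every pixel of one image."""
    alphas = np.random.normal(0.0, sigma, 3)  # drawn once per image per use
    delta = eigvecs @ (alphas * eigvals)      # one offset per RGB channel
    return img + delta                        # broadcast over all H x W pixels
```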

Sec. 2: Pre-Processing

Now we have obtained a large number of training samples (images/crops), but please do not hurry! Actually, it is necessary to do pre-processing on these images/crops. In this section, we will introduce several approaches for pre-processing.

The first and simplest pre-processing approach is to zero-center the data and then normalize it, which can be sketched in two lines of NumPy as follows:
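```python
import numpy as np

# X is assumed to be the NumIns x NumDim data matrix described below.
X -= np.mean(X, axis=0)  # zero-center every dimension
X /= np.std(X, axis=0)   # normalize every dimension to unit standard deviation
```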

where X is the input data (NumIns×NumDim). Another form of this pre-processing normalizes each dimension so that the min and max along the dimension are -1 and 1, respectively. It only makes sense to apply this pre-processing if you have a reason to believe that different input features have different scales (or units) but should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (and in the range from 0 to 255), so it is not strictly necessary to perform this additional pre-processing step.

    Another pre-processing approach similar to the first one is PCA Whitening. In this process, the data is first centered as described above. Then, you can compute the covariance matrix that tells us about the correlation structure in the data:
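A NumPy sketch of this step (X is again the NumIns×NumDim data matrix, assumed to be already zero-centered):

```python
import numpy as np

cov = np.dot(X.T, X) / X.shape[0]  # NumDim x NumDim covariance matrix of the data
```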


    After that, you decorrelate the data by projecting the original (but zero-centered) data into the eigenbasis:
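Continuing the sketch, the eigenbasis can be obtained from an SVD of the covariance matrix:

```python
U, S, V = np.linalg.svd(cov)  # columns of U are eigenvectors, S the eigenvalues
Xrot = np.dot(X, U)           # decorrelate: project the data into the eigenbasis
```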

The last transformation is whitening, which takes the data in the eigenbasis and divides every dimension by the square root of its eigenvalue to normalize the scale:
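In the running NumPy sketch:

```python
Xwhite = Xrot / np.sqrt(S + 1e-5)  # divide each dimension by sqrt of its eigenvalue
```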


Note that a small constant (here 1e-5) is added to prevent division by zero. One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input. In practice this can be mitigated by stronger smoothing (i.e., increasing 1e-5 to a larger number).

Please note that we describe these pre-processing methods here just for completeness. In practice, these transformations are not used with Convolutional Neural Networks. However, it is still very important to zero-center the data, and it is common to see normalization of every pixel as well.

Sec. 3: Initializations

Now the data is ready. However, before you begin to train the network, you have to initialize its parameters.

1) All Zero Initialization

In the ideal situation, with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative. A reasonable-sounding idea then might be to set all the initial weights to zero, which you expect to be the "best guess" in expectation. But this turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during back-propagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.

2) Initialization with Small Random Numbers

Thus, you still want the weights to be very close to zero, but not identically zero. To achieve this, you can initialize the neurons' weights to small random numbers very close to zero, which is referred to as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network. The implementation for the weights might simply look like $0.001 \times N(0, 1)$, where $N(0, 1)$ is a zero-mean, unit standard deviation Gaussian. It is also possible to use small numbers drawn from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.
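For one fully connected layer this could look like the following NumPy sketch (the layer sizes are illustrative):

```python
import numpy as np

n_in, n_out = 256, 128                    # illustrative layer sizes
W = 0.001 * np.random.randn(n_in, n_out)  # small zero-mean Gaussian weights
b = np.zeros(n_out)                       # biases can simply be started at zero
```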

3) Calibrating the Variances

One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that you can normalize the variance of each neuron's output to 1 by scaling its weight vector by the square root of its fan-in (i.e., its number of inputs), as follows:
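A NumPy sketch of this initialization (n = 512 is just an illustrative fan-in):

```python
import numpy as np

n = 512                              # fan-in of the neuron (illustrative value)
w = np.random.randn(n) / np.sqrt(n)  # unit-Gaussian weights scaled by 1/sqrt(n)
```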

where "randn" is the aforementioned Gaussian and "n" is the number of its inputs. This ensures that all neurons in the network initially have approximately the same output distribution and empirically improves the rate of convergence. The detailed derivation can be found on pages 18 to 23 of the slides. Please note that the derivation does not consider the influence of ReLU neurons.

4) Current Recommendation

As mentioned above, the previous initialization that calibrates the variances of neurons does not take ReLUs into account. A more recent paper on this topic by He et al. [4] derives an initialization specifically for ReLUs, reaching the conclusion that the variance of neurons in the network should be $2.0/n$, as:
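A NumPy sketch of this recommendation (n again denotes the fan-in, with an illustrative value):

```python
import numpy as np

n = 512                                    # fan-in of the layer (illustrative value)
w = np.random.randn(n) * np.sqrt(2.0 / n)  # "He" initialization for ReLU neurons
```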


Sec. 4: During Training

    Now, everything is ready. Let’s start to train deep networks!

Filters and pooling size. During training, the size of the input images is preferred to be a power of 2, such as 32 (e.g., CIFAR-10), 64, 224 (e.g., the commonly used ImageNet size), 384 or 512, etc. Moreover, it is important to employ small filters (e.g., 3×3) and small strides (e.g., 1) with zero-padding, which not only reduces the number of parameters but also improves the accuracy of the whole deep network. Meanwhile, a special case mentioned above, i.e., 3×3 filters with stride 1 and zero-padding of 1, preserves the spatial size of images/feature maps: using the output-size formula (W - F + 2P)/S + 1, a 32×32 input gives (32 - 3 + 2)/1 + 1 = 32. For the pooling layers, the commonly used pooling size is 2×2.

Learning rate. In addition, as described in a blog post by Ilya Sutskever [2], he recommended dividing the gradients by the mini-batch size; thus, you should not always change the learning rate (LR) when you change the mini-batch size. For obtaining an appropriate LR, utilizing the validation set is an effective way. Usually, a typical value of the LR at the beginning of training is 0.1. In practice, if you see that you have stopped making progress on the validation set, divide the LR by 2 (or by 5) and keep going, which might give you a pleasant surprise.
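As a concrete illustration of this heuristic (our own sketch, not code from the article), the schedule can be driven by whatever validation metric you track per epoch:

```python
def step_lr(lr, val_history, patience=3, factor=2.0):
    """Divide the learning rate by `factor` when the validation metric has not
    improved over the last `patience` epochs (an illustrative heuristic only)."""
    if len(val_history) > patience and \
            max(val_history[-patience:]) <= max(val_history[:-patience]):
        return lr / factor
    return lr

# Example: the validation accuracy plateaus, so the LR is divided by 2.
print(step_lr(0.1, [0.52, 0.60, 0.64, 0.64, 0.63, 0.64]))  # -> 0.05
```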

Fine-tune on pre-trained models. Nowadays, many state-of-the-art deep networks are released by famous research groups, e.g., in the Caffe Model Zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo) and by the VGG Group (http://www.vlfeat.org/matconvnet/pretrained/). Thanks to the wonderful generalization ability of pre-trained deep models, you can employ these pre-trained models for your own applications directly. To further improve the classification performance on your data set, a very simple yet effective approach is to fine-tune the pre-trained models on your own data. As shown in the following table, the two most important factors are the size of the new data set (small or big) and its similarity to the original data set. Different fine-tuning strategies can be utilized in different situations. For instance, a good case is when your new data set is very similar to the data used for training the pre-trained models. In that case, if you have very little data, you can just train a linear classifier on the features extracted from the top layers of the pre-trained models. If you have quite a lot of data at hand, please fine-tune a few top layers of the pre-trained models with a small learning rate. However, if your own data set is quite different from the data used in the pre-trained models but you have enough training images, a larger number of layers should be fine-tuned on your data, also with a small learning rate, to improve performance. Finally, if your data set not only contains little data but is also very different from the data used in the pre-trained models, you will be in trouble. Since the data is limited, it seems better to only train a linear classifier. Since the data set is very different, it might not be best to train the classifier on features from the top of the network, which contain more dataset-specific features. Instead, it might work better to train an SVM classifier on activations/features from somewhere earlier in the network.
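The strategies can be summarized as follows:

| | Very similar data set | Very different data set |
| --- | --- | --- |
| Very little data | Train a linear classifier on top-layer features | Train a classifier (e.g., an SVM) on features from an earlier layer |
| Quite a lot of data | Fine-tune a few top layers with a small learning rate | Fine-tune a larger number of layers with a small learning rate |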



Fine-tune your data on pre-trained models. Different strategies of fine-tuning are utilized in different situations. Among data sets, Caltech-101 is similar to ImageNet, since both are object-centric image data sets, while the Places Database is different from ImageNet, since one is scene-centric and the other is object-centric.
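The pre-trained models mentioned above come from Caffe and MatConvNet; purely as an illustration, a PyTorch sketch of the "fine-tune a few top layers with a small learning rate" strategy could look like the following, with torchvision's resnet18 standing in for any pre-trained backbone and 10 classes as a placeholder:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)        # load a pre-trained backbone
for param in model.parameters():
    param.requires_grad = False                 # freeze everything first

model.fc = nn.Linear(model.fc.in_features, 10)  # new classifier head (10 classes)
for param in model.layer4.parameters():
    param.requires_grad = True                  # also unfreeze the top block

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9)                      # small learning rate for fine-tuning
```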

References & Source Links

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks (http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf). In NIPS, 2012.

[2] A Brief Overview of Deep Learning (http://yyue.blogspot.com/2015/01/a-brief-overview-of-deep-learning.html/), a guest post by Ilya Sutskever.

[3] CS231n: Convolutional Neural Networks for Visual Recognition, Stanford University, taught by Prof. Fei-Fei Li and Andrej Karpathy.

[4] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (http://arxiv.org/abs/1502.01852). In ICCV, 2015.

[5] B. Xu, N. Wang, T. Chen, and M. Li. Empirical Evaluation of Rectified Activations in Convolutional Network (http://arxiv.org/abs/1505.00853). In ICML Deep Learning Workshop, 2015.

[6] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting (http://jmlr.org/papers/v15/srivastava14a.html). JMLR, 15(Jun):1929-1958, 2014.

[7] X.-S. Wei, B.-B. Gao, and J. Wu. Deep Spatial Pyramid Ensemble for Cultural Event Recognition (http://lamda.nju.edu.cn/weixs/publication/iccvw15_CER.pdf). In ICCV ChaLearn Looking at People Workshop, 2015.

[8] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms (https://www.crcpress.com/Ensemble-Methods-Foundations-and-Algorithms/Zhou/9781439830031). Boca Raton, FL: Chapman & Hall/CRC, 2012. (ISBN 978-1-439-830031)

[9] M. Mohammadi and S. Das. S-NN: Stacked Neural Networks (http://cs231n.stanford.edu/reports/milad_final_report.pdf). Project in Stanford CS231n, Winter Quarter 2015.

[10] P. Hensman and D. Masko. The Impact of Imbalanced Training Data for Convolutional Neural Networks (http://www.diva-portal.org/smash/get/diva2:811111/FULLTEXT01.pdf). Degree Project in Computer Science, DD143X, 2015.