
Why Does Unsupervised Pre-training Help Deep Learning?

2016-08-27 11:17
This article is mainly a review of "Why Does Unsupervised Pre-training Help Deep Learning?", published in 2010 by Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pierre-Antoine Manzagol.

I. Introduction

Recent deep learning approaches such as Deep Belief Networks and stacks of auto-encoder variants essentially follow the same recipe: unsupervised pre-training first, then supervised fine-tuning, and with this recipe they achieve state-of-the-art results.
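To make the recipe concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code) of the two-phase procedure: greedy layer-wise unsupervised pre-training with denoising autoencoders, followed by supervised fine-tuning of the whole stack (the fine-tuning step is only indicated). The layer sizes, learning rate, corruption level and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, corruption=0.25, lr=0.1, epochs=10):
    """Train one denoising autoencoder (tied weights) on X; return encoder params."""
    n_in = X.shape[1]
    W = rng.uniform(-1/np.sqrt(n_in), 1/np.sqrt(n_in), (n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        Xc = X * (rng.random(X.shape) > corruption)   # corrupt inputs (masking noise)
        H = sigmoid(Xc @ W + b)                        # encode
        R = sigmoid(H @ W.T + c)                       # decode with tied weights
        dR = (R - X) * R * (1 - R)                     # grad of 0.5*||R - X||^2 wrt decoder pre-activation
        dH = (dR @ W) * H * (1 - H)                    # back-propagated to encoder pre-activation
        W -= lr * (Xc.T @ dH + dR.T @ H) / len(X)      # encoder + decoder contributions
        b -= lr * dH.mean(axis=0)
        c -= lr * dR.mean(axis=0)
    return W, b

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise pre-training: each layer trains on the previous layer's code."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)
    return params

# Toy unlabelled data; in the paper this would be MNIST-like images.
X = rng.random((256, 64))
stack = pretrain_stack(X, [32, 16])
# Supervised fine-tuning would now add an output layer on top of `stack` and
# back-propagate the classification error through all layers (omitted here).
```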

II. Experimental Observations

1. Better Generalization 
When choosing the number of units per layer, the learning rate and the number of training iterations to optimize classification error on the validation set, unsupervised pre-training gives substantially lower test classification error than no pre-training, for the same depth or for smaller depth on various vision data sets.
2. More Robust
These experiments show that the variance of final test error with respect to initialization random seed is larger without pre-training, and this effect is magnified for deeper architectures. It should however be noted that there is a limit to the success of this technique: performance degrades for 5 layers on this problem.
3. In summary
Better generalization that seems to be robust to random initializations is indeed achieved by pre-trained models, which indicates that unsupervised learning of P(X) is helpful in learning P(Y|X).

III. Explaining the Reasons
(1) Does the benefit come from preconditioning? That is, does unsupervised pre-training simply help us find a better distribution from which to draw the initial weights, instead of the commonly used uniform distribution on [−1/√k, 1/√k]?
The answer is no.
By "preconditioning" the authors mean re-drawing the initial parameters from a distribution matched to the pre-trained ones. This is done layer by layer: for each layer, an empirical marginal distribution is built from the parameters obtained by pre-training, and each initial parameter is then sampled independently from its layer's empirical distribution. In the authors' words:
“By conditioning, we mean the range and marginal distribution from which we draw initial weights. In other words, could we get the same performance advantage as unsupervised pre-training if we were still drawing the initial weights independently, but from a more suitable distribution than the uniform one? To verify this, we performed unsupervised pre-training, and computed marginal histograms for each layer’s pre-trained weights and biases (one histogram per each layer’s weights and biases). We then resampled new “initial” random weights and biases according to these histograms (independently for each parameter), and performed fine-tuning from there. The resulting parameters have the same marginal statistics as those obtained after unsupervised pre-training, but not the same joint distribution.”
What was the result?
With the usual uniform initialization, the test error averages 1.77 (std 0.10). With the method above, called Histogram initialization, it averages 1.69 (std 0.11). With unsupervised pre-training (Unsup. pre) it averages 1.37 (std 0.09). The Histogram method is only slightly better than uniform initialization, so preconditioning cannot explain the benefit.
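For concreteness, here is a small sketch (my own, not from the paper) of the Histogram control condition: build a marginal histogram of one layer's pre-trained weights and resample new initial weights independently from it, so the marginals match the pre-trained ones but the joint structure is lost. `pretrained_W` below is just a stand-in array.

```python
import numpy as np

rng = np.random.default_rng(1)

def resample_from_histogram(pretrained_W, n_bins=50):
    """Draw new weights i.i.d. from the empirical marginal of pretrained_W."""
    counts, edges = np.histogram(pretrained_W.ravel(), bins=n_bins)
    probs = counts / counts.sum()
    bins = rng.choice(n_bins, size=pretrained_W.size, p=probs)   # pick a bin per weight
    low, high = edges[bins], edges[bins + 1]                     # then sample uniformly within it
    return rng.uniform(low, high).reshape(pretrained_W.shape)

pretrained_W = rng.normal(0.0, 0.1, size=(64, 32))   # stand-in for one layer's pre-trained weights
W_init = resample_from_histogram(pretrained_W)       # same marginal, but independent entries
```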
(2) Can the benefit be explained by unsupervised pre-training lowering the training error ("The Effect of Pre-training on Training Error")? Again, the answer is no:
“The remarkable observation is rather that, at a same training cost level, the pre-trained models systematically yield a lower test cost than the randomly initialized ones. The advantage appears to
be one of better generalization rather than merely a better optimization procedure. ”
(3) The authors argue that unsupervised pre-training provides a prior (or regularizer) on the parameters. Unlike conventional regularization, this prior has no explicit penalty term and is discovered automatically from the data. In the authors' words:

“unsupervised pre-training appears to have a similar effect to that of a good regularizer or a good “prior” on the parameters, even though no explicit regularization term is apparent in the cost being
optimized. As we stated in the hypothesis, it might be reasoned that restricting the possible starting points in parameter space to those that minimize the unsupervised pre-training criterion (as with the SDAE), does in effect restrict the set of possible
final configurations for parameter values. Like regularizers in general, unsupervised pre-training (in this case, with denoising auto-encoders) might thus be seen as decreasing the variance and introducing a bias (towards parameter configurations suitable
for performing denoising). Unlike ordinary regularizers, unsupervised pre-training does so in a data-dependent manner.”
We now explore where this special, form-free, data-dependent regularizer comes from.
(4) The authors reason that if the effect really does come from regularization, then, by a typical property of regularizers, the benefit of regularization should grow as model capacity grows. Their hypothesis is as follows:

Another signature characteristic of regularization is that the effectiveness of regularization increases as capacity (e.g., the number of hidden units) increases, effectively trading off one constraint
on the model complexity for another. In this experiment we explore the relationship between the number of units per layer and the effectiveness of unsupervised pre-training. The hypothesis that unsupervised pre-training acts as a regularizer would suggest that
we should see a trend of increasing effectiveness of unsupervised pre-training as the number of units per layer are increased. 

However, the experiments show that this effect appears only when the layer size is large enough (around 100 hidden units per layer or more) and the network is deep enough; only then does the benefit of unsupervised pre-training grow with model complexity. For small networks, unsupervised pre-training is actually counterproductive. This was an unexpected experimental finding.
“What we observe is a more systematic effect: while unsupervised pre-training helps for larger layers and deeper networks, it also appears to hurt for too small networks.”
“As the model size decreases from 800 hidden units, the generalization error increases, and it increases more with unsupervised pre-training presumably because of the extra regularization effect: small
networks have a limited capacity already so further restricting it (or introducing an additional bias) can harm generalization. ”
Besides the general explanation above (a small model is already simple and needs no extra regularization), the authors give the following explanation:
The effect can be explained in terms of the role of unsupervised pre-training as promoting input transformations (in the hidden layers) that are useful at capturing the main variations in the input
distribution P(X). It may be that only a small subset of these variations are relevant for predicting the class label Y. When the hidden layers are small it is less likely that the transformations for predicting Y are included in the lot learned by unsupervised
pre-training. 
Put simply, when a small network is pre-trained without supervision, the transformations it learns on X may filter out features that are particularly useful for predicting Y; being unsupervised, it cannot know which features of X will matter for Y. A more complex network can retain more of these possibilities. This seems reasonable.
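As a sketch of the protocol behind this part (my own paraphrase, not the paper's code), the experiment amounts to sweeping the number of hidden units per layer and comparing test error with and without pre-training. `train_and_evaluate` is a hypothetical helper that would wrap the pre-training/fine-tuning code sketched earlier and return a test error; the sweep values are illustrative (800 is the size mentioned in the quote above).

```python
def run_capacity_sweep(train_and_evaluate,
                       layer_sizes=(25, 50, 100, 200, 400, 800), n_seeds=10):
    """Average test error for each (layer size, pre-training on/off) setting."""
    results = {}
    for n_units in layer_sizes:
        for pretrain in (True, False):
            errors = [train_and_evaluate(n_units, pretrain, seed=s) for s in range(n_seeds)]
            results[(n_units, pretrain)] = sum(errors) / n_seeds
    return results

# The regularization hypothesis predicts that the gap between pretrain=True and
# pretrain=False widens as n_units grows; the paper observes this for large layers,
# but the opposite for very small ones.
```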

(5) The authors do not believe the benefit comes from better optimization ("Challenging the Optimization Hypothesis"). Deep networks are notoriously hard to train, so one might suspect that unsupervised pre-training simply leads to a local optimum with a lower training cost.
The authors question the experimental design of Bengio et al. (2007), which used early stopping. They argue that early stopping is itself a regularizer, and that without it the earlier conclusion no longer holds:
“Figure 10 shows what happens without early stopping. The training error is still higher for pre-trained networks even though the generalization error is lower. This result now favors the regularization
hypothesis against the optimization story. What may have happened is that early stopping prevented the networks without pre-training from moving too much towards their apparent local minimum.”
Since networks with unsupervised pre-training generalize better yet have higher training error, the optimization hypothesis does not hold. And because Bengio et al. (2007) used early stopping, the networks without pre-training were in effect also regularized, which kept them from moving too far toward their apparent local minimum.
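For reference, here is a minimal sketch of the early-stopping rule in question (a standard formulation I am assuming, not the paper's code): keep the parameters with the best validation error and stop when it has not improved for `patience` epochs. Removing this rule is what isolates the optimization question in Figure 10 of the paper.

```python
import copy

def train_with_early_stopping(model, train_epoch, valid_error, max_epochs=200, patience=10):
    """Supervised fine-tuning with early stopping on a held-out validation set."""
    best_err, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for _ in range(max_epochs):
        train_epoch(model)                 # one pass over the training set
        err = valid_error(model)           # error on the validation set
        if err < best_err:
            best_err, best_model, since_best = err, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:     # no recent improvement: stop (implicit regularization)
                break
    return best_model, best_err
```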

Now the question becomes: we know the magic of unsupervised pre-training comes from some kind of prior or regularizer, but since this regularizer has no explicit penalty term, it is hard to tell what it actually looks like. In the following experiments the authors try to pin down its content. Recall (see The Elements of Statistical Learning) that regularization can be derived from Bayes' theorem: placing a prior distribution on the parameters and applying Bayes' theorem yields their posterior, and different priors lead to different regularization terms. The familiar L1 and L2 penalties correspond to a zero-mean prior on the parameters (for example, a zero-mean Gaussian). The zero-mean assumption expresses a preference for models that are as simple as possible, which is motivated by an important principle in machine learning, Occam's razor. With all that said, let us see whether the implicit regularizer obtained through unsupervised pre-training is just L1 or L2. (Spoiler: it is certainly not that simple.)
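To make the prior-regularizer correspondence explicit, here is the standard textbook MAP derivation (my addition, following e.g. The Elements of Statistical Learning): a zero-mean Gaussian prior yields the L2 penalty, a zero-mean Laplace prior yields the L1 penalty.

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\ \log p(Y \mid X,\theta) + \log p(\theta)
  = \arg\min_{\theta}\ \underbrace{-\log p(Y \mid X,\theta)}_{\text{training loss}} \;-\; \log p(\theta).

\text{Gaussian prior } p(\theta) \propto e^{-\lambda \lVert\theta\rVert_2^2}
  \;\Rightarrow\; -\log p(\theta) = \lambda \lVert\theta\rVert_2^2 + \text{const} \quad (\text{L2}),
\qquad
\text{Laplace prior } p(\theta) \propto e^{-\lambda \lVert\theta\rVert_1}
  \;\Rightarrow\; -\log p(\theta) = \lambda \lVert\theta\rVert_1 + \text{const} \quad (\text{L1}).
```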
(6) Unmasking the regularizer: is it L1 or L2?
The authors compare neural networks trained with explicit L1/L2 penalty terms against networks initialized by unsupervised pre-training:
“We found that while in the case of MNIST a small penalty can in principle help, the gain is nowhere near as large as it is with pre-training. For InfiniteMNIST, the optimal amount of L1 and L2 regularization is zero.”
The finding: on a simple task like MNIST, such explicit regularization helps a little; on the harder InfiniteMNIST task it is essentially worthless.
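For clarity, a brief sketch (illustrative, not the paper's code) of what "adding an explicit L1/L2 penalty" means in this comparison: the penalty is simply added to the supervised cost, in contrast with the implicit, data-dependent regularization that pre-training provides.

```python
import numpy as np

def penalized_cost(supervised_cost, weights, l1=0.0, l2=0.0):
    """Supervised cost plus explicit L1/L2 weight penalties (the comparison baseline)."""
    penalty = sum(l1 * np.abs(W).sum() + l2 * (W ** 2).sum() for W in weights)
    return supervised_cost + penalty

# e.g. penalized_cost(cross_entropy_value, [W1, W2, W3], l2=1e-4), with hypothetical
# weight matrices W1..W3 taken from the fine-tuned network.
```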
The authors then comment:
“This is not an entirely surprising finding: not all regularizers are created equal and these results are consistent with the literature on semi-supervised training that shows that unsupervised learning
can be exploited as a particularly effective form of regularization. ”

Not all regularizers are created equal.

(7) Summary
In short, the effect is regularization, and not ordinary regularization at that: it is not explained by the optimization hypothesis, nor by matching marginal distributions at initialization. It is regularization induced by a particular, data-dependent prior, and its behavior is reminiscent of early stopping and of semi-supervised learning.