An Explanation of the Xavier and MSRA Weight Initialization Schemes in Caffe
2016-11-14 17:45
If you work through the Caffe MNIST tutorial, you’ll come across this curious line
weight_filler { type: "xavier" }
and the accompanying explanation
For the weight filler, we will use the xavier algorithm that automatically determines the scale of initialization based on the number of input and output neurons.
Unfortunately, as of the time this post was written, Google hasn’t heard much about “the xavier algorithm”. To work out what it is, you need to poke around the Caffe source until you find the right docstring and then read the referenced paper, Xavier Glorot & Yoshua Bengio’s “Understanding the difficulty of training deep feedforward neural networks”.
Why’s Xavier initialization important?
In short, it helps signals reach deep into the network.
If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.
Xavier initialization makes sure the weights are ‘just right’, keeping the signal in a reasonable range of values through many layers.
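To make that concrete, here’s a minimal sketch (plain NumPy, not Caffe code; the layer width and depth are arbitrary choices of mine) that pushes a unit-variance signal through a stack of purely linear layers at three weight scales:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50                 # layer width and depth, chosen arbitrarily
x = rng.standard_normal(n)         # unit-variance input signal

for label, std in [("too small", 0.5 / np.sqrt(n)),
                   ("too large", 2.0 / np.sqrt(n)),
                   ("Xavier",    1.0 / np.sqrt(n))]:
    h = x
    for _ in range(depth):
        # one linear layer with zero-mean Gaussian weights of the given scale
        h = rng.normal(0.0, std, size=(n, n)) @ h
    print(f"{label:9s}: signal std after {depth} layers = {np.std(h):.3g}")
```

The “too small” run collapses toward zero, the “too large” run blows up by many orders of magnitude, and the Xavier-scaled run keeps the signal’s standard deviation near 1.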
To go any further than this, you’re going to need a small amount of statistics - specifically you need to know about random distributions and their variance.
Okay, hit me with it. What’s Xavier initialization?
In Caffe, it’s initializing the weights in your network by drawing them from a distribution with zero mean and a specific variance,

$$\mathrm{Var}(W) = \frac{1}{n_\text{in}}$$
where $W$ is the initialization distribution for the neuron in question, and $n_\text{in}$ is the number of neurons feeding into it. The distribution used is typically Gaussian or uniform.
It’s worth mentioning that Glorot & Bengio’s paper originally recommended using

$$\mathrm{Var}(W) = \frac{2}{n_\text{in} + n_\text{out}}$$

where $n_\text{out}$ is the number of neurons the result is fed to. We’ll come to why Caffe’s scheme might be different in a bit.
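As a sketch of what the two recipes mean in practice (NumPy rather than Caffe’s C++ fillers; the function names are mine), assuming a weight matrix of shape (n_out, n_in):

```python
import numpy as np

def xavier_caffe(n_in, n_out, rng=None):
    """Caffe-style Xavier: Var(W) = 1/n_in, drawn here as a Gaussian."""
    rng = rng or np.random.default_rng()
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def xavier_glorot(n_in, n_out, rng=None):
    """Glorot & Bengio's variant: Var(W) = 2/(n_in + n_out).

    A uniform draw on [-a, a] has variance a^2/3, so matching the target
    variance gives a = sqrt(6 / (n_in + n_out)).
    """
    rng = rng or np.random.default_rng()
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))
```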
And where did those formulas come from?
Suppose we have an input $X$ with $n$ components and a linear neuron with random weights $W$ that spits out a number $Y$. What’s the variance of $Y$? Well, we can write

$$Y = W_1 X_1 + W_2 X_2 + \cdots + W_n X_n$$
And from Wikipedia we can work out that $W_i X_i$ is going to have variance

$$\mathrm{Var}(W_i X_i) = E[X_i]^2\,\mathrm{Var}(W_i) + E[W_i]^2\,\mathrm{Var}(X_i) + \mathrm{Var}(W_i)\,\mathrm{Var}(X_i)$$
Now if our inputs and weights both have mean $0$, that simplifies to

$$\mathrm{Var}(W_i X_i) = \mathrm{Var}(W_i)\,\mathrm{Var}(X_i)$$
Then if we make a further assumption that the $X_i$ and $W_i$ are all independent and identically distributed, we can work out that the variance of $Y$ is

$$\mathrm{Var}(Y) = \mathrm{Var}(W_1 X_1 + W_2 X_2 + \cdots + W_n X_n) = n\,\mathrm{Var}(W_i)\,\mathrm{Var}(X_i)$$
Or in words: the variance of the output is the variance of the input, but scaled by $n\,\mathrm{Var}(W_i)$. So if we want the variance of the input and output to be the same, that means $n\,\mathrm{Var}(W_i)$ should be $1$. Which means the variance of the weights should be

$$\mathrm{Var}(W_i) = \frac{1}{n} = \frac{1}{n_\text{in}}$$
Voila. There’s your Caffe-style Xavier initialization.
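A quick Monte Carlo sanity check of the derivation (a sketch, assuming NumPy; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 50_000
X = rng.standard_normal((trials, n))                # Var(X_i) = 1
W = rng.normal(0.0, np.sqrt(1.0 / n), (trials, n))  # Var(W_i) = 1/n
Y = (W * X).sum(axis=1)                             # Y = sum_i W_i X_i
print(np.var(X), np.var(Y))                         # both come out close to 1.0
```

With $\mathrm{Var}(W_i) = 1/n$, the empirical variance of $Y$ matches the variance of the inputs, as the formula predicts.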
Glorot & Bengio’s formula needs a tiny bit more work. If you go through the same steps for the backpropagated signal, you find that you need

$$\mathrm{Var}(W_i) = \frac{1}{n_\text{out}}$$

to keep the variance of the input gradient and the output gradient the same. These two constraints can only be satisfied simultaneously if $n_\text{in} = n_\text{out}$, so as a compromise, Glorot & Bengio take the average of the two:

$$\mathrm{Var}(W_i) = \frac{2}{n_\text{in} + n_\text{out}}$$
I’m not sure why the Caffe authors used the $n_\text{in}$-only variant. The two possibilities that come to mind are:

- that preserving the forward-propagated signal is much more important than preserving the back-propagated one.
- that for implementation reasons, it’s a pain to find out how many neurons in the next layer consume the output of the current one.
That seems like an awful lot of assumptions.
It is. But it works. Xavier initialization was one of the big enablers of the move away from per-layer generative pre-training.

The assumption most worth talking about is the “linear neuron” bit. This is justified in Glorot & Bengio’s paper because immediately after initialization, the parts of the traditional nonlinearities - $\tanh$, $\mathrm{sigm}$ - that are being explored are the bits close to zero, where the gradient is close to $1$.
For the more recent rectifying nonlinearities, that doesn’t hold, and in a recent paper by He, Zhang, Ren and Sun they build on Glorot & Bengio and suggest using

$$\mathrm{Var}(W) = \frac{2}{n_\text{in}}$$

instead. Which makes sense: a rectifying linear unit is zero for half of its input, so you need to double the weight variance to keep the signal’s variance constant.
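Caffe ships this variant as the “msra” weight filler (hence the MSRA in this post’s title). A NumPy sketch of the same recipe, with a hypothetical function name of mine:

```python
import numpy as np

def msra_filler(n_in, n_out, rng=None):
    """He et al. / 'msra' initialization: Var(W) = 2/n_in, Gaussian draw.

    The doubled variance compensates for the ReLU zeroing out roughly
    half of the layer's pre-activations.
    """
    rng = rng or np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
```

In a prototxt this would simply be weight_filler { type: "msra" } in place of the xavier line above.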