
Common Optimization Algorithms (corresponding parameters in Caffe and TensorFlow)



Algorithm visualization

(animated visualizations comparing the optimizers; images not preserved in this copy)

Common algorithms

SGD

x += -learning_rate*dx


Momentum

Momentum keeps SGD from getting stuck oscillating around local saddle points, and also provides a certain amount of acceleration.

Momentum may overshoot the target considerably at first, but it usually corrects itself gradually.

v = mu*v - learning_rate*dx
x += v


Nesterov momentum

The basic idea is that, instead of evaluating the gradient dx at x, it is evaluated at x + mu*v, and the momentum update is then applied as usual.

In other words, first step to where the momentum would take you, then compute the gradient update there.

v_t = μ·v_{t−1} − ε·∇f(θ_{t−1} + μ·v_{t−1})

θ_t = θ_{t−1} + v_t

Evaluating ∇f(θ_{t−1} + μ·v_{t−1}) directly is inconvenient, so substitute φ_{t−1} = θ_{t−1} + μ·v_{t−1}; plugging this back into the formulas above gives

v_t = μ·v_{t−1} − ε·∇f(φ_{t−1})

φ_t = φ_{t−1} − μ·v_{t−1} + (1 + μ)·v_t

v_prev = v                         # keep the old velocity
v = mu*v - learning_rate*dx        # velocity update, as in standard momentum
x += -mu*v_prev + (1+mu)*v         # update in terms of the lookahead variable (x plays the role of φ)


AdaGrad

Uses the accumulated sum of each variable's squared historical gradients as the denominator of the update, which balances out large differences in gradient magnitude across variables.

cache += dx**2
x += -learning_rate*dx/(np.sqrt(cache)+1e-7)


RMSProp

Adds a decay factor on top of AdaGrad so that the accumulated squared gradients do not grow unboundedly.

cache = decay_rate*cache + (1-decay_rate)*dx**2
x += -learning_rate*dx/(np.sqrt(cache)+1e-7)


ADAM

Initial version: essentially RMSProp with momentum added

m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += -learning_rate*m / (np.sqrt(v)+1e-7)


The actual update rule is as follows:

m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
mb = m/(1-beta1**t)   # t is step number
vb = v/(1-beta2**t)
x += -learning_rate*mb / (np.sqrt(vb)+1e-7)


mb and vb act as a bias correction (a warm-up) during the first steps; once t is large, (1 - beta1**t) ≈ 1 and the correction has no effect.
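
To see the size of this correction, a tiny check (an illustrative snippet of my own, using the common default betas):

beta1, beta2 = 0.9, 0.999
for t in (1, 10, 100, 1000):
    # bias-correction denominators applied to m and v at step t
    print(t, 1 - beta1**t, 1 - beta2**t)
# at t=1 the factors are 0.1 and 0.001 (strong correction);
# by t=1000 they are about 1.0 and 0.63, so the correction fades out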

Second-order optimization methods

Second-order Taylor expansion:

J(θ) ≈ J(θ_0) + (θ − θ_0)^T ∇_θ J(θ_0) + ½ (θ − θ_0)^T H (θ − θ_0)

θ* = θ_0 − H^{−1} ∇_θ J(θ_0)
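
As a minimal illustration of the Newton update above (a sketch of my own on a toy quadratic, not part of the original post):

import numpy as np

# toy quadratic J(theta) = 0.5 * theta^T A theta - b^T theta, so the Hessian H = A
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, 2.0])
theta0 = np.zeros(2)
grad = A @ theta0 - b                           # gradient of J at theta0
theta_star = theta0 - np.linalg.solve(A, grad)  # theta* = theta0 - H^{-1} grad
# for a quadratic, this single Newton step lands exactly on the minimizer A^{-1} b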

Quasi-Newton methods (BFGS) with an approximate inverse Hessian matrix

L-BFGS (limited memory BFGS)

Does not form/store the full inverse Hessian.

Usually works very well in full batch, deterministic mode
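
In practice L-BFGS usually comes from a library rather than being hand-written; for example, a minimal sketch with SciPy (my own example, not mentioned in the original post):

import numpy as np
from scipy.optimize import minimize

def rosen(x):
    # Rosenbrock function: a standard full-batch, deterministic test problem
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

res = minimize(rosen, np.zeros(5), method='L-BFGS-B')  # limited-memory BFGS
print(res.x)  # converges close to the optimum at all ones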

Practical experience

ADAM usually gives good results and converges much faster than SGD.

L-BFGS is suitable when the optimization is done over the full batch.

Sometimes several optimizers can be combined, e.g. warming up with SGD and then switching to ADAM.

For more unusual requirements, e.g. in DeepBit where the convergence of two losses has to be controlled separately, the slower SGD is a better fit.

TensorFlow parameters for the different optimizers

SGD

optimizer = tf.train.GradientDescentOptimizer(learning_rate=self.learning_rate)

Momentum

optimizer = tf.train.MomentumOptimizer(lr, 0.9)
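
Nesterov momentum (discussed above but not listed in the original) is, as far as I know, available through the same optimizer's use_nesterov flag in the TF 1.x API:

optimizer = tf.train.MomentumOptimizer(lr, 0.9, use_nesterov=True)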

AdaGrad

optimizer = tf.train.AdagradOptimizer(learning_rate=self.learning_rate)

RMSProp

optimizer = tf.train.RMSPropOptimizer(0.001, 0.9)

ADAM

optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate, epsilon=1e-08)

Some optimizer-specific parameters need to be looked up in the official TensorFlow documentation.

Optimize directly:

train_op = optimizer.minimize(loss)

Or extract the gradients explicitly, e.g. for clipping:

gradients, v = zip(*optimizer.compute_gradients(loss))

gradients, _ = tf.clip_by_global_norm(gradients, self.max_gradient_norm)

train_op = optimizer.apply_gradients(zip(gradients, v), global_step=self.global_step)
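
Either way, the resulting train_op is then executed inside a session; a minimal sketch, assuming loss, the feed dict feed, and the step count num_steps are defined elsewhere (these names are illustrative, not from the original):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        # run one optimization step and fetch the current loss value
        _, loss_val = sess.run([train_op, loss], feed_dict=feed)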

Caffe parameters for the different optimizers

Caffe's optimization is configured by setting the corresponding parameters in solver.prototxt.

The type field selects the optimization algorithm.

One pitfall is that the accepted type strings differ between versions (ADAM vs. Adam), so check the code of the version you are using.

* Stochastic Gradient Descent (type: "SGD"),

* AdaDelta (type: "AdaDelta"),

* Adaptive Gradient (type: "AdaGrad"),

* Adam (type: "Adam"),

* Nesterov's Accelerated Gradient (type: "Nesterov") and

* RMSprop (type: "RMSProp")

SGD

base_lr: 0.01
lr_policy: "step"    # 也可以使用指数,多项式等等
gamma: 0.1
stepsize: 1000
max_iter: 3500
momentum: 0.9
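
As a worked example of the step policy with these settings (my own arithmetic, following lr = base_lr * gamma ^ floor(iter / stepsize)): the learning rate is 0.01 for iterations 0-999, 0.001 for 1000-1999, 0.0001 for 2000-2999, and 0.00001 from 3000 up to max_iter = 3500.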


AdaDelta

net: "examples/mnist/lenet_train_test.prototxt"
test_iter: 100
test_interval: 500
base_lr: 1.0
lr_policy: "fixed"
momentum: 0.95
weight_decay: 0.0005
display: 100
max_iter: 10000
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet_adadelta"
solver_mode: GPU
type: "AdaDelta"
delta: 1e-6


AdaGrad

net: "examples/mnist/mnist_autoencoder.prototxt"
test_state: { stage: 'test-on-train' }
test_iter: 500
test_state: { stage: 'test-on-test' }
test_iter: 100
test_interval: 500
test_compute_loss: true
base_lr: 0.01
lr_policy: "fixed"
display: 100
max_iter: 65000
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "examples/mnist/mnist_autoencoder_adagrad_train"
# solver mode: CPU or GPU
solver_mode: GPU
type: "AdaGrad"


Nesterov

base_lr: 0.01
lr_policy: "step"
gamma: 0.1
weight_decay: 0.0005
momentum: 0.95
type: "Nesterov"


ADAM

train_net: "nin_train_val.prototxt"
base_lr: 0.001
###############
##### step:base_lr * gamma ^ (floor(iter / stepsize))
#lr_policy: "step"
#gamma: 0.1
#stepsize: 25000
##### multi-step:
#lr_policy: "multistep"
#gamma: 0.5
#stepvalue: 1000
#stepvalue: 2000
#stepvalue: 3000
#stepvalue: 4000
#stepvalue: 5000
#stepvalue: 10000
#stepvalue: 20000
###### inv:base_lr * (1 + gamma * iter) ^ (- power)
# lr_policy: "inv"
# gamma: 0.0001
# power: 2
##### exp:base_lr * gamma ^ iter
# lr_policy: "exp"
# gamma: 0.9
##### poly:base_lr (1 - iter/max_iter) ^ (power)
# lr_policy: "poly"
# power: 0.9
##### sigmoid:base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
# lr_policy: "sigmoid"
# gamma: 0.9
#momentum: 0.9
solver_type: ADAM
momentum: 0.9
momentum2: 0.999
delta: 1e-8
lr_policy: "fixed"

display: 100
max_iter: 50000
weight_decay: 0.0005
snapshot: 5000
snapshot_prefix: "./stage1/sgd_DeepBit1024_alex_stage1"
solver_mode: GPU


RMSProp

net: "examples/mnist/lenet_train_test.prototxt"
test_iter: 100
test_interval: 500
base_lr: 1.0
lr_policy: "fixed"
momentum: 0.95
weight_decay: 0.0005
display: 100
max_iter: 10000
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet_adadelta"
solver_mode: GPU
type: "RMSProp"
rms_decay: 0.98