
A practical guide: parsing the parameter settings in solver.prototxt when training models with Caffe

2017-03-31 19:50
I have previously published posts dissecting Caffe's layers, and that series on the commonly used layers is still being updated. This post is an interlude whose goal is to settle, once and for all, how to set the parameters when training a model with Caffe. Why write it? Because I recently needed to build my own solver.prototxt for a custom network, and when I was using other people's networks before, I left most of the settings untouched. As an example, here is the solver configuration file from the official Caffe examples for training LeNet:

# The train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU

This file probably looks familiar, because it is where we customize a few parameters ourselves, such as the maximum number of iterations, the base learning rate, and the number of test iterations. Yet some of the other parameters are ones that beginners may never have tried changing. So in what follows I will explain in detail how to set our own parameters when writing a solver.prototxt. First, as usual, let us open caffe.proto and look at the definition of the solver parameters; following convention, here are the source code and comments:
message SolverParameter {
//////////////////////////////////////////////////////////////////////////////
// Specifying the train and test networks
//
// Exactly one train net must be specified using one of the following fields:
// train_net_param, train_net, net_param, net
// One or more test nets may be specified using any of the following fields:
// test_net_param, test_net, net_param, net
// If more than one test net field is specified (e.g., both net and
// test_net are specified), they will be evaluated in the field order given
// above: (1) test_net_param, (2) test_net, (3) net_param/net.
// A test_iter must be specified for each test_net.
// A test_level and/or a test_stage may also be specified for each test_net.
//////////////////////////////////////////////////////////////////////////////

// Proto filename for the train net, possibly combined with one or more
// test nets.
optional string net = 24;//prototxt file that defines the network
// Inline train net param, possibly combined with one or more test nets.
optional NetParameter net_param = 25;//inline definition of the network parameters

optional string train_net = 1; // Proto filename for the train net.//prototxt file defining the training network
repeated string test_net = 2; // Proto filenames for the test nets.//prototxt files defining the test networks
optional NetParameter train_net_param = 21; // Inline train net params.//inline parameters of the training network
repeated NetParameter test_net_param = 22; // Inline test net params.//inline parameters of the test networks

// The states for the train/test nets. Must be unspecified or
// specified once per net.
//
// By default, all states will have solver = true;
// train_state will have phase = TRAIN,
// and all test_state's will have phase = TEST.
// Other defaults are set according to the NetState defaults.
optional NetState train_state = 26;//state (phase) the network runs in during training
repeated NetState test_state = 27;//state (phase) the network runs in during testing

// The number of iterations for each test net.
repeated int32 test_iter = 3;//number of iterations per test pass; test_iter * test batch size = size of the test set

// The number of iterations between two testing phases.
optional int32 test_interval = 4 [default = 0];//run a test pass every this many training iterations
optional bool test_compute_loss = 19 [default = false];//whether to compute the loss during testing
// If true, run an initial test pass before the first iteration,
// ensuring memory availability and printing the starting value of the loss.
optional bool test_initialization = 32 [default = true];//if true, test the model with its random initial weights before training starts; usually left true
optional float base_lr = 5; // The base learning rate//base learning rate
// the number of iterations between displaying info. If display = 0, no info
// will be displayed.
optional int32 display = 6;//print training info every this many iterations
// Display the loss averaged over the last average_loss iterations
optional int32 average_loss = 33 [default = 1];//the displayed loss is averaged over the last average_loss training iterations
optional int32 max_iter = 7; // the maximum number of iterations//maximum number of training iterations
// accumulate gradients over `iter_size` x `batch_size` instances
optional int32 iter_size = 36 [default = 1];//accumulate gradients over this many batches before each update (effectively multiplies the batch size); default 1

// The learning rate decay policy. The currently implemented learning rate
// policies are as follows:
// - fixed: always return base_lr.
// - step: return base_lr * gamma ^ (floor(iter / step))
// - exp: return base_lr * gamma ^ iter
// - inv: return base_lr * (1 + gamma * iter) ^ (- power)
// - multistep: similar to step but it allows non uniform steps defined by
// stepvalue
// - poly: the effective learning rate follows a polynomial decay, to be
// zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
// - sigmoid: the effective learning rate follows a sigmod decay
// return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
optional string lr_policy = 8;//learning rate policy, one of the schemes listed above ......................................(3)
optional float gamma = 9; // The parameter to compute the learning rate.//parameter used when computing the learning rate, see the lr_policy discussion below
optional float power = 10; // The parameter to compute the learning rate.//parameter used when computing the learning rate, see the lr_policy discussion below
optional float momentum = 11; // The momentum value.//momentum, the factor with which the previous update influences the current one
optional float weight_decay = 12; // The weight decay.//weight decay, the coefficient of the regularization term in the loss
// regularization types supported: L1 and L2
// controlled by weight_decay
optional string regularization_type = 29 [default = "L2"];//form of the regularization term (L1 or L2)
// the stepsize for learning rate policy "step"
optional int32 stepsize = 13;//step length for the "step" learning rate policy
// the stepsize for learning rate policy "multistep"
repeated int32 stepvalue = 34;//step points for the "multistep" learning rate policy

// Set clip_gradients to >= 0 to clip parameter gradients to that L2 norm,
// whenever their actual L2 norm is larger.
optional float clip_gradients = 35 [default = -1];//if >= 0, scale the gradients down whenever their L2 norm exceeds clip_gradients ......................................(2)
optional int32 snapshot = 14 [default = 0]; // The snapshot interval//save a snapshot of the weights every this many iterations
optional string snapshot_prefix = 15; // The prefix for the snapshot.//filename prefix for the saved model files
// whether to snapshot diff in the results or not. Snapshotting diff will help
// debugging but the final protocol buffer size will be much larger.
optional bool snapshot_diff = 16 [default = false];//whether to also snapshot the gradients (diffs)
enum SnapshotFormat {//enumeration of the formats in which model weights can be saved
HDF5 = 0;
BINARYPROTO = 1;
}
optional SnapshotFormat snapshot_format = 37 [default = BINARYPROTO];//format of the saved model files
// the mode solver will use: 0 for CPU and 1 for GPU. Use GPU in default.
enum SolverMode {//enumeration of the training modes, either CPU or GPU
CPU = 0;
GPU = 1;
}
optional SolverMode solver_mode = 17 [default = GPU];//training mode
// the device_id that will be used in GPU mode. Use device_id = 0 by default.
optional int32 device_id = 18 [default = 0];//GPU device id; 0 (i.e. GPU 0) when training on a single GPU
// If non-negative, the seed with which the Solver will initialize the Caffe
// random number generator -- useful for reproducible results. Otherwise,
// (and by default) initialize using a seed derived from the system clock.
optional int64 random_seed = 20 [default = -1];//if non-negative, fixes the seed of Caffe's random number generator so runs are reproducible

// type of the solver
optional string type = 40 [default = "SGD"];//solver (gradient descent) type; usually SGD ......................................(1)

// numerical stability for RMSProp, AdaGrad and AdaDelta and Adam
optional float delta = 31 [default = 1e-8];//delta term for numerical stability in the RMSProp, AdaGrad, AdaDelta and Adam solvers
// parameters for the Adam solver
optional float momentum2 = 39 [default = 0.999];//second momentum term of the Adam solver

// RMSProp decay value
// MeanSquare(t) = rms_decay*MeanSquare(t-1) + (1-rms_decay)*SquareGradient(t)
optional float rms_decay = 38 [default = 0.99];//decay rate of the RMSProp solver

// If true, print information about the state of the net that may help with
// debugging learning problems.
optional bool debug_info = 23 [default = false];//whether to print debugging information about the state of the net

// If false, don't save a snapshot after training finishes.
optional bool snapshot_after_train = 28 [default = true];//if true, save one final snapshot after training finishes; if false, do not

// DEPRECATED: old solver enum types, use string instead
enum SolverType {
SGD = 0;
NESTEROV = 1;
ADAGRAD = 2;
RMSPROP = 3;
ADADELTA = 4;
ADAM = 5;
}
// DEPRECATED: use type instead of solver_type
optional SolverType solver_type = 30 [default = SGD];
}
The comments above explain each parameter in detail. In my view, solver.prototxt first defines the parameters that describe the networks and then the parameters that govern training, including many of the ones we use most often, such as weight_decay, the learning rate, the learning rate policy, the training mode, and the weight update (solver) type; see the code and comments above for the details. Among them, I want to highlight three points, marked (1), (2) and (3) above. Point (1) is the simplest: when choosing the gradient descent variant, I almost always pick SGD (Stochastic Gradient Descent), and it gives very good results; the other solver types are hardly ever needed.
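If you do want to try a solver other than SGD, only the type field (plus that solver's own hyperparameters) needs to change in solver.prototxt. Below is a minimal sketch, assuming the LeNet network file from the example above and using illustrative, untuned values, that switches the solver to Adam:

# hypothetical solver.prototxt fragment -- the values are illustrative, not tuned
net: "examples/mnist/lenet_train_test.prototxt"
test_iter: 100
test_interval: 500
base_lr: 0.001
lr_policy: "fixed"
type: "Adam"          # solver type given as a string (replaces the deprecated solver_type enum)
momentum: 0.9         # Adam's first-moment decay rate
momentum2: 0.999      # Adam's second-moment decay rate
delta: 1e-8           # numerical stability term
max_iter: 10000
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet_adam"
solver_mode: GPU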

The second point is the clip_gradients parameter, which keeps the gradients within a bound: whenever their overall L2 norm exceeds clip_gradients, they are scaled down. Let us first look at the source code in sgd_solver.cpp that implements this parameter:
template <typename Dtype>
void SGDSolver<Dtype>::ClipGradients() {
const Dtype clip_gradients = this->param_.clip_gradients();//read the clip_gradients parameter
if (clip_gradients < 0) { return; }//if it is negative, clipping is disabled, so return immediately
const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();//get the network's learnable parameters
Dtype sumsq_diff = 0;
for (int i = 0; i < net_params.size(); ++i) {
sumsq_diff += net_params[i]->sumsq_diff();//accumulate the sum of squared gradients over all learnable blobs
}
const Dtype l2norm_diff = std::sqrt(sumsq_diff);//take the square root to get the L2 norm of the gradients
if (l2norm_diff > clip_gradients) {//if the gradient L2 norm exceeds clip_gradients
Dtype scale_factor = clip_gradients / l2norm_diff;//compute a scale factor
LOG(INFO) << "Gradient clipping: scaling down gradients (L2 norm "
<< l2norm_diff << " > " << clip_gradients << ") "
<< "by scale factor " << scale_factor;
for (int i = 0; i < net_params.size(); ++i) {
net_params[i]->scale_diff(scale_factor);//scale every gradient by that factor
}
}
}
As the source above makes clear, clip_gradients rescales the gradients so they stay within range. Its purpose is to deal with exploding gradients: in the very first iterations the gradients can become extremely large, and this parameter bounds their magnitude. clip_gradients is used mostly when training LSTMs.
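Turning it on takes a single extra line in solver.prototxt. A minimal sketch, assuming a hypothetical LSTM net definition and an illustrative threshold:

# hypothetical fragment -- the net path and the threshold are placeholders, not recommendations
net: "examples/my_lstm/lstm_train_test.prototxt"
base_lr: 0.01
lr_policy: "fixed"
clip_gradients: 10    # scale the gradients down whenever their L2 norm exceeds 10
max_iter: 100000
solver_mode: GPU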
The third point is the lr_policy parameter, which defines how the learning rate changes during training. The base learning rate is set by base_lr, but in the course of training the effective learning rate usually changes over time. Caffe offers seven policies: fixed, step, exp, inv, multistep, poly and sigmoid. Since multistep is essentially step with non-uniform step points, we set it aside and go through the remaining six:

// - fixed: always return base_lr.//the learning rate never changes
// - step: return base_lr * gamma ^ (floor(iter / stepsize))//the learning rate drops by a factor of gamma every stepsize iterations
// - exp: return base_lr * gamma ^ iter//exponential decay with base gamma and the iteration count as the exponent
// - inv: return base_lr * (1 + gamma * iter) ^ (- power)//decay with base (1 + gamma * iter) and exponent -power
// - poly: the effective learning rate follows a polynomial decay, to be
// zero by the max_iter. return base_lr * (1 - iter/max_iter) ^ (power)//polynomial decay of the learning rate
// - sigmoid: the effective learning rate follows a sigmoid decay//sigmoid-shaped change of the learning rate
// return base_lr * ( 1/(1 + exp(-gamma * (iter - stepsize))))
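As a concrete check, the LeNet solver at the top of this post uses the inv policy with base_lr 0.01, gamma 0.0001 and power 0.75, so its effective learning rate is 0.01 * (1 + 0.0001 * iter)^(-0.75): 0.01 at iteration 0, roughly 0.0074 at iteration 5000, and roughly 0.0059 at iteration 10000, i.e. a gentle decay over the whole run.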
A verbal description alone does not convey much, so I will illustrate the learning rate policies graphically. Plotting them in MATLAB makes the behavior easy to see; I drew the curves for five of the learning rate policies, shown in the figure below:
[Figure: learning rate versus iteration for the five lr_policy curves, plotted in MATLAB]
Among these policies, the one we use most often is step, where the learning rate drops once every step period; it is a very sound way of changing the learning rate. LeNet, for its part, uses the inv policy, where the learning rate starts relatively high and then falls quickly. The exp policy makes the learning rate decay exponentially, the poly policy makes it decrease fairly evenly, and the sigmoid policy keeps the learning rate very low at first and lets it approach base_lr after the first step period.

Working through the learning rate policies also ties the gamma and power fields of the solver together in a meaningful way: both of them exist to shape how the learning rate changes. When designing the training parameters, we can therefore pick the decay policy deliberately and adjust these parameters to match.
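For instance, if we wanted the LeNet solver at the top of this post to use a step schedule instead of inv, only a few lines would change. A minimal sketch with illustrative, untuned values:

# hypothetical fragment: step decay, multiplying the learning rate by 0.1 every 2500 iterations
base_lr: 0.01
lr_policy: "step"
gamma: 0.1          # decay factor applied at each step
stepsize: 2500      # drop the learning rate every 2500 iterations
max_iter: 10000
# the equivalent multistep schedule, with explicit (non-uniform) step points, would be:
# lr_policy: "multistep"
# stepvalue: 2500
# stepvalue: 5000
# stepvalue: 8000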

That concludes this walkthrough of the parameter settings used when training Caffe models. My biggest takeaway is that when many training-related settings are unclear, the first thing to do is read the source code, and the second is to approach it with concrete questions: plan and run a real training job, see how it behaves, and think it over; that is how experience accumulates.

You are welcome to read my follow-up posts; the support and encouragement of my readers is my greatest motivation!



written by jiong

So what if I go crazy for my dream just this once
