Training and Testing ImageNet Using Caffe
2016-06-23 16:46
Original URLs:
http://caffe.berkeleyvision.org/gathered/examples/imagenet.html
http://nbviewer.jupyter.org/github/BVLC/caffe/blob/master/examples/00-classification.ipynb
Brewing ImageNet
This guide is meant to get you ready to train your own model on your own data. If you just want an ImageNet-trained network, note that training takes a lot of energy and we hate global warming, so we provide the CaffeNet model, trained as described below, in the model zoo.
Data Preparation
The guide specifies all paths and assumes all commands are executed from the root caffe directory.
By “ImageNet” we here mean the ILSVRC12 challenge, but you can easily train on the whole of ImageNet as well, just with more disk space and a little longer training time.
We assume that you already have downloaded the ImageNet training data and validation data, and they are stored on your disk like:
/path/to/imagenet/train/n01440764/n01440764_10026.JPEG
/path/to/imagenet/val/ILSVRC2012_val_00000001.JPEG
You will first need to prepare some auxiliary data for training. This data can be downloaded by:
./data/ilsvrc12/get_ilsvrc_aux.sh
The training and validation inputs are described in train.txt and val.txt as text files listing all the files and their labels. Note that we use a different indexing for labels than the ILSVRC devkit: we sort the synset names in ASCII order and then label them from 0 to 999. See synset_words.txt for the synset/name mapping.
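The labeling convention above can be sketched in a few lines. This is only an illustration, not the script Caffe ships: `build_label_map` is a hypothetical helper, and the three synset IDs stand in for the full list of 1000.

```python
def build_label_map(synsets):
    """Map each synset ID to an integer label by ASCII sort order."""
    return {name: label for label, name in enumerate(sorted(synsets))}

# A small sample of synset IDs (the real ILSVRC12 list has 1000 entries).
sample = ["n02123045", "n01440764", "n02119022"]
label_map = build_label_map(sample)

# Print the IDs in label order.
for name in sorted(label_map, key=label_map.get):
    print(name, label_map[name])
```

With the full list, the same rule yields labels 0 to 999.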
You may want to resize the images to 256x256 in advance. By default, we do not explicitly do this because in a cluster environment, one may benefit from resizing images in a parallel fashion, using mapreduce. For example, Yangqing used his lightweight mincepie package.
If you prefer things to be simpler, you can also use shell commands, something like:
for name in /path/to/imagenet/val/*.JPEG; do
    convert -resize 256x256\! $name $name
done
Take a look at examples/imagenet/create_imagenet.sh. Set the paths to the train and val dirs as needed, and set “RESIZE=true” to resize all images to 256x256 if you haven’t resized the images in advance. Now simply create the leveldbs with examples/imagenet/create_imagenet.sh. Note that examples/imagenet/ilsvrc12_train_leveldb and examples/imagenet/ilsvrc12_val_leveldb should not exist before this execution. They will be created by the script. GLOG_logtostderr=1 simply dumps more information for you to inspect, and you can safely ignore it.
Compute Image Mean
The model requires us to subtract the image mean from each image, so we have to compute the mean. tools/compute_image_mean.cpp implements that. It is also a good example for familiarizing yourself with how to manipulate the multiple components, such as protocol buffers, leveldbs, and logging, if you are not familiar with them. Anyway, the mean computation can be carried out as:
./examples/imagenet/make_imagenet_mean.sh
which will make data/ilsvrc12/imagenet_mean.binaryproto.
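Conceptually, the mean image is an element-wise average over all training images. The sketch below shows only that arithmetic on toy data; it does not read a leveldb or write a binaryproto, and the helper names `mean_image` and `subtract_mean` are hypothetical.

```python
def mean_image(images):
    """Element-wise mean over a list of equally sized flat pixel lists."""
    count = float(len(images))
    return [sum(pixels) / count for pixels in zip(*images)]

def subtract_mean(image, mean):
    """Center one image by subtracting the mean image from it."""
    return [p - m for p, m in zip(image, mean)]

# Three toy "images" with four pixels each.
images = [[0, 64, 128, 255], [10, 64, 120, 245], [20, 64, 112, 235]]
mean = mean_image(images)
centered = subtract_mean(images[0], mean)
print(mean)      # [10.0, 64.0, 120.0, 245.0]
print(centered)  # [-10.0, 0.0, 8.0, 10.0]
```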
Model Definition
We are going to describe a reference implementation for the approach first proposed by Krizhevsky, Sutskever, and Hinton in their NIPS 2012 paper.
The network definition (models/bvlc_reference_caffenet/train_val.prototxt) follows the one in Krizhevsky et al. Note that if you deviated from file paths suggested in this guide, you’ll need to adjust the relevant paths in the .prototxt files.
If you look carefully at models/bvlc_reference_caffenet/train_val.prototxt, you will notice several include sections specifying either phase: TRAIN or phase: TEST. These sections allow us to define two closely related networks in one file: the network used for training and the network used for testing. These two networks are almost identical, sharing all layers except for those marked with include { phase: TRAIN } or include { phase: TEST }. In this case, only the input layers and one output layer are different.
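To make the include rule concrete, here is a toy model of it in Python. It is not Caffe's own parser: the layer list is heavily abbreviated, the layer names are illustrative, and `layers_for_phase` is a hypothetical helper.

```python
# Abbreviated stand-in for a train/val definition; a layer without a
# "phase" key is shared by both networks.
layers = [
    {"name": "data", "phase": "TRAIN"},
    {"name": "data", "phase": "TEST"},
    {"name": "conv1"},
    {"name": "fc8"},
    {"name": "loss"},
    {"name": "accuracy", "phase": "TEST"},
]

def layers_for_phase(layers, phase):
    """Keep shared layers plus those restricted to the given phase."""
    return [layer for layer in layers if layer.get("phase", phase) == phase]

train_names = [layer["name"] for layer in layers_for_phase(layers, "TRAIN")]
test_names = [layer["name"] for layer in layers_for_phase(layers, "TEST")]
print(train_names)  # ['data', 'conv1', 'fc8', 'loss']
print(test_names)   # ['data', 'conv1', 'fc8', 'loss', 'accuracy']
```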
Input layer differences: The training network’s data input layer draws its data from examples/imagenet/ilsvrc12_train_leveldb and randomly mirrors the input image. The testing network’s data layer takes data from examples/imagenet/ilsvrc12_val_leveldb and does not perform random mirroring.
Output layer differences: Both networks output the softmax_loss layer, which in training is used to compute the loss function and to initialize the backpropagation, while in validation this loss is simply reported. The testing network also has a second output layer, accuracy, which is used to report the accuracy on the test set. In the process of training, the test network will occasionally be instantiated and tested on the test set, producing lines like Test score #0: xxx and Test score #1: xxx. In this case score 0 is the accuracy (which will start around 1/1000 = 0.001 for an untrained network) and score 1 is the loss (which will start around 7 for an untrained network).
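Both starting values follow from a network that guesses uniformly over the 1000 classes: the accuracy is 1/1000, and the softmax loss is the negative log of that probability. A quick check of the arithmetic:

```python
import math

num_classes = 1000
chance_accuracy = 1.0 / num_classes
chance_loss = -math.log(1.0 / num_classes)  # natural log, i.e. ln(1000)

print("accuracy: %.3f" % chance_accuracy)  # accuracy: 0.001
print("loss: %.3f" % chance_loss)          # loss: 6.908
```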
We will also lay out a protocol buffer for running the solver. Let’s make a few plans:
We will run in batches of 256, and run a total of 450,000 iterations (about 90 epochs).
For every 1,000 iterations, we test the learned net on the validation data.
We set the initial learning rate to 0.01, and decrease it every 100,000 iterations (about 20 epochs).
Information will be displayed every 20 iterations.
The network will be trained with momentum 0.9 and a weight decay of 0.0005.
For every 10,000 iterations, we will take a snapshot of the current status.
Sound good? This is implemented in models/bvlc_reference_caffenet/solver.prototxt.
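The epoch figures in the plan can be checked with a little arithmetic. The sketch below makes two assumptions that the guide does not state: the ILSVRC12 training set holds 1,281,167 images, and each decrease multiplies the learning rate by 0.1 (a step-style schedule). The function names are hypothetical.

```python
TRAIN_IMAGES = 1281167  # ILSVRC12 training set size (assumption, not from this guide)
BATCH_SIZE = 256
MAX_ITER = 450000
STEP_SIZE = 100000
BASE_LR = 0.01
GAMMA = 0.1             # assumed decay factor per step

def epochs(iterations):
    """Number of passes over the training set after the given iterations."""
    return iterations * BATCH_SIZE / float(TRAIN_IMAGES)

def learning_rate(iteration):
    """Step schedule: multiply by GAMMA once every STEP_SIZE iterations."""
    return BASE_LR * GAMMA ** (iteration // STEP_SIZE)

print("total epochs: %.1f" % epochs(MAX_ITER))      # about 90
print("epochs per step: %.1f" % epochs(STEP_SIZE))  # about 20
print("lr at iteration 250000: %g" % learning_rate(250000))
```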
Training ImageNet
Ready? Let’s train.
./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt
Sit back and enjoy!
On a K40 machine, every 20 iterations take about 26.5 seconds to run (while on a K20 this takes 36 seconds), so effectively about 5.2 ms per image for the full forward-backward pass. About 2 ms of this is on forward, and the rest is backward. If you are interested in dissecting the computation time, you can run
./build/tools/caffe time --model=models/bvlc_reference_caffenet/train_val.prototxt
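The per-image figure comes from dividing the wall-clock time by the number of images processed in 20 iterations of batch size 256. The sketch below redoes that arithmetic and, as a rough extrapolation that ignores testing and snapshot overhead, estimates the length of the full 450,000-iteration run; `ms_per_image` is a hypothetical helper.

```python
def ms_per_image(seconds, iterations=20, batch_size=256):
    """Milliseconds of forward-backward time per image."""
    return 1000.0 * seconds / (iterations * batch_size)

k40_ms = ms_per_image(26.5)
k20_ms = ms_per_image(36.0)
# Extrapolated training time on a K40, in days.
k40_days = 450000 * (26.5 / 20) / 86400.0

print("K40: %.2f ms per image" % k40_ms)
print("K20: %.2f ms per image" % k20_ms)
print("K40 full run: %.1f days" % k40_days)
```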
Resume Training?
We all experience times when the power goes out, or we feel like rewarding ourselves a little by playing Battlefield (does anyone still remember Quake?). Since we are snapshotting intermediate results during training, we will be able to resume from snapshots. This can be done as easily as:
./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt --snapshot=models/bvlc_reference_caffenet/caffenet_train_iter_10000.solverstate
where caffenet_train_iter_10000.solverstate is the solver state snapshot that stores all the information needed to recover the exact solver state (including the parameters, momentum history, etc.).
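If several snapshots have accumulated, you will usually want the most recent one. The sketch below picks it from a list of file names by parsing the iteration number; it assumes the `_iter_N.solverstate` naming seen above, and both the helper `latest_solverstate` and the extra file names are hypothetical.

```python
import re

def latest_solverstate(filenames):
    """Return the .solverstate name with the highest iteration, or None."""
    pattern = re.compile(r"_iter_(\d+)\.solverstate$")
    best_name, best_iter = None, -1
    for name in filenames:
        match = pattern.search(name)
        if match and int(match.group(1)) > best_iter:
            best_name, best_iter = name, int(match.group(1))
    return best_name

# Illustrative directory listing; only the .solverstate files are candidates.
names = [
    "caffenet_train_iter_10000.solverstate",
    "caffenet_train_iter_10000.caffemodel",
    "caffenet_train_iter_20000.solverstate",
]
print(latest_solverstate(names))  # caffenet_train_iter_20000.solverstate
```

The returned name is what you would pass to the --snapshot option.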
Parting Words
Hope you liked this recipe! Many researchers have gone further since the ILSVRC 2012 challenge, changing the network architecture and/or fine-tuning the various parameters in the network to address new data and tasks. Caffe lets you explore different network choices more easily by simply writing different prototxt files - isn’t that exciting?
And since now you have a trained network, check out how to use it with the Python interface for classifying ImageNet.
Classification: Instant Recognition with Caffe
In this example we'll classify an image with the bundled CaffeNet model (which is based on the network architecture of Krizhevsky et al. for ImageNet). We'll compare CPU and GPU modes and then dig into the model to inspect features and the output.
1. Setup
First, set up Python, numpy, and matplotlib.
In [1]:
# set up Python environment: numpy for numerical routines, and matplotlib for plotting
import numpy as np
import matplotlib.pyplot as plt
# display plots in this notebook
%matplotlib inline

# set display defaults
plt.rcParams['figure.figsize'] = (10, 10)        # large images
plt.rcParams['image.interpolation'] = 'nearest'  # don't interpolate: show square pixels
plt.rcParams['image.cmap'] = 'gray'  # use grayscale output rather than a (potentially misleading) color heatmap
Load caffe.
In [2]:
# The caffe module needs to be on the Python path;
# we'll add it here explicitly.
import sys
caffe_root = '../'  # this file should be run from {caffe_root}/examples (otherwise change this line)
sys.path.insert(0, caffe_root + 'python')

import caffe
# If you get "No module named _caffe", either you have not built pycaffe or you have the wrong path.
If needed, download the reference model ("CaffeNet", a variant of AlexNet).
In [3]:
import os
if os.path.isfile(caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'):
    print 'CaffeNet found.'
else:
    print 'Downloading pre-trained CaffeNet model...'
    !../scripts/download_model_binary.py ../models/bvlc_reference_caffenet
CaffeNet found.
2. Load net and set up input preprocessing
Set Caffe to CPU mode and load the net from disk.
In [4]:
caffe.set_mode_cpu()

model_def = caffe_root + 'models/bvlc_reference_caffenet/deploy.prototxt'
model_weights = caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'

net = caffe.Net(model_def,      # defines the structure of the model
                model_weights,  # contains the trained weights
                caffe.TEST)     # use test mode (e.g., don't perform dropout)
Set up input preprocessing. (We'll use Caffe's caffe.io.Transformer to do this, but this step is independent of other parts of Caffe, so any custom preprocessing code may be used.)
Our default CaffeNet is configured to take images in BGR format. Values are expected to start in the range [0, 255] and then have the mean ImageNet pixel value subtracted from them. In addition, the channel dimension is expected as the first (outermost) dimension.
As matplotlib will load images with values in the range [0, 1] in RGB format with the channel as the innermost dimension, we are arranging for the needed transformations here.
In [5]:
# load the mean ImageNet image (as distributed with Caffe) for subtraction
mu = np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy')
mu = mu.mean(1).mean(1)  # average over pixels to obtain the mean (BGR) pixel values
print 'mean-subtracted values:', zip('BGR', mu)

# create transformer for the input called 'data'
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})

transformer.set_transpose('data', (2,0,1))     # move image channels to outermost dimension
transformer.set_mean('data', mu)               # subtract the dataset-mean value in each channel
transformer.set_raw_scale('data', 255)         # rescale from [0, 1] to [0, 255]
transformer.set_channel_swap('data', (2,1,0))  # swap channels from RGB to BGR
mean-subtracted values: [('B', 104.0069879317889), ('G', 116.66876761696767), ('R', 122.6789143406786)]
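Since any custom preprocessing code may be used, it can help to see the four transformations spelled out without Caffe. The following is a plain-Python sketch on nested lists, not the Transformer implementation: it skips resizing, the function name `preprocess` is hypothetical, and the mean values are rounded from the ones printed above.

```python
def preprocess(image_hwc, mean_bgr, raw_scale=255.0):
    """image_hwc: nested lists [row][col][RGB] with values in [0, 1]."""
    height, width = len(image_hwc), len(image_hwc[0])
    # transpose: make the channel the outermost dimension (still RGB order)
    chw = [[[image_hwc[y][x][c] for x in range(width)] for y in range(height)]
           for c in range(3)]
    # channel swap: RGB -> BGR
    bgr = [chw[2], chw[1], chw[0]]
    # rescale to [0, 255], then subtract the per-channel mean
    return [[[v * raw_scale - mean_bgr[c] for v in row] for row in bgr[c]]
            for c in range(3)]

mean_bgr = [104.0, 117.0, 123.0]  # rounded B, G, R means
pixel_red = [[[1.0, 0.0, 0.0]]]   # a 1x1 image holding one pure-red pixel
out = preprocess(pixel_red, mean_bgr)
print(out)  # [[[-104.0]], [[-117.0]], [[132.0]]]
```

The red value lands in the last channel, as expected for BGR ordering.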
3. CPU classification
Now we're ready to perform classification. Even though we'll only classify one image, we'll set a batch size of 50 to demonstrate batching.
In [6]:
# set the size of the input (we can skip this if we're happy
# with the default; we can also change it later, e.g., for different batch sizes)
net.blobs['data'].reshape(50,        # batch size
                          3,         # 3-channel (BGR) images
                          227, 227)  # image size is 227x227
Load an image (that comes with Caffe) and perform the preprocessing we've set up.
In [7]:
image = caffe.io.load_image(caffe_root + 'examples/images/cat.jpg')
transformed_image = transformer.preprocess('data', image)
plt.imshow(image)
Out[7]:
<matplotlib.image.AxesImage at 0x7f09693a8c90>
Adorable! Let's classify it!
In [8]:
# copy the image data into the memory allocated for the net
net.blobs['data'].data[...] = transformed_image

### perform classification
output = net.forward()

output_prob = output['prob'][0]  # the output probability vector for the first image in the batch

print 'predicted class is:', output_prob.argmax()
predicted class is: 281
The net gives us a vector of probabilities; the most probable class is the one at index 281 (counting from 0). But is that correct? Let's check the ImageNet labels...
In [9]:
# load ImageNet labels
labels_file = caffe_root + 'data/ilsvrc12/synset_words.txt'
if not os.path.exists(labels_file):
    !../data/ilsvrc12/get_ilsvrc_aux.sh

labels = np.loadtxt(labels_file, str, delimiter='\t')

print 'output label:', labels[output_prob.argmax()]
output label: n02123045 tabby, tabby cat
"Tabby cat" is correct! But let's also look at other top (but less confident) predictions.
In [10]:
# sort top five predictions from softmax output
top_inds = output_prob.argsort()[::-1][:5]  # reverse sort and take five largest items

print 'probabilities and labels:'
zip(output_prob[top_inds], labels[top_inds])
probabilities and labels:
Out[10]:
[(0.31243637, 'n02123045 tabby, tabby cat'), (0.2379719, 'n02123159 tiger cat'), (0.12387239, 'n02124075 Egyptian cat'), (0.10075711, 'n02119022 red fox, Vulpes vulpes'), (0.070957087, 'n02127052 lynx, catamount')]
We see that less confident predictions are sensible.
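The top-five selection is an argsort in descending order followed by a slice. The same idea in plain Python, on a toy probability vector, looks like this; `top_k` is a hypothetical helper, not part of Caffe or numpy.

```python
def top_k(probabilities, k=5):
    """Return (index, probability) pairs for the k largest entries."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return [(i, probabilities[i]) for i in order[:k]]

probs = [0.05, 0.60, 0.10, 0.25]  # toy softmax output over four classes
print(top_k(probs, k=2))  # [(1, 0.6), (3, 0.25)]
```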
4. Switching to GPU mode
Let's see how long classification took, and compare it to GPU mode.
In [11]:
%timeit net.forward()
1 loop, best of 3: 1.42 s per loop
That's a while, even for a batch of 50 images. Let's switch to GPU mode.
In [12]:
caffe.set_device(0) # if we have multiple GPUs, pick the first one
caffe.set_mode_gpu()
net.forward() # run once before timing to set up memory
%timeit net.forward()
10 loops, best of 3: 70.2 ms per loop
That should be much faster!
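To put the two timings side by side, we can convert them to a per-image cost and a speedup factor. The numbers below are the ones from this particular run; yours will depend on the hardware.

```python
batch_size = 50
cpu_seconds = 1.42    # per forward pass over the batch, CPU mode
gpu_seconds = 0.0702  # per forward pass over the batch, GPU mode

cpu_ms_per_image = 1000.0 * cpu_seconds / batch_size
gpu_ms_per_image = 1000.0 * gpu_seconds / batch_size
speedup = cpu_seconds / gpu_seconds

print("CPU: %.1f ms per image" % cpu_ms_per_image)
print("GPU: %.2f ms per image" % gpu_ms_per_image)
print("speedup: %.1fx" % speedup)
```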
5. Examining intermediate output
A net is not just a black box; let's take a look at some of the parameters and intermediate activations.
First we'll see how to read out the structure of the net in terms of activation and parameter shapes.
For each layer, let's look at the activation shapes, which typically have the form (batch_size, channel_dim, height, width).
The activations are exposed as an OrderedDict, net.blobs.
In [13]:
# for each layer, show the output shape
for layer_name, blob in net.blobs.iteritems():
    print layer_name + '\t' + str(blob.data.shape)
data    (50, 3, 227, 227)
conv1   (50, 96, 55, 55)
pool1   (50, 96, 27, 27)
norm1   (50, 96, 27, 27)
conv2   (50, 256, 27, 27)
pool2   (50, 256, 13, 13)
norm2   (50, 256, 13, 13)
conv3   (50, 384, 13, 13)
conv4   (50, 384, 13, 13)
conv5   (50, 256, 13, 13)
pool5   (50, 256, 6, 6)
fc6     (50, 4096)
fc7     (50, 4096)
fc8     (50, 1000)
prob    (50, 1000)
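The spatial sizes in this listing follow from the usual output-size formulas for convolution and pooling. The sketch below reproduces them under assumptions that this notebook does not print: conv1 uses stride 4, pooling uses a 3x3 window with stride 2 and rounds up, conv2 pads by 2, and conv3 to conv5 pad by 1. The helper names are hypothetical.

```python
import math

def conv_out(size, kernel, stride=1, pad=0):
    """Output width of a convolution (rounds down)."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride):
    """Output width of a pooling layer (assumed to round up)."""
    return int(math.ceil((size - kernel) / float(stride))) + 1

conv1 = conv_out(227, 11, stride=4)
pool1 = pool_out(conv1, 3, 2)
conv2 = conv_out(pool1, 5, pad=2)
pool2 = pool_out(conv2, 3, 2)
conv5 = conv_out(pool2, 3, pad=1)  # conv3 and conv4 keep the same size
pool5 = pool_out(conv5, 3, 2)
print(conv1, pool1, conv2, pool2, conv5, pool5)  # 55 27 27 13 13 6
```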
Now look at the parameter shapes. The parameters are exposed as another OrderedDict, net.params. We need to index the resulting values with either [0] for weights or [1] for biases.
The param shapes typically have the form (output_channels, input_channels, filter_height, filter_width) (for the weights) and the 1-dimensional shape (output_channels,) (for the biases).
In [14]:
for layer_name, param in net.params.iteritems():
    print layer_name + '\t' + str(param[0].data.shape), str(param[1].data.shape)
conv1   (96, 3, 11, 11) (96,)
conv2   (256, 48, 5, 5) (256,)
conv3   (384, 256, 3, 3) (384,)
conv4   (384, 192, 3, 3) (384,)
conv5   (256, 192, 3, 3) (256,)
fc6     (4096, 9216) (4096,)
fc7     (4096, 4096) (4096,)
fc8     (1000, 4096) (1000,)
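Two things are worth reading off these shapes. The 9216 inputs of fc6 are the flattened pool5 activations (256 x 6 x 6), and multiplying out every shape gives the total parameter count. The 48 and 192 input channels of conv2, conv4, and conv5 are half of the preceding layer's outputs, which is consistent with a two-group convolution; that reading is an assumption, since the notebook does not print the group setting.

```python
from functools import reduce
from operator import mul

# (weight shape, bias shape) per layer, copied from the listing above.
shapes = {
    "conv1": ((96, 3, 11, 11), (96,)),
    "conv2": ((256, 48, 5, 5), (256,)),
    "conv3": ((384, 256, 3, 3), (384,)),
    "conv4": ((384, 192, 3, 3), (384,)),
    "conv5": ((256, 192, 3, 3), (256,)),
    "fc6": ((4096, 9216), (4096,)),
    "fc7": ((4096, 4096), (4096,)),
    "fc8": ((1000, 4096), (1000,)),
}

def count(shape):
    """Number of elements in an array of the given shape."""
    return reduce(mul, shape, 1)

total = sum(count(w) + count(b) for w, b in shapes.values())
print("fc6 inputs:", 256 * 6 * 6)  # fc6 inputs: 9216
print("total parameters:", total)  # total parameters: 60965224
```

Most of the roughly 61 million parameters sit in the fully connected layers, fc6 in particular.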
Since we're dealing with four-dimensional data here, we'll define a helper function for visualizing sets of rectangular heatmaps.
In [15]:
def vis_square(data):
    """Take an array of shape (n, height, width) or (n, height, width, 3)
       and visualize each (height, width) thing in a grid of size approx. sqrt(n) by sqrt(n)"""

    # normalize data for display
    data = (data - data.min()) / (data.max() - data.min())

    # force the number of filters to be square
    n = int(np.ceil(np.sqrt(data.shape[0])))
    padding = (((0, n ** 2 - data.shape[0]),
               (0, 1), (0, 1))                 # add some space between filters
               + ((0, 0),) * (data.ndim - 3))  # don't pad the last dimension (if there is one)
    data = np.pad(data, padding, mode='constant', constant_values=1)  # pad with ones (white)

    # tile the filters into an image
    data = data.reshape((n, n) + data.shape[1:]).transpose((0, 2, 1, 3) + tuple(range(4, data.ndim + 1)))
    data = data.reshape((n * data.shape[1], n * data.shape[3]) + data.shape[4:])

    plt.imshow(data); plt.axis('off')
First we'll look at the first layer filters, conv1.
In [16]:
# the parameters are a list of [weights, biases]
filters = net.params['conv1'][0].data
vis_square(filters.transpose(0, 2, 3, 1))
The first layer output, conv1 (rectified responses of the filters above, first 36 only)
In [17]:
feat = net.blobs['conv1'].data[0, :36]
vis_square(feat)
The fifth layer after pooling, pool5
In [18]:
feat = net.blobs['pool5'].data[0]
vis_square(feat)
The first fully connected layer, fc6 (rectified)
We show the output values and the histogram of the positive values.
In [19]:
feat = net.blobs['fc6'].data[0]
plt.subplot(2, 1, 1)
plt.plot(feat.flat)
plt.subplot(2, 1, 2)
_ = plt.hist(feat.flat[feat.flat > 0], bins=100)
The final probability output, prob
In [20]:
feat = net.blobs['prob'].data[0]
plt.figure(figsize=(15, 3))
plt.plot(feat.flat)
Out[20]:
[<matplotlib.lines.Line2D at 0x7f09587dfb50>]
Note the cluster of strong predictions; the labels are sorted semantically. The top peaks correspond to the top predicted labels, as shown above.
6. Try your own image
Now we'll grab an image from the web and classify it using the steps above.
Try setting my_image_url to any JPEG image URL.
In [ ]:
# download an image
my_image_url = "..."  # paste your URL here
# for example:
# my_image_url = "https://upload.wikimedia.org/wikipedia/commons/b/be/Orang_Utan%2C_Semenggok_Forest_Reserve%2C_Sarawak%2C_Borneo%2C_Malaysia.JPG"
!wget -O image.jpg $my_image_url

# transform it and copy it into the net
image = caffe.io.load_image('image.jpg')
net.blobs['data'].data[...] = transformer.preprocess('data', image)

# perform classification
net.forward()

# obtain the output probabilities
output_prob = net.blobs['prob'].data[0]

# sort top five predictions from softmax output
top_inds = output_prob.argsort()[::-1][:5]

plt.imshow(image)

print 'probabilities and labels:'
zip(output_prob[top_inds], labels[top_inds])