CS231n Assignment 1: SVM classifier
2018-01-10 15:21
I. Theory
1. Score function
A score function maps the raw data to a score for each class. (Note: a CNN also maps raw input pixels to class scores; the mapping in between is just more complex and has many more parameters.)
For the SVM, the score function is a linear mapping: f(x_i, W, b) = W·x_i + b
Here x_i is the i-th image flattened into a vector, W is the weight matrix (you can think of every pixel having its own weight for every class), and b is the bias vector.
Notice that optimization would have to adjust both W and b. To simplify things, b is folded into W to form a new W, so that only a single matrix has to be optimized: append a constant 1 to every x_i and absorb b into W as one extra set of weights.
The new score function is then simply: f(x_i, W) = W·x_i
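A minimal numpy sketch of the bias trick (toy sizes; the variable names are purely illustrative, and the (D, C) weight layout matches the assignment code below):

import numpy as np

# toy setup: D = 4 input features, C = 3 classes
W = np.random.randn(4, 3)            # weights, shape (D, C)
b = np.random.randn(3)               # biases, shape (C,)
x = np.random.randn(4)               # one flattened image, shape (D,)

scores_with_bias = x.dot(W) + b      # original form: W x + b

# bias trick: append a constant 1 to x and stack b as an extra row of W
x_ext = np.hstack([x, 1.0])                 # shape (D+1,)
W_ext = np.vstack([W, b.reshape(1, -1)])    # shape (D+1, C)
scores_trick = x_ext.dot(W_ext)             # identical scores

assert np.allclose(scores_with_bias, scores_trick)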
2. Loss function
The loss function measures how much the model's predictions disagree with the ground truth. It is made of two components: the data loss and the regularization loss.
The data loss for one example is: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
Here L_i is the data loss of the i-th image, s_j is the score the i-th image gets for class j, and s_{y_i} is its score for the correct class. The 1 is really a chosen margin Δ and could be set to another constant; its meaning is that the correct class should not merely score higher than every other class, it should score higher by at least Δ, otherwise data loss is incurred.
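A quick made-up example: suppose an image's scores are s = [13, −7, 11], the correct class is the first one (s_{y_i} = 13), and Δ = 10. Then L_i = max(0, −7 − 13 + 10) + max(0, 11 − 13 + 10) = 0 + 8 = 8: the second class is more than Δ below the correct score and contributes nothing, while the third class is only 2 below it and therefore violates the margin by 8.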
The goal of training is to make the loss as small as possible, i.e. optimization.
Optimization: start with a random W and find a W that minimizes the loss.
However, several different W can produce exactly the same data loss. For example, if some W achieves zero data loss (every margin is satisfied), then any scaled version λW with λ > 1 achieves zero data loss as well.
W is therefore not uniquely determined, and regularization is used to express a preference for one particular W.
Regularization loss (L2 regularization is the common choice): R(W) = Σ_k Σ_l W_{k,l}²
The complete loss function is therefore: L = (1/N) Σ_i L_i + λ·R(W)
Here N is the number of training examples and λ is the regularization strength. λ is determined by cross-validation (so is the learning rate).
3. Gradients
Gradient: the derivative of the loss function with respect to W; its negative direction is the direction of steepest descent. Numerical gradient: computed straight from the definition of the derivative, e.g. the centered difference (f(W + h) − f(W − h)) / (2h) for a small h.
Analytic gradient: take the partial derivative of L_i with respect to W directly.
The numerical gradient is easy to write but slow and approximate; the analytic gradient is fast and exact but error-prone. In practice, derive the analytic gradient, then check your implementation against the numerical gradient.
This is the data loss written out in terms of W: L_i = Σ_{j ≠ y_i} max(0, w_jᵀ·x_i − w_{y_i}ᵀ·x_i + Δ)
Here L_i is the data loss of the i-th image, w_j is the j-th column of W (the weights for class j), and w_{y_i} is the y_i-th column (the weights for the correct class of the i-th image). In this assignment: W: (3073, 10), X: (N, 3073), y: (N,).
Treating W as the variable, we now take the analytic gradient of L_i.
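The result (this is the standard derivation from the CS231n notes, and it is what both implementations below compute):

∇_{w_{y_i}} L_i = −( Σ_{j ≠ y_i} 1(w_jᵀ·x_i − w_{y_i}ᵀ·x_i + Δ > 0) )·x_i

∇_{w_j} L_i = 1(w_jᵀ·x_i − w_{y_i}ᵀ·x_i + Δ > 0)·x_i   for j ≠ y_i

where 1(·) equals 1 when the condition holds and 0 otherwise. In words: every class j that violates the margin adds x_i to column j of the gradient and subtracts x_i from the correct class's column.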
*Note: these two ways of computing the gradient are the key to writing the SVM classifier!
Gradient descent: iterate repeatedly, computing the gradient and updating the weights, until we settle at a low point of the loss function. (The W at that point is the W that minimizes the loss.)
An intuitive picture: say you get up in the middle of the night to pee. Your eyes won't open, you don't turn on the light, and you start firing at the toilet. Your goal is to hit the water in the bowl as quickly as possible (you only have so much ammunition). You start by randomly hitting one side of the bowl wall, then keep lowering the landing point. To do that you need the fastest downhill direction, which is the gradient; you move a certain step (the step_size, or learning rate) along the negative gradient direction and inch your way down to the lowest point of the bowl. At that point the damage to the bathroom is minimal (you have minimized the loss function), you press the flush button, and you can go back to sleep in peace.
4. Iteration
batch_size: number of training examples to use at each iteration. Each iteration does the following:
(1) Take batch_size samples and evaluate the loss & gradient.
(2) Update W: W += -learning_rate * gradient
(3) With the updated W, repeat (1) and (2) until the iterations are finished.
Mini-batch Gradient Descent: in practice, only a small portion of the training set is used to compute the gradient of the loss function.
Downside: the resulting gradient is quite likely to be noisy.
Upside: computing on a small sample is fast, so you can either evaluate the gradient many more times and push the loss lower, or run fewer iterations with a more accurate gradient. Besides, computing the gradient on all of the data at every step is simply not feasible in practice (the GPU cannot handle it). A bare-bones sketch of this loop is given below.
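A minimal sketch (the full version is the train() method in part II; X_train, y_train, W, num_train, batch_size, num_iters, learning_rate and reg are assumed to already exist, and svm_loss_vectorized is defined further down):

for it in range(num_iters):
    # (1) sample a mini-batch of batch_size examples
    idx = np.random.choice(num_train, batch_size)
    X_batch, y_batch = X_train[idx], y_train[idx]
    # evaluate loss and gradient on the mini-batch only
    loss, grad = svm_loss_vectorized(W, X_batch, y_batch, reg)
    # (2) step along the negative gradient
    W += -learning_rate * grad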
learning_rate: how far W moves along the negative gradient at each iteration (see the iteration steps above). The learning rate determines how quickly the parameters move toward the optimum. Its value is chosen by cross-validation (as is the regularization strength λ).
If very high: the loss does not converge and keeps growing.
If high: the loss converges, but it may overshoot the optimum and get stuck at a fairly high value, i.e. it never finds the lowest loss. Because the updates are large, you cannot tell whether the function has dropped into a local minimum; you may take that local minimum for the global minimum when it actually is not.
If low: updates are slow and convergence takes a long time.
if good: then good.
Strategy: start with a fairly high learning rate, then lower it bit by bit.
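One simple way to act on that strategy is a step decay of the learning rate. The schedule below is made up purely for illustration and is not part of the assignment code:

learning_rate = 1e-6           # start relatively high (illustrative value)
num_iters = 1500
for it in range(num_iters):
    # ... one mini-batch update per iteration, as in the sketch above ...
    if it > 0 and it % 500 == 0:
        learning_rate *= 0.5   # made-up step decay: halve the rate every 500 iterations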
II. SVM classifier
1. The data
50000 images, 32x32x3, 10 classes. The training set has 49000 images and the test set has 1000. A validation set of 1000 images and a development set of 500 images are split off from the training set.
X_train: (49000,3072) , y_train: (49000,)
X_val: (1000,3072) , y_val: (1000,)
X_test: (1000,3072) , y_test: (1000,)
X_dev: (500,3072) , y_dev: (500,)
*In the actual computation every 3072 above becomes 3073. The reason is not explained here, unless you skipped the theory section above (it is the bias trick).
W: (3073,10)
dW: (3073,10)
2. Preprocessing: mean subtraction
This kind of processing is actually very common, e.g. reducing to the centroid in photogrammetry. First: compute the image mean based on the training data.
Second: subtract the mean image from the train and test data.
Third: append a bias dimension of ones (i.e. the bias trick) so that our SVM only has to worry about optimizing a single weight matrix W. A sketch of these three steps follows.
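In numpy the three steps look roughly like this (this mirrors what the assignment notebook does; array names match the shapes listed above):

mean_image = np.mean(X_train, axis=0)    # (3072,) mean over the 49000 training images
X_train = X_train - mean_image           # subtract the mean image from every split
X_val   = X_val   - mean_image
X_test  = X_test  - mean_image
X_dev   = X_dev   - mean_image

# bias trick: append a column of ones, 3072 -> 3073
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_val   = np.hstack([X_val,   np.ones((X_val.shape[0], 1))])
X_test  = np.hstack([X_test,  np.ones((X_test.shape[0], 1))])
X_dev   = np.hstack([X_dev,   np.ones((X_dev.shape[0], 1))])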
3. Computing the loss and the gradient
The gradient here is the analytic gradient. Version one, the naive implementation (with loops):
def svm_loss_naive(W, X, y, reg):
  """
  Structured SVM loss function, naive implementation (with loops).

  Inputs have dimension D, there are C classes, and we operate on minibatches
  of N examples.

  Inputs:
  - W: A numpy array of shape (D, C) containing weights.
  - X: A numpy array of shape (N, D) containing a minibatch of data.
  - y: A numpy array of shape (N,) containing training labels; y[i] = c means
    that X[i] has label c, where 0 <= c < C.
  - reg: (float) regularization strength

  Returns a tuple of:
  - loss as single float
  - gradient with respect to weights W; an array of same shape as W
  """
  dW = np.zeros(W.shape)  # initialize the gradient as zero, here (3073, 10)

  # compute the loss and the gradient
  num_classes = W.shape[1]  # 10 classes
  num_train = X.shape[0]    # e.g. 500 development samples
  loss = 0.0
  for i in xrange(num_train):
    scores = X[i].dot(W)                # (1, 10)
    correct_class_score = scores[y[i]]  # score of sample i for its correct class y[i]
    for j in xrange(num_classes):
      if j == y[i]:
        continue  # skip the correct class and move on to the next j
      margin = scores[j] - correct_class_score + 1  # note delta = 1
      if margin > 0:
        loss += margin          # accumulate the data loss
        dW[:, j] += X[i].T      # dW is (3073, 10); column j gets X[i] added at most once per sample, over num_train outer iterations
        dW[:, y[i]] += -X[i].T  # the correct-class column gets -X[i] up to 9 times per sample, fewer when some margin <= 0

  # Right now the loss is a sum over all training examples, but we want it
  # to be an average instead so we divide by num_train.
  loss /= num_train  # average so the loss does not grow with the number of samples
  dW /= num_train

  # Add regularization to the loss.
  loss += reg * np.sum(W * W)  # L2 regularization
  dW += reg * W

  #############################################################################
  # TODO:                                                                     #
  # Compute the gradient of the loss function and store it in dW.            #
  # Rather than first computing the loss and then computing the derivative,  #
  # it may be simpler to compute the derivative at the same time that the    #
  # loss is being computed. As a result you may need to modify some of the   #
  # code above to compute the gradient.                                      #
  #############################################################################
  return loss, dW
Version two, the vectorized implementation:
def svm_loss_vectorized(W, X, y, reg):
  """
  Structured SVM loss function, vectorized implementation.

  Inputs and outputs are the same as svm_loss_naive.
  """
  loss = 0.0
  dW = np.zeros(W.shape)  # initialize the gradient as zero
  num_train = X.shape[0]

  #############################################################################
  # Vectorized version of the structured SVM loss                             #
  #############################################################################
  scores = X.dot(W)  # (N, 10)
  correct = []
  for i in xrange(num_train):
    correct.append(scores[i, y[i]])  # or simply: correct = scores[np.arange(num_train), y]
  correct = np.array(correct).reshape(-1, 1)    # (N, 1), broadcasts against scores
  margin = np.maximum(0, scores - correct + 1)  # note delta = 1
  # every j == y[i] entry contributes an extra max(0, 1) = 1, so subtract those N terms back out;
  # alternatively, zero them first: margin[np.arange(num_train), y] = 0
  loss = np.sum(margin) - num_train
  loss = loss / num_train
  loss += reg * np.sum(W * W)  # L2 regularization

  #############################################################################
  # Vectorized version of the gradient                                        #
  #############################################################################
  coeff = np.zeros(margin.shape)  # (N, 10)
  coeff[margin > 0] = 1  # every violating class j should add X[i] to column j of dW; margins <= 0 do not enter the loss, so they stay 0
  coeff[np.arange(num_train), y] = 0  # the loss only sums over j != y[i], so zero the correct-class entries
  coeff[np.arange(num_train), y] = -np.sum(coeff, axis=1)  # shape (N,): the correct-class column gets -X[i] up to 9 times, fewer when some margins are <= 0
  dW = (X.T).dot(coeff)
  dW = dW / num_train + reg * W

  return loss, dW
4. Gradient check
Use the numerical gradient to check the analytic gradient computed above:

from random import randrange  # needed by grad_check_sparse

def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5):
  """
  sample a few random elements and only return the numerical gradient
  in these dimensions.
  """
  '''
  @author: mckee 2018-1-8
  f: a lambda that returns the loss
  x: W, shape (3073, 10)
  analytic_grad: the analytic gradient (grad)
  '''
  for i in xrange(num_checks):
    ix = tuple([randrange(m) for m in x.shape])  # a random index into x; x.shape = (3073, 10)
    print('ix:', ix)
    oldval = x[ix]
    x[ix] = oldval + h  # increment by h
    fxph = f(x)         # evaluate f(x + h)
    x[ix] = oldval - h  # decrement by h
    fxmh = f(x)         # evaluate f(x - h)
    x[ix] = oldval      # reset
    grad_numerical = (fxph - fxmh) / (2 * h)
    grad_analytic = analytic_grad[ix]
    rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic))
    print('numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error))
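In the notebook this is typically invoked roughly as follows (a sketch: the lambda re-evaluates the loss at the perturbed W, with regularization turned off):

# compute the analytic gradient once
loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.0)
# f returns only the loss part of svm_loss_naive
f = lambda w: svm_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_check_sparse(f, W, grad)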
5. Stochastic Gradient Descent
Update the weights and record the loss:

def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100,
          batch_size=200, verbose=False):
  """
  Train this linear classifier using stochastic gradient descent.

  Inputs:
  - X: A numpy array of shape (N, D) containing training data; there are N
    training samples each of dimension D.
  - y: A numpy array of shape (N,) containing training labels; y[i] = c
    means that X[i] has label 0 <= c < C for C classes.
  - learning_rate: (float) learning rate for optimization.
  - reg: (float) regularization strength.
  - num_iters: (integer) number of steps to take when optimizing
  - batch_size: (integer) number of training examples to use at each step.
  - verbose: (boolean) If true, print progress during optimization.

  Outputs:
  A list containing the value of the loss function at each training iteration.
  """
  num_train, dim = X.shape
  num_classes = np.max(y) + 1  # assume y takes values 0...K-1 where K is number of classes
  if self.W is None:
    # lazily initialize W
    print('initialize W')
    self.W = 0.001 * np.random.randn(dim, num_classes)

  # Run stochastic gradient descent to optimize W
  loss_history = []
  for it in xrange(num_iters):
    X_batch = None
    y_batch = None

    #########################################################################
    # Sample batch_size elements from the training data and their           #
    # corresponding labels to use in this round of gradient descent.        #
    # After sampling, X_batch has shape (batch_size, dim) and y_batch has   #
    # shape (batch_size,).                                                  #
    # Hint: np.random.choice; sampling with replacement is faster than      #
    # sampling without replacement.                                         #
    #########################################################################
    mask = np.random.choice(num_train, batch_size, replace=False)
    X_batch = X[mask]
    y_batch = y[mask]

    # evaluate loss and gradient on the sampled batch
    loss, grad = self.loss(X_batch, y_batch, reg)
    loss_history.append(loss)

    # perform parameter update: step along the negative gradient
    self.W += -learning_rate * grad

    if it == 0:
      print('initial loss:', loss)
    if verbose and it % 100 == 0:
      print('iteration %d / %d: loss %f' % (it, num_iters, loss))
    if it == num_iters - 1:
      print('iteration %d / %d: loss %f' % (it, num_iters, loss))

  return loss_history
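From the notebook this is called roughly like this (LinearSVM is the assignment's subclass whose loss() calls svm_loss_vectorized; the hyperparameters below are just the starting values the notebook suggests):

svm = LinearSVM()
loss_hist = svm.train(X_train, y_train, learning_rate=1e-7, reg=2.5e4,
                      num_iters=1500, verbose=True)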
6. Evaluating accuracy
def predict(self, X):
  """
  Use the trained weights of this linear classifier to predict labels for
  data points.

  Inputs:
  - X: A numpy array of shape (N, D) containing training data; there are N
    training samples each of dimension D.

  Returns:
  - y_pred: Predicted labels for the data in X. y_pred is a 1-dimensional
    array of length N, and each element is an integer giving the predicted
    class.
  """
  y_pred = np.zeros(X.shape[0])
  score = X.dot(self.W)               # (N, C) class scores
  y_pred = np.argmax(score, axis=1)   # predicted label = index of the highest-scoring class
  return y_pred
Once the predicted labels are in hand, compute the accuracy:
y_train_pred = svm.predict(X_train)
print('training accuracy: %f' % (np.mean(y_train == y_train_pred), ))  # this one-liner is wonderfully concise
y_val_pred = svm.predict(X_val)
print('validation accuracy: %f' % (np.mean(y_val == y_val_pred), ))
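Since both the learning rate and λ are picked by cross-validation (see the theory section), the notebook finishes with a small grid search over them. A simplified sketch, with candidate values that are only illustrative:

learning_rates = [1e-7, 5e-5]
regularization_strengths = [2.5e4, 5e4]
best_val = -1
best_svm = None
for lr in learning_rates:
    for reg in regularization_strengths:
        svm = LinearSVM()
        svm.train(X_train, y_train, learning_rate=lr, reg=reg, num_iters=1500)
        train_acc = np.mean(y_train == svm.predict(X_train))
        val_acc = np.mean(y_val == svm.predict(X_val))
        if val_acc > best_val:  # keep whichever model does best on the validation set
            best_val = val_acc
            best_svm = svm
        print('lr %e reg %e train accuracy: %f val accuracy: %f' % (lr, reg, train_acc, val_acc))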
III. Important numpy functions involved
1. np.mean(matrix, axis): computes the mean; the second argument (the axis) is optional.
2. np.random.randn(a, b)
Draws samples from the standard normal distribution; with one argument it returns that many samples, with two arguments it returns an a x b matrix.
3. np.maximum(a, b)
Element-wise maximum of a and b.
4. np.stack(), np.vstack(), np.hstack()
Stacking / concatenating arrays and matrices.
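A few quick examples of how these are used above (expected outputs in the comments; sizes are toy):

import numpy as np

a = np.array([[1., 2.], [3., 4.]])
print(np.mean(a))          # 2.5, mean over all entries
print(np.mean(a, axis=0))  # [2. 3.], column-wise mean, as used for mean_image

w = 0.001 * np.random.randn(3073, 10)  # small random weights, as in train()

print(np.maximum(0, np.array([-1.5, 0.3])))  # [0.  0.3], element-wise max, the hinge in the loss

ones = np.ones((2, 1))
print(np.hstack([a, ones]))  # appends a column of ones to a, i.e. the bias trick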