CS231n Assignment 1: SVM classifier
2018-01-10 15:21
I. Theory
1. Score function
A score function maps the raw data to a score for each class. (Note: a CNN also maps raw input pixels to class scores; the mapping in between is just more complex and has many more parameters.)
For the SVM, the score function is a linear mapping: f(x_i, W, b) = W·x_i + b
Here x_i is the i-th image flattened into a vector, W is the weight matrix (you can think of every pixel having its own weight for every class), and b is the bias vector.
Notice that optimization would have to adjust both W and b. To simplify things, b is folded into W to form a new W, so that only a single matrix has to be optimized: append a constant 1 to every x_i and absorb b into W as one extra set of weights.
The new score function is then simply: f(x_i, W) = W·x_i
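A minimal numpy sketch of the bias trick (toy sizes; the variable names are purely illustrative, and the (D, C) weight layout matches the assignment code below):

import numpy as np

# toy setup: D = 4 input features, C = 3 classes
W = np.random.randn(4, 3)            # weights, shape (D, C)
b = np.random.randn(3)               # biases, shape (C,)
x = np.random.randn(4)               # one flattened image, shape (D,)

scores_with_bias = x.dot(W) + b      # original form: W x + b

# bias trick: append a constant 1 to x and stack b as an extra row of W
x_ext = np.hstack([x, 1.0])                 # shape (D+1,)
W_ext = np.vstack([W, b.reshape(1, -1)])    # shape (D+1, C)
scores_trick = x_ext.dot(W_ext)             # identical scores

assert np.allclose(scores_with_bias, scores_trick)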
2. Loss function
The loss function measures how much the model's predictions disagree with the ground truth. It is made of two components: the data loss and the regularization loss.
The data loss for one example is: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
Here L_i is the data loss of the i-th image, s_j is the score the i-th image gets for class j, and s_{y_i} is its score for the correct class. The 1 is really a chosen margin Δ and could be set to another constant; its meaning is that the correct class should not merely score higher than every other class, it should score higher by at least Δ, otherwise data loss is incurred.
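A quick made-up example: suppose an image's scores are s = [13, −7, 11], the correct class is the first one (s_{y_i} = 13), and Δ = 10. Then L_i = max(0, −7 − 13 + 10) + max(0, 11 − 13 + 10) = 0 + 8 = 8: the second class is more than Δ below the correct score and contributes nothing, while the third class is only 2 below it and therefore violates the margin by 8.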
The goal of training is to make the loss as small as possible, i.e. optimization.
Optimization: start with a random W and find a W that minimizes the loss.
However, several different W can produce exactly the same data loss. For example, if some W achieves zero data loss (every margin is satisfied), then any scaled version λW with λ > 1 achieves zero data loss as well.
W is therefore not uniquely determined, and regularization is used to express a preference for one particular W.
Regularization loss (L2 regularization is the common choice): R(W) = Σ_k Σ_l W_{k,l}²
The complete loss function is therefore: L = (1/N) Σ_i L_i + λ·R(W)
Here N is the number of training examples and λ is the regularization strength. λ is determined by cross-validation (so is the learning rate).
3. Gradients
Gradient: the derivative of the loss function with respect to W; its negative direction is the direction of steepest descent. Numerical gradient: computed straight from the definition of the derivative, e.g. the centered difference (f(W + h) − f(W − h)) / (2h) for a small h.
Analytic gradient: take the partial derivative of L_i with respect to W directly.
The numerical gradient is easy to write but slow and approximate; the analytic gradient is fast and exact but error-prone. In practice, derive the analytic gradient, then check your implementation against the numerical gradient.
This is the data loss written out in terms of W: L_i = Σ_{j ≠ y_i} max(0, w_jᵀ·x_i − w_{y_i}ᵀ·x_i + Δ)
Here L_i is the data loss of the i-th image, w_j is the j-th column of W (the weights for class j), and w_{y_i} is the y_i-th column (the weights for the correct class of the i-th image). In this assignment: W: (3073, 10), X: (N, 3073), y: (N,).
Treating W as the variable, we now take the analytic gradient of L_i.
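The result (this is the standard derivation from the CS231n notes, and it is what both implementations below compute):

∇_{w_{y_i}} L_i = −( Σ_{j ≠ y_i} 1(w_jᵀ·x_i − w_{y_i}ᵀ·x_i + Δ > 0) )·x_i

∇_{w_j} L_i = 1(w_jᵀ·x_i − w_{y_i}ᵀ·x_i + Δ > 0)·x_i   for j ≠ y_i

where 1(·) equals 1 when the condition holds and 0 otherwise. In words: every class j that violates the margin adds x_i to column j of the gradient and subtracts x_i from the correct class's column.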
*Note: these two ways of computing the gradient are the key to writing the SVM classifier!
Gradient descent: iterate repeatedly, computing the gradient and updating the weights, until we settle at a low point of the loss function. (The W at that point is the W that minimizes the loss.)
An intuitive picture: say you get up in the middle of the night to pee. Your eyes won't open, you don't turn on the light, and you start firing at the toilet. Your goal is to hit the water in the bowl as quickly as possible (you only have so much ammunition). You start by randomly hitting one side of the bowl wall, then keep lowering the landing point. To do that you need the fastest downhill direction, which is the gradient; you move a certain step (the step_size, or learning rate) along the negative gradient direction and inch your way down to the lowest point of the bowl. At that point the damage to the bathroom is minimal (you have minimized the loss function), you press the flush button, and you can go back to sleep in peace.
4. Iteration
batch_size: number of training examples to use at each iteration. Each iteration does the following:
(1) Take batch_size samples and evaluate the loss & gradient.
(2) Update W: W += -learning_rate * gradient
(3) With the updated W, repeat (1) and (2) until the iterations are finished.
Mini-batch Gradient Descent: in practice, only a small portion of the training set is used to compute the gradient of the loss function.
Downside: the resulting gradient is quite likely to be noisy.
Upside: computing on a small sample is fast, so you can either evaluate the gradient many more times and push the loss lower, or run fewer iterations with a more accurate gradient. Besides, computing the gradient on all of the data at every step is simply not feasible in practice (the GPU cannot handle it). A bare-bones sketch of this loop is given below.
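A minimal sketch (the full version is the train() method in part II; X_train, y_train, W, num_train, batch_size, num_iters, learning_rate and reg are assumed to already exist, and svm_loss_vectorized is defined further down):

for it in range(num_iters):
    # (1) sample a mini-batch of batch_size examples
    idx = np.random.choice(num_train, batch_size)
    X_batch, y_batch = X_train[idx], y_train[idx]
    # evaluate loss and gradient on the mini-batch only
    loss, grad = svm_loss_vectorized(W, X_batch, y_batch, reg)
    # (2) step along the negative gradient
    W += -learning_rate * grad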
learning_rate: how far W moves along the negative gradient at each iteration (see the iteration steps above). The learning rate determines how quickly the parameters move toward the optimum. Its value is chosen by cross-validation (as is the regularization strength λ).
If very high: the loss does not converge and keeps growing.
If high: the loss converges, but it may overshoot the optimum and get stuck at a fairly high value, i.e. it never finds the lowest loss. Because the updates are large, you cannot tell whether the function has dropped into a local minimum; you may take that local minimum for the global minimum when it actually is not.
If low: updates are slow and convergence takes a long time.
if good: then good.
Strategy: start with a fairly high learning rate, then lower it bit by bit.
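One simple way to act on that strategy is a step decay of the learning rate. The schedule below is made up purely for illustration and is not part of the assignment code:

learning_rate = 1e-6           # start relatively high (illustrative value)
num_iters = 1500
for it in range(num_iters):
    # ... one mini-batch update per iteration, as in the sketch above ...
    if it > 0 and it % 500 == 0:
        learning_rate *= 0.5   # made-up step decay: halve the rate every 500 iterations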
II. SVM classifier
1. The data
50000 images, 32x32x3, 10 classes. The training set has 49000 images and the test set has 1000. A validation set of 1000 images and a development set of 500 images are split off from the training set.
X_train: (49000,3072) , y_train: (49000,)
X_val: (1000,3072) , y_val: (1000,)
X_test: (1000,3072) , y_test: (1000,)
X_dev: (500,3072) , y_dev: (500,)
*In the actual computation every 3072 above becomes 3073. The reason is not explained here, unless you skipped the theory section above (it is the bias trick).
W: (3073,10)
dW: (3073,10)
2. Preprocessing: mean subtraction
This kind of processing is actually very common, e.g. reducing to the centroid in photogrammetry. First: compute the image mean based on the training data.
Second: subtract the mean image from the train and test data.
Third: append a bias dimension of ones (i.e. the bias trick) so that our SVM only has to worry about optimizing a single weight matrix W. A sketch of these three steps follows.
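In numpy the three steps look roughly like this (this mirrors what the assignment notebook does; array names match the shapes listed above):

mean_image = np.mean(X_train, axis=0)    # (3072,) mean over the 49000 training images
X_train = X_train - mean_image           # subtract the mean image from every split
X_val   = X_val   - mean_image
X_test  = X_test  - mean_image
X_dev   = X_dev   - mean_image

# bias trick: append a column of ones, 3072 -> 3073
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_val   = np.hstack([X_val,   np.ones((X_val.shape[0], 1))])
X_test  = np.hstack([X_test,  np.ones((X_test.shape[0], 1))])
X_dev   = np.hstack([X_dev,   np.ones((X_dev.shape[0], 1))])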
3. Computing the loss and the gradient
The gradient here is the analytic gradient. Version one, the naive implementation (with loops):
def svm_loss_naive(W, X, y, reg):
  """
  Structured SVM loss function, naive implementation (with loops).

  Inputs have dimension D, there are C classes, and we operate on minibatches
  of N examples.

  Inputs:
  - W: A numpy array of shape (D, C) containing weights.
  - X: A numpy array of shape (N, D) containing a minibatch of data.
  - y: A numpy array of shape (N,) containing training labels; y[i] = c means
    that X[i] has label c, where 0 <= c < C.
  - reg: (float) regularization strength

  Returns a tuple of:
  - loss as single float
  - gradient with respect to weights W; an array of same shape as W
  """
  dW = np.zeros(W.shape)  # initialize the gradient as zero, here (3073, 10)

  # compute the loss and the gradient
  num_classes = W.shape[1]  # 10 classes
  num_train = X.shape[0]    # e.g. 500 development samples
  loss = 0.0
  for i in xrange(num_train):
    scores = X[i].dot(W)                # (1, 10)
    correct_class_score = scores[y[i]]  # score of sample i for its correct class y[i]
    for j in xrange(num_classes):
      if j == y[i]:
        continue  # skip the correct class and move on to the next j
      margin = scores[j] - correct_class_score + 1  # note delta = 1
      if margin > 0:
        loss += margin          # accumulate the data loss
        dW[:, j] += X[i].T      # dW is (3073, 10); column j gets X[i] added at most once per sample, over num_train outer iterations
        dW[:, y[i]] += -X[i].T  # the correct-class column gets -X[i] up to 9 times per sample, fewer when some margin <= 0

  # Right now the loss is a sum over all training examples, but we want it
  # to be an average instead so we divide by num_train.
  loss /= num_train  # average so the loss does not grow with the number of samples
  dW /= num_train

  # Add regularization to the loss.
  loss += reg * np.sum(W * W)  # L2 regularization
  dW += reg * W

  #############################################################################
  # TODO:                                                                     #
  # Compute the gradient of the loss function and store it in dW.            #
  # Rather than first computing the loss and then computing the derivative,  #
  # it may be simpler to compute the derivative at the same time that the    #
  # loss is being computed. As a result you may need to modify some of the   #
  # code above to compute the gradient.                                      #
  #############################################################################
  return loss, dW
Version two, the vectorized implementation:
def svm_loss_vectorized(W, X, y, reg):
  """
  Structured SVM loss function, vectorized implementation.

  Inputs and outputs are the same as svm_loss_naive.
  """
  loss = 0.0
  dW = np.zeros(W.shape)  # initialize the gradient as zero
  num_train = X.shape[0]

  #############################################################################
  # Vectorized version of the structured SVM loss                             #
  #############################################################################
  scores = X.dot(W)  # (N, 10)
  correct = []
  for i in xrange(num_train):
    correct.append(scores[i, y[i]])  # or simply: correct = scores[np.arange(num_train), y]
  correct = np.array(correct).reshape(-1, 1)    # (N, 1), broadcasts against scores
  margin = np.maximum(0, scores - correct + 1)  # note delta = 1
  # every j == y[i] entry contributes an extra max(0, 1) = 1, so subtract those N terms back out;
  # alternatively, zero them first: margin[np.arange(num_train), y] = 0
  loss = np.sum(margin) - num_train
  loss = loss / num_train
  loss += reg * np.sum(W * W)  # L2 regularization

  #############################################################################
  # Vectorized version of the gradient                                        #
  #############################################################################
  coeff = np.zeros(margin.shape)  # (N, 10)
  coeff[margin > 0] = 1  # every violating class j should add X[i] to column j of dW; margins <= 0 do not enter the loss, so they stay 0
  coeff[np.arange(num_train), y] = 0  # the loss only sums over j != y[i], so zero the correct-class entries
  coeff[np.arange(num_train), y] = -np.sum(coeff, axis=1)  # shape (N,): the correct-class column gets -X[i] up to 9 times, fewer when some margins are <= 0
  dW = (X.T).dot(coeff)
  dW = dW / num_train + reg * W

  return loss, dW
4. Gradient check
Use the numerical gradient to check the analytic gradient computed above:

from random import randrange  # needed by grad_check_sparse

def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5):
  """
  sample a few random elements and only return the numerical gradient
  in these dimensions.
  """
  '''
  @author: mckee 2018-1-8
  f: a lambda that returns the loss
  x: W, shape (3073, 10)
  analytic_grad: the analytic gradient (grad)
  '''
  for i in xrange(num_checks):
    ix = tuple([randrange(m) for m in x.shape])  # a random index into x; x.shape = (3073, 10)
    print('ix:', ix)
    oldval = x[ix]
    x[ix] = oldval + h  # increment by h
    fxph = f(x)         # evaluate f(x + h)
    x[ix] = oldval - h  # decrement by h
    fxmh = f(x)         # evaluate f(x - h)
    x[ix] = oldval      # reset
    grad_numerical = (fxph - fxmh) / (2 * h)
    grad_analytic = analytic_grad[ix]
    rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic))
    print('numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error))
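In the notebook this is typically invoked roughly as follows (a sketch: the lambda re-evaluates the loss at the perturbed W, with regularization turned off):

# compute the analytic gradient once
loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.0)
# f returns only the loss part of svm_loss_naive
f = lambda w: svm_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_check_sparse(f, W, grad)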
5. Stochastic Gradient Descent
Update the weights and record the loss:

def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100,
          batch_size=200, verbose=False):
  """
  Train this linear classifier using stochastic gradient descent.

  Inputs:
  - X: A numpy array of shape (N, D) containing training data; there are N
    training samples each of dimension D.
  - y: A numpy array of shape (N,) containing training labels; y[i] = c
    means that X[i] has label 0 <= c < C for C classes.
  - learning_rate: (float) learning rate for optimization.
  - reg: (float) regularization strength.
  - num_iters: (integer) number of steps to take when optimizing
  - batch_size: (integer) number of training examples to use at each step.
  - verbose: (boolean) If true, print progress during optimization.

  Outputs:
  A list containing the value of the loss function at each training iteration.
  """
  num_train, dim = X.shape
  num_classes = np.max(y) + 1  # assume y takes values 0...K-1 where K is number of classes
  if self.W is None:
    # lazily initialize W
    print('initialize W')
    self.W = 0.001 * np.random.randn(dim, num_classes)

  # Run stochastic gradient descent to optimize W
  loss_history = []
  for it in xrange(num_iters):
    X_batch = None
    y_batch = None

    #########################################################################
    # Sample batch_size elements from the training data and their           #
    # corresponding labels to use in this round of gradient descent.        #
    # After sampling, X_batch has shape (batch_size, dim) and y_batch has   #
    # shape (batch_size,).                                                  #
    # Hint: np.random.choice; sampling with replacement is faster than      #
    # sampling without replacement.                                         #
    #########################################################################
    mask = np.random.choice(num_train, batch_size, replace=False)
    X_batch = X[mask]
    y_batch = y[mask]

    # evaluate loss and gradient on the sampled batch
    loss, grad = self.loss(X_batch, y_batch, reg)
    loss_history.append(loss)

    # perform parameter update: step along the negative gradient
    self.W += -learning_rate * grad

    if it == 0:
      print('initial loss:', loss)
    if verbose and it % 100 == 0:
      print('iteration %d / %d: loss %f' % (it, num_iters, loss))
    if it == num_iters - 1:
      print('iteration %d / %d: loss %f' % (it, num_iters, loss))

  return loss_history
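From the notebook this is called roughly like this (LinearSVM is the assignment's subclass whose loss() calls svm_loss_vectorized; the hyperparameters below are just the starting values the notebook suggests):

svm = LinearSVM()
loss_hist = svm.train(X_train, y_train, learning_rate=1e-7, reg=2.5e4,
                      num_iters=1500, verbose=True)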
6. Evaluating accuracy
def predict(self, X):
  """
  Use the trained weights of this linear classifier to predict labels for
  data points.

  Inputs:
  - X: A numpy array of shape (N, D) containing training data; there are N
    training samples each of dimension D.

  Returns:
  - y_pred: Predicted labels for the data in X. y_pred is a 1-dimensional
    array of length N, and each element is an integer giving the predicted
    class.
  """
  y_pred = np.zeros(X.shape[0])
  score = X.dot(self.W)               # (N, C) class scores
  y_pred = np.argmax(score, axis=1)   # predicted label = index of the highest-scoring class
  return y_pred
Once the predicted labels are in hand, compute the accuracy:
y_train_pred = svm.predict(X_train)
print('training accuracy: %f' % (np.mean(y_train == y_train_pred), ))  # this one-liner is wonderfully concise
y_val_pred = svm.predict(X_val)
print('validation accuracy: %f' % (np.mean(y_val == y_val_pred), ))
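Since both the learning rate and λ are picked by cross-validation (see the theory section), the notebook finishes with a small grid search over them. A simplified sketch, with candidate values that are only illustrative:

learning_rates = [1e-7, 5e-5]
regularization_strengths = [2.5e4, 5e4]
best_val = -1
best_svm = None
for lr in learning_rates:
    for reg in regularization_strengths:
        svm = LinearSVM()
        svm.train(X_train, y_train, learning_rate=lr, reg=reg, num_iters=1500)
        train_acc = np.mean(y_train == svm.predict(X_train))
        val_acc = np.mean(y_val == svm.predict(X_val))
        if val_acc > best_val:  # keep whichever model does best on the validation set
            best_val = val_acc
            best_svm = svm
        print('lr %e reg %e train accuracy: %f val accuracy: %f' % (lr, reg, train_acc, val_acc))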
III. Important numpy functions involved
1. np.mean(matrix, axis): computes the mean; the second argument (the axis) is optional.
2. np.random.randn(a, b)
Draws samples from the standard normal distribution; with one argument it returns that many samples, with two arguments it returns an a x b matrix.
3. np.maximum(a, b)
Element-wise maximum of a and b.
4. np.stack(), np.vstack(), np.hstack()
Stacking / concatenating arrays and matrices.
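A few quick examples of how these are used above (expected outputs in the comments; sizes are toy):

import numpy as np

a = np.array([[1., 2.], [3., 4.]])
print(np.mean(a))          # 2.5, mean over all entries
print(np.mean(a, axis=0))  # [2. 3.], column-wise mean, as used for mean_image

w = 0.001 * np.random.randn(3073, 10)  # small random weights, as in train()

print(np.maximum(0, np.array([-1.5, 0.3])))  # [0.  0.3], element-wise max, the hinge in the loss

ones = np.ones((2, 1))
print(np.hstack([a, ones]))  # appends a column of ones to a, i.e. the bias trick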