Machine Learning (4): Logistic Regression
2014-09-11 23:17
1. Algorithm Derivation
Unlike the gradient descent methods covered earlier, which solved regression problems, logistic regression addresses classification. In regression, the variable y we try to predict is continuous; in classification, y is discrete, for example restricted to {0, 1}. For a set of samples like the one shown in the figure, fitting them with linear regression gives a poor match. When y can only take the values {0, 1}, a classification method should be used instead.
Assume the hypothesis has the form

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},$$

where the logistic function (also known as the sigmoid function) is defined as

$$g(z) = \frac{1}{1 + e^{-z}}.$$
The figure below plots the logistic function g(z): g(z) tends to 1 as z grows large, tends to 0 as z becomes very negative, and g(0) = 0.5, so g(z) is confined to the interval (0, 1). Any other function g(z) bounded between 0 and 1 could be used as well, but as later chapters will explain, the sigmoid used here is the most common choice.
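As a quick numerical check of these properties (a minimal sketch; the sigmoid helper mirrors the one defined in the code of Section 3):

from numpy import exp

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1 + exp(-z))

print(sigmoid(-10))  # ~0.0000454: approaches 0 for very negative z
print(sigmoid(0))    # exactly 0.5 at z = 0
print(sigmoid(10))   # ~0.9999546: approaches 1 for large z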
Assume the probabilities of y = 1 and y = 0 given x, with θ as parameter, are:

$$P(y = 1 \mid x; \theta) = h_\theta(x)$$
$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

These can be written compactly as:

$$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y} \left(1 - h_\theta(x)\right)^{1-y}$$
Assuming the m training samples are independent, the likelihood of θ can be written as:

$$L(\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$
Taking the logarithm of L(θ) gives the log-likelihood to maximize:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right)$$
To maximize the likelihood, we proceed as in linear regression with gradient descent and take the partial derivative of the log-likelihood with respect to θ_j; for a single training example this gives:

$$\frac{\partial}{\partial \theta_j} \ell(\theta) = \left(y - h_\theta(x)\right) x_j$$
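For readers who want the intermediate steps (not spelled out in the original post), here is the derivation for a single training example, using the sigmoid identity g'(z) = g(z)(1 − g(z)):

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j}\ell(\theta)
&= \left(\frac{y}{g(\theta^T x)} - \frac{1-y}{1 - g(\theta^T x)}\right)\frac{\partial}{\partial \theta_j}\, g(\theta^T x) \\
&= \left(\frac{y}{g(\theta^T x)} - \frac{1-y}{1 - g(\theta^T x)}\right) g(\theta^T x)\left(1 - g(\theta^T x)\right) x_j \\
&= \left(y\left(1 - g(\theta^T x)\right) - (1-y)\, g(\theta^T x)\right) x_j \\
&= \left(y - h_\theta(x)\right) x_j
\end{aligned}
$$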
Note: the earlier gradient descent algorithm used the update θ := θ − α∇_θ J(θ). Here we use gradient ascent,

$$\theta := \theta + \alpha \nabla_\theta \ell(\theta),$$

meaning the change in θ between two successive iterations (or between two successive samples) is proportional to the derivative of ℓ(θ); the sign flips because we are maximizing ℓ(θ) rather than minimizing a cost.
This yields the update rule

$$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta\left(x^{(i)}\right)\right) x_j^{(i)},$$

which is just like the stochastic gradient ascent algorithm from the previous lecture. Formally it is identical to the linear regression update up to the sign convention, except that here h_θ(x) is the logistic function of θᵀx; in substance, however, it is a different learning algorithm from linear regression.
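To make the update concrete, here is a single stochastic gradient ascent step in NumPy (a minimal sketch; theta, x_i, y_i, and alpha are illustrative names, and the sample is the first row of the data set in Section 3):

import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

theta = np.zeros(3)                            # parameters, intercept included
x_i = np.array([1.0, -0.017612, 14.053064])    # one sample, with bias term 1.0
y_i = 0.0                                      # its label
alpha = 0.01                                   # step size

h = sigmoid(np.dot(theta, x_i))                # h_theta(x), here 0.5
theta = theta + alpha * (y_i - h) * x_i        # theta_j += alpha * (y - h) * x_j
print(theta)                                   # [-0.005, 0.0000881, -0.0702653]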
2. An Improved Logistic Regression Algorithm
The main criteria for judging an optimization algorithm are whether it converges, that is, whether the parameters reach stable values or keep changing, and how fast that convergence is. The figure above showed how the three regression coefficients evolve over 200 iterations of stochastic gradient ascent (please read Sections 3 and 4 first and then come back here; our data set has 100 two-dimensional samples, and the coefficients are adjusted once per sample, so there are 200 × 100 = 20,000 adjustments in total). Coefficient X2 reaches a stable value after about 50 iterations, but X1 and X0 only stabilize after about 100, and, annoyingly, X1 and X2 keep fluctuating periodically even at large iteration counts. The cause is a handful of samples that can never be classified correctly: the data set is not linearly separable, while logistic regression is a linear classification model, which is powerless in the non-separable case. The optimization routine does not recognize these abnormal points and treats them like every other sample, adjusting the coefficients to reduce their classification error, and this triggers sharp coefficient swings on every iteration. What we want is an algorithm that avoids this back-and-forth oscillation and quickly stabilizes and converges.
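The figure itself is not reproduced here, but the coefficient traces it shows can be regenerated with a short script: run plain stochastic gradient ascent, record the weights after every per-sample update, and plot each coefficient against the update count. This is a minimal sketch that assumes the trains.txt data file from Section 3 is in the current directory; names are illustrative:

from numpy import mat, shape, ones, exp
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1 + exp(-z))

# load the data as in Section 3: each line is "x1 x2 label"
samples, labels = [], []
for line in open('trains.txt'):
    parts = line.strip().split()
    samples.append([1.0, float(parts[0]), float(parts[1])])
    labels.append(float(parts[2]))
train_x, train_y = mat(samples), mat(labels).transpose()

numSamples, numFeatures = shape(train_x)
alpha = 0.01
weights = ones((numFeatures, 1))
history = []

for k in range(200):                    # 200 passes over the data set
    for i in range(numSamples):         # one coefficient update per sample
        error = train_y[i, 0] - sigmoid(train_x[i, :] * weights)
        weights = weights + alpha * train_x[i, :].transpose() * error
        history.append([weights[j, 0] for j in range(numFeatures)])

for j in range(numFeatures):            # one trace per coefficient X0, X1, X2
    plt.plot([h[j] for h in history], label='X%d' % j)
plt.xlabel('update'); plt.ylabel('coefficient value')
plt.legend(); plt.show()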
To avoid the oscillation described above, we make two changes to the stochastic gradient ascent algorithm:
1) Adjust the step size alpha on every iteration. As the iterations proceed, alpha gets smaller and smaller, which damps the high-frequency fluctuation of the coefficients (that is, coefficients changing too much, jumping too far, on a single iteration). Of course, to keep alpha from shrinking toward 0 as the iterations continue (at which point the coefficients would barely be adjusted and further iterations would be pointless), we constrain alpha to stay above a small constant; see the code in Section 3 and the short numeric sketch after this list.
2) Change the order in which samples are visited on each iteration, i.e. select samples at random when updating the regression coefficients. This reduces the periodic fluctuation, because changing the sample order keeps successive iterations from forming the same cycle.
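As a quick illustration of change 1), the step-size schedule used in the Section 3 code is alpha = 4.0 / (1.0 + k + i) + 0.01, where k is the pass number and i the sample counter: it decays as the iterations proceed but never falls below the 0.01 floor. A few values (with i = 0):

for k in (0, 1, 10, 100):
    print('k=%3d: alpha=%.4f' % (k, 4.0 / (1.0 + k) + 0.01))
# k=  0: alpha=4.0100
# k=  1: alpha=2.0100
# k= 10: alpha=0.3736
# k=100: alpha=0.0496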
Pseudocode for the improved stochastic gradient ascent algorithm:
################################################
Initialize the regression coefficients to 1
Repeat until convergence {
    For each sample in a random traversal of the data set:
        Decrease alpha as the iterations proceed
        Compute the gradient for this sample
        Update the regression coefficients by alpha × gradient
}
Return the regression coefficients
################################################
Comparing the original stochastic gradient ascent with the improved version reveals two differences:
1) The coefficients no longer oscillate periodically. 2) The coefficients stabilize much sooner, i.e. the algorithm converges quickly: here it converges after only 20 iterations, whereas the plain stochastic version above needed 200 iterations to stabilize.
3. Python Implementation
#!/usr/bin/python
# coding=utf-8
# Filename: LogisticRegression.py
'''
Created on 2014-09-13
@author: Ryan C. F.
'''

from numpy import *
import matplotlib.pyplot as plt
import time

def sigmoid(inX):
    '''the sigmoid function'''
    return 1.0 / (1 + exp(-inX))

def trainLogRegres(train_x, train_y, opts):
    '''
    train a logistic regression model using some optional optimize algorithm
    input:  train_x is a mat datatype, each row stands for one sample
            train_y is mat datatype too, each row is the corresponding label
            opts is optimize option include step and maximum number of iterations
    '''
    # calculate training time
    startTime = time.time()

    numSamples, numFeatures = shape(train_x)
    alpha = opts['alpha']; maxIter = opts['maxIter']
    weights = ones((numFeatures, 1))

    # optimize through gradient ascent algorithm
    for k in range(maxIter):
        if opts['optimizeType'] == 'gradAscent':  # batch gradient ascent
            output = sigmoid(train_x * weights)
            error = train_y - output
            weights = weights + alpha * train_x.transpose() * error
        elif opts['optimizeType'] == 'stocGradAscent':  # stochastic gradient ascent
            for i in range(numSamples):
                output = sigmoid(train_x[i, :] * weights)
                error = train_y[i, 0] - output
                weights = weights + alpha * train_x[i, :].transpose() * error
        elif opts['optimizeType'] == 'smoothStocGradAscent':  # smooth stochastic gradient ascent
            # randomly select samples to optimize for reducing cycle fluctuations
            dataIndex = range(numSamples)
            for i in range(numSamples):
                alpha = 4.0 / (1.0 + k + i) + 0.01  # decaying step size with a 0.01 floor
                randIndex = int(random.uniform(0, len(dataIndex)))
                sampleIndex = dataIndex[randIndex]  # map into the not-yet-visited samples
                output = sigmoid(train_x[sampleIndex, :] * weights)
                error = train_y[sampleIndex, 0] - output
                weights = weights + alpha * train_x[sampleIndex, :].transpose() * error
                del(dataIndex[randIndex])  # during one iteration, delete the optimized sample
        else:
            raise NameError('Not support optimize method type!')

    print 'Congratulations, training complete! Took %fs!' % (time.time() - startTime)
    return weights

# test your trained Logistic Regression model given test set
def testLogRegres(weights, test_x, test_y):
    numSamples, numFeatures = shape(test_x)
    matchCount = 0
    for i in xrange(numSamples):
        predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
        if predict == bool(test_y[i, 0]):
            matchCount += 1
    accuracy = float(matchCount) / numSamples
    return accuracy

# show your trained logistic regression model, only available for 2-D data
def showLogRegres(weights, train_x, train_y):
    # notice: train_x and train_y is mat datatype
    numSamples, numFeatures = shape(train_x)
    if numFeatures != 3:
        print "Sorry! I can not draw because the dimension of your data is not 2!"
        return 1

    # draw all samples
    for i in xrange(numSamples):
        if int(train_y[i, 0]) == 0:
            plt.plot(train_x[i, 1], train_x[i, 2], 'or')
        elif int(train_y[i, 0]) == 1:
            plt.plot(train_x[i, 1], train_x[i, 2], 'ob')

    # draw the classify line
    min_x = min(train_x[:, 1])[0, 0]
    max_x = max(train_x[:, 1])[0, 0]
    weights = weights.getA()  # convert mat to array
    y_min_x = float(-weights[0] - weights[1] * min_x) / weights[2]
    y_max_x = float(-weights[0] - weights[1] * max_x) / weights[2]
    plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
#!/usr/bin/python
# coding=utf-8
# Filename: testLogisticRegression.py
'''
Created on 2014-09-13
@author: Ryan C. F.
'''

from numpy import *
import matplotlib.pyplot as plt
import time
from LogisticRegression import *

def loadData():
    train_x = []
    train_y = []
    fileIn = open('/Users/rcf/workspace/java/workspace/MachineLinearing/src/supervisedLearning/trains.txt')
    for line in fileIn.readlines():
        lineArr = line.strip().split()
        train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
        train_y.append(float(lineArr[2]))
    return mat(train_x), mat(train_y).transpose()

## step 1: load data
print "step 1: load data..."
train_x, train_y = loadData()
test_x = train_x; test_y = train_y
print train_x
print train_y

## step 2: training...
print "step 2: training..."
opts = {'alpha': 0.001, 'maxIter': 100, 'optimizeType': 'smoothStocGradAscent'}
optimalWeights = trainLogRegres(train_x, train_y, opts)

## step 3: testing
print "step 3: testing..."
accuracy = testLogRegres(optimalWeights, test_x, test_y)

## step 4: show the result
print "step 4: show the result..."
print 'The classify accuracy is: %.3f%%' % (accuracy * 100)
showLogRegres(optimalWeights, train_x, train_y)
Contents of the training file trains.txt (one sample per line: x1 x2 label):

-0.017612 14.053064 0
-1.395634 4.662541 1
-0.752157 6.538620 0
-1.322371 7.152853 0
0.423363 11.054677 0
0.406704 7.067335 1
0.667394 12.741452 0
-2.460150 6.866805 1
0.569411 9.548755 0
-0.026632 10.427743 0
0.850433 6.920334 1
1.347183 13.175500 0
1.176813 3.167020 1
-1.781871 9.097953 0
-0.566606 5.749003 1
0.931635 1.589505 1
-0.024205 6.151823 1
-0.036453 2.690988 1
-0.196949 0.444165 1
1.014459 5.754399 1
1.985298 3.230619 1
-1.693453 -0.557540 1
-0.576525 11.778922 0
-0.346811 -1.678730 1
-2.124484 2.672471 1
1.217916 9.597015 0
-0.733928 9.098687 0
-3.642001 -1.618087 1
0.315985 3.523953 1
1.416614 9.619232 0
-0.386323 3.989286 1
0.556921 8.294984 1
1.224863 11.587360 0
-1.347803 -2.406051 1
1.196604 4.951851 1
0.275221 9.543647 0
0.470575 9.332488 0
-1.889567 9.542662 0
-1.527893 12.150579 0
-1.185247 11.309318 0
-0.445678 3.297303 1
1.042222 6.105155 1
-0.618787 10.320986 0
1.152083 0.548467 1
0.828534 2.676045 1
-1.237728 10.549033 0
-0.683565 -2.166125 1
0.229456 5.921938 1
-0.959885 11.555336 0
0.492911 10.993324 0
0.184992 8.721488 0
-0.355715 10.325976 0
-0.397822 8.058397 0
0.824839 13.730343 0
1.507278 5.027866 1
0.099671 6.835839 1
-0.344008 10.717485 0
1.785928 7.718645 1
-0.918801 11.560217 0
-0.364009 4.747300 1
-0.841722 4.119083 1
0.490426 1.960539 1
-0.007194 9.075792 0
0.356107 12.447863 0
0.342578 12.281162 0
-0.810823 -1.466018 1
2.530777 6.476801 1
1.296683 11.607559 0
0.475487 12.040035 0
-0.783277 11.009725 0
0.074798 11.023650 0
-1.337472 0.468339 1
-0.102781 13.763651 0
-0.147324 2.874846 1
0.518389 9.887035 0
1.015399 7.571882 0
-1.658086 -0.027255 1
1.319944 2.171228 1
2.056216 5.019981 1
-0.851633 4.375691 1
-1.510047 6.061992 0
-1.076637 -3.181888 1
1.821096 10.283990 0
3.010150 8.401766 1
-1.099458 1.688274 1
-0.834872 -1.733869 1
-0.846637 3.849075 1
1.400102 12.628781 0
1.752842 5.468166 1
0.078557 0.059736 1
0.089392 -0.715300 1
1.825662 12.693808 0
0.197445 9.744638 0
0.126117 0.922311 1
-0.679797 1.220530 1
0.677983 2.556666 1
0.761349 10.693862 0
-2.168791 0.143632 1
1.388610 9.341997 0
0.317029 14.739025 0
Finally, let us look at the training results:
(a) Batch gradient ascent, 100 iterations: accuracy 90%
(b) Stochastic gradient ascent, 100 iterations: accuracy 90%
(c) Improved stochastic gradient ascent, 100 iterations: accuracy 93%
(d) Batch gradient ascent, 1000 iterations: accuracy 97%
(e) Stochastic gradient ascent, 1000 iterations: accuracy 97%
(f) Improved stochastic gradient ascent, 1000 iterations: accuracy 95%
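The comparison above can be reproduced with a short driver script (a sketch: it assumes LogisticRegression.py from Section 3 is importable and trains.txt is in the current directory; the alpha value follows the test script above):

from numpy import mat
from LogisticRegression import trainLogRegres, testLogRegres

# load the data as in Section 3
samples, labels = [], []
for line in open('trains.txt'):
    parts = line.strip().split()
    samples.append([1.0, float(parts[0]), float(parts[1])])
    labels.append(float(parts[2]))
train_x, train_y = mat(samples), mat(labels).transpose()

for maxIter in (100, 1000):
    for optType in ('gradAscent', 'stocGradAscent', 'smoothStocGradAscent'):
        opts = {'alpha': 0.001, 'maxIter': maxIter, 'optimizeType': optType}
        weights = trainLogRegres(train_x, train_y, opts)
        accuracy = testLogRegres(weights, train_x, train_y)
        print('%s, %4d iterations: accuracy %.1f%%' % (optType, maxIter, accuracy * 100))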
4. Differences Between Logistic Regression and Linear Regression
For details see http://blog.csdn.net/viewcode/article/details/8794401; a summary will follow once generalized linear regression has been covered.