
Machine Learning in Action, Chapter 5: Logistic Regression (2) - Predicting Horse Fatality from Colic

2017-03-23 00:10

1 The LR Model and Algorithm

For the LR model and the underlying algorithm, see the previous post.

2 Problem Setting

Note: colic is a term for gastrointestinal disease in horses.

We have a dataset of 368 samples, most of them labeled, with 28 features per sample. We train a model from this data so that, given the features of a new case, we can predict whether the horse will die. (The preprocessed data files used by the code below keep 21 of those features.)

3 Preparing the Data

Data collection is already done; next we prepare the data, which mainly means handling data types, missing values, and anomalies.

For missing feature values, common options are:

1. Replace with the feature's mean;

2. Use a special value, such as -1;

3. Drop the sample;

4. Use the mean of similar samples;

5. Predict the missing value with another machine learning method.
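As a quick illustration of option 1, mean imputation can be done in a few lines of NumPy. The helper below and its toy data are hypothetical, not part of the book's code:

```python
import numpy as np

def impute_mean(X):
    """Replace NaN entries in each column with that column's mean (hypothetical helper)."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)            # per-feature mean, ignoring NaNs
    nan_rows, nan_cols = np.where(np.isnan(X))   # positions of the missing entries
    X[nan_rows, nan_cols] = col_means[nan_cols]  # fill each hole with its column mean
    return X

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
print(impute_mean(X))  # NaNs become 2.0 (col 0 mean) and 3.0 (col 1 mean)
```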

For this dataset specifically:

1. Missing values must be numeric, so we replace them with 0.

Advantages:

1) 0 does not change the weight update, since the update term is multiplied by the feature value;

2) sigmoid(0) = 0.5, so a 0 feature does not bias the prediction either way;

3) since features in this dataset are generally nonzero, 0 can serve as a special marker value.

If a sample's label is missing, the sample is discarded entirely.
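A tiny sketch (toy numbers, not from the dataset) of why an imputed 0 leaves the update untouched: the stochastic-gradient step adds alpha * error * x to the weights, so a zero feature contributes nothing.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

weights = np.array([0.5, -1.2, 0.8])
x = np.array([2.0, 0.0, 1.0])   # feature 1 is "missing", imputed as 0
label = 1.0
alpha = 0.01

h = sigmoid(np.dot(x, weights))
new_weights = weights + alpha * (label - h) * x

print(sigmoid(0))                      # 0.5 -- no bias either way
print(new_weights[1] == weights[1])    # True -- weight for the zero feature is unchanged
```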

4 Code

from numpy import *

# sigmoid function
def sigmoid(inX):
    return 1.0/(1 + exp(-inX))

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)   # initialize weights to all ones
    for j in range(numIter):
        dataIndex = range(m)
        for i in range(m):
            alpha = 4/(1.0+j+i) + 0.0001    # step size decreases, fast at first then slowly; the constant keeps it from reaching 0
            randIndex = int(random.uniform(0, len(dataIndex)))  # pick a random sample to reduce periodic oscillation
            h = sigmoid(sum(dataMatrix[randIndex]*weights))
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            del(dataIndex[randIndex])
    return weights
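To see the "fast at first, then slow" behaviour of the step size, a quick check of the same formula used in stocGradAscent1:

```python
def alpha(j, i):
    """Step-size schedule from stocGradAscent1: decays with iteration j and sample i."""
    return 4/(1.0 + j + i) + 0.0001

print(alpha(0, 0))     # 4.0001 at the very first update
print(alpha(10, 50))   # much smaller later on
print(alpha(1000, 0))  # never drops below the 0.0001 floor
```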

# classify with a 0.5 threshold
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5: return 1.0
    else: return 0.0

# train and test
def colicTest():
    frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():  # build the training set
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)  # fit the weights with improved stochastic gradient ascent
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():  # predict each test sample, compare with its label, and count errors
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = (float(errorCount)/numTestVec)  # error rate: errors divided by total test samples
    print "the error rate of this test is: %f" % errorRate
    return errorRate

# run several trials and average the error rate
def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print "after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests))

def main1():
    multiTest()

if __name__ == '__main__':
    main1()


5 Run Results

/Users/tl/.pyenv/versions/2.7.13ML/bin/python /Users/tl/Works/MLiA/machinelearninginaction/Ch05/horse.py
/Users/tl/Works/MLiA/machinelearninginaction/Ch05/horse.py:4: RuntimeWarning: overflow encountered in exp
return 1.0/(1+exp(-inX))
the error rate of this test is: 0.298507
the error rate of this test is: 0.373134
the error rate of this test is: 0.298507
the error rate of this test is: 0.388060
the error rate of this test is: 0.313433
the error rate of this test is: 0.358209
the error rate of this test is: 0.328358
the error rate of this test is: 0.358209
the error rate of this test is: 0.283582
the error rate of this test is: 0.343284
after 10 iterations the average error rate is: 0.334328


Note: the warning

RuntimeWarning: overflow encountered in exp

comes from the sigmoid function: when inX is a large negative number, exp(-inX) overflows a 64-bit float.
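One common way to silence this warning is a numerically stable sigmoid that never exponentiates a large positive number. This is an alternative implementation, not the book's code:

```python
import numpy as np

def sigmoid_stable(x):
    """Numerically stable sigmoid: picks the branch that avoids overflow in exp."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # -x <= 0 here, so exp cannot overflow
    exp_x = np.exp(x[~pos])                    # x < 0 here, so exp cannot overflow
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

x = np.array([-1000.0, 0.0, 1000.0])
print(sigmoid_stable(x))   # [0. 0.5 1.] with no overflow warning
```

Mathematically both branches compute the same function; the second uses the identity sigmoid(x) = exp(x)/(1 + exp(x)).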

6 References

Machine Learning in Action, Peter Harrington