
[Python in Action] k-Nearest Neighbors (2)

Continuing from part 1, we improve the scatter plot of the dating data by replacing the plain two-argument call with one that passes per-point sizes and colors:

#ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))

[For scatter usage, see this reference blog post: http://blog.csdn.net/anneqiqi/article/details/64125186]

After switching to this call, points from different classes are drawn with different colors and sizes.
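For context, here is a minimal, self-contained version of the plotting step. This is a sketch that assumes KNN.py provides file2matrix from part 1 of this series and that datingTestSet2.txt is in the working directory:

from numpy import array
import matplotlib.pyplot as plt
import KNN

datingDataMat, datingLabels = KNN.file2matrix('datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
# size and color both scale with the class label (1, 2, or 3),
# so each class gets its own marker size and colormap color
ax.scatter(datingDataMat[:,1], datingDataMat[:,2],
           15.0*array(datingLabels), 15.0*array(datingLabels))
plt.show()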

Next, the normalization function. Each feature is rescaled to [0, 1] via newValue = (oldValue - min) / (max - min), so that large-valued features such as frequent flier miles do not dominate the distance calculation:

from numpy import *   # zeros, shape, and tile come from numpy

def autoNorm(dataSet):
    minVals = dataSet.min(0)   # column-wise minimum: min(0) reduces along rows, i.e. per feature
    maxVals = dataSet.max(0)   # column-wise maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))   # zero matrix with the same (rows, cols) shape as dataSet
    m = dataSet.shape[0]                  # number of rows (samples)
    normDataSet = dataSet - tile(minVals, (m,1))     # tile repeats minVals m times, one copy per row
    normDataSet = normDataSet/tile(ranges,(m,1))     # element-wise division, not matrix division
    return normDataSet, ranges, minVals
>>> datingDataMat, datingLabels = KNN.file2matrix('datingTestSet2.txt')
>>> normMat, ranges, minVals = KNN.autoNorm(datingDataMat)
>>> normMat
array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       ...,
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])
>>> ranges
array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])
>>> minVals
array([ 0.      ,  0.      ,  0.001156])
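The test functions below all call classify0, which was written in part 1 of this series. For readers starting here, this is a sketch of that function as the book defines it; it assumes from numpy import * and import operator at the top of KNN.py:

import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet   # difference from inX to every training row
    distances = ((diffMat**2).sum(axis=1))**0.5      # Euclidean distance to each row
    sortedDistIndicies = distances.argsort()         # indices sorted nearest-first
    classCount = {}
    for i in range(k):                               # let the k nearest neighbors vote
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]                    # the most common label among the k wins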

# 2.2.4 Testing the classifier
def datingClassTest():
    hoRatio = 0.10    # fraction of the data held out for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify each held-out vector against the remaining 90% of the data
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:], datingLabels[numTestVecs:m], 3)
        print 'the classifier came back with: %d, the real answer is: %d' % (classifierResult, datingLabels[i])
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print "the total error rate is: %f" % (errorCount/float(numTestVecs))
>>> import KNN
>>> KNN.datingClassTest()
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
......
the classifier came back with: 1, the real answer is: 1
the total error rate is: 0.080000

#2.2.5 Enter a person's details and get a prediction of how much you will like them
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']   # indices 0-2 map labels 1-3 to likability levels
    percentTats = float(raw_input("percentage of time spent playing video games?"))
    ffMiles = float(raw_input("frequent flier miles earned per year?"))
    iceCream = float(raw_input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # training data
    normMat, ranges, minVals = autoNorm(datingDataMat)                # normalized training matrix
    inArr = array([ffMiles, percentTats, iceCream])                   # the person to classify
    # normalize the input with the training set's ranges before classifying
    classifierResult = classify0((inArr - minVals)/ranges, normMat, datingLabels, 3)
    print 'You will probably like this person: ', resultList[classifierResult - 1]

From the shell:

>>> import KNN
>>> KNN.classifyPerson()
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person:  in small doses
2.3 A handwriting recognition system

Using kNN on image data stored as binary text files.

2.3.1 Preparing the data

# Convert a 32x32 binary image matrix into a 1x1024 vector
def img2vector(filename):
    returnVect = zeros((1,1024))        # one row of 1024 entries
    fr = open(filename)
    for i in range(32):                 # each of the 32 lines in the file
        lineStr = fr.readline()
        for j in range(32):             # each of the 32 characters in the line
            returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect

Test:

>>> import KNN
>>> testvector = KNN.img2vector('digits/testDigits/0_13.txt')
>>> testvector[0,0:31]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])
>>> testvector[0,32:63]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])

2.3.2 Testing the algorithm

from os import listdir   # needed to enumerate the digit files

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')        # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]       # take off .txt
        classNumStr = int(fileStr.split('_')[0])  # the true digit is encoded in the file name, e.g. 9_45.txt
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('digits/trainingDigits/%s' % fileNameStr)
    testFileList = listdir('digits/testDigits')   # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]       # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)
        if classifierResult != classNumStr: errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount/float(mTest))

Pros:

k-nearest neighbors is one of the simplest and most effective algorithms for classifying data.

Cons:

1. The algorithm is slow to execute, since every prediction compares the input against the entire training set (a timing sketch follows this list).
2. It must keep the whole training set around, about 2 MB of storage for this example. [In short: it is expensive in both time and space.]
3. It gives no insight into the underlying structure of the data, so it cannot tell us what an average or typical example of each class looks like.
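To see the time cost concretely: each call to classify0 computes the distance from the input to all m training vectors, so the handwriting test performs roughly mTest × m distance computations in 1024 dimensions. A quick timing sketch, assuming the digits data is in place:

import time
import KNN

start = time.time()
KNN.handwritingClassTest()
print "elapsed: %.1f seconds" % (time.time() - start)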

In the next chapter we will use probabilistic measures to handle classification; that algorithm addresses this last problem.