
[Python in Action] k-Nearest Neighbors (2)

Continuing from part 1, we improve the scatter plot of the dating data by replacing the plain two-argument call with one that passes per-point sizes and colors:

#ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))

[For scatter usage, see this reference blog post: http://blog.csdn.net/anneqiqi/article/details/64125186]

After switching to this call, points from different classes are drawn with different colors and sizes.
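For context, here is a minimal, self-contained version of the plotting step. This is a sketch that assumes KNN.py provides file2matrix from part 1 of this series and that datingTestSet2.txt is in the working directory:

from numpy import array
import matplotlib.pyplot as plt
import KNN

datingDataMat, datingLabels = KNN.file2matrix('datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
# size and color both scale with the class label (1, 2, or 3),
# so each class gets its own marker size and colormap color
ax.scatter(datingDataMat[:,1], datingDataMat[:,2],
           15.0*array(datingLabels), 15.0*array(datingLabels))
plt.show()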

Next, the normalization function. Each feature is rescaled to [0, 1] via newValue = (oldValue - min) / (max - min), so that large-valued features such as frequent flier miles do not dominate the distance calculation:

from numpy import *   # zeros, shape, and tile come from numpy

def autoNorm(dataSet):
    minVals = dataSet.min(0)   # column-wise minimum: min(0) reduces along rows, i.e. per feature
    maxVals = dataSet.max(0)   # column-wise maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))   # zero matrix with the same (rows, cols) shape as dataSet
    m = dataSet.shape[0]                  # number of rows (samples)
    normDataSet = dataSet - tile(minVals, (m,1))     # tile repeats minVals m times, one copy per row
    normDataSet = normDataSet/tile(ranges,(m,1))     # element-wise division, not matrix division
    return normDataSet, ranges, minVals
>>> datingDataMat, datingLabels = KNN.file2matrix('datingTestSet2.txt')
>>> normMat, ranges, minVals = KNN.autoNorm(datingDataMat)
>>> normMat
array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       ...,
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])
>>> ranges
array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])
>>> minVals
array([ 0.      ,  0.      ,  0.001156])
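The test functions below all call classify0, which was written in part 1 of this series. For readers starting here, this is a sketch of that function as the book defines it; it assumes from numpy import * and import operator at the top of KNN.py:

import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet   # difference from inX to every training row
    distances = ((diffMat**2).sum(axis=1))**0.5      # Euclidean distance to each row
    sortedDistIndicies = distances.argsort()         # indices sorted nearest-first
    classCount = {}
    for i in range(k):                               # let the k nearest neighbors vote
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]                    # the most common label among the k wins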

# 2.2.4 Testing the classifier
def datingClassTest():
    hoRatio = 0.10    # fraction of the data held out for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify each held-out vector against the remaining 90% of the data
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:], datingLabels[numTestVecs:m], 3)
        print 'the classifier came back with: %d, the real answer is: %d' % (classifierResult, datingLabels[i])
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print "the total error rate is: %f" % (errorCount/float(numTestVecs))
>>> import KNN
>>> KNN.datingClassTest()
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
......
the classifier came back with: 1, the real answer is: 1
the total error rate is: 0.080000

#2.2.5 Enter a person's details and get a prediction of how much you will like them
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']   # indices 0-2 map labels 1-3 to likability levels
    percentTats = float(raw_input("percentage of time spent playing video games?"))
    ffMiles = float(raw_input("frequent flier miles earned per year?"))
    iceCream = float(raw_input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # training data
    normMat, ranges, minVals = autoNorm(datingDataMat)                # normalized training matrix
    inArr = array([ffMiles, percentTats, iceCream])                   # the person to classify
    # normalize the input with the training set's ranges before classifying
    classifierResult = classify0((inArr - minVals)/ranges, normMat, datingLabels, 3)
    print 'You will probably like this person: ', resultList[classifierResult - 1]

From the shell:

>>> import KNN
>>> KNN.classifyPerson()
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person:  in small doses
2.3 A handwriting recognition system

Using kNN on image data stored as binary text files.

2.3.1 Preparing the data

# Convert a 32x32 binary image matrix into a 1x1024 vector
def img2vector(filename):
    returnVect = zeros((1,1024))        # one row of 1024 entries
    fr = open(filename)
    for i in range(32):                 # each of the 32 lines in the file
        lineStr = fr.readline()
        for j in range(32):             # each of the 32 characters in the line
            returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect

Test:

>>> import KNN
>>> testvector = KNN.img2vector('digits/testDigits/0_13.txt')
>>> testvector[0,0:31]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])
>>> testvector[0,32:63]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])

2.3.2 Testing the algorithm

from os import listdir   # needed to enumerate the digit files

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')        # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]       # take off .txt
        classNumStr = int(fileStr.split('_')[0])  # the true digit is encoded in the file name, e.g. 9_45.txt
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('digits/trainingDigits/%s' % fileNameStr)
    testFileList = listdir('digits/testDigits')   # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]       # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)
        if classifierResult != classNumStr: errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount/float(mTest))

Pros:

k-nearest neighbors is one of the simplest and most effective algorithms for classifying data.

Cons:

1. The algorithm is slow to execute, since every prediction compares the input against the entire training set (a timing sketch follows this list).
2. It must keep the whole training set around, about 2 MB of storage for this example. [In short: it is expensive in both time and space.]
3. It gives no insight into the underlying structure of the data, so it cannot tell us what an average or typical example of each class looks like.
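To see the time cost concretely: each call to classify0 computes the distance from the input to all m training vectors, so the handwriting test performs roughly mTest × m distance computations in 1024 dimensions. A quick timing sketch, assuming the digits data is in place:

import time
import KNN

start = time.time()
KNN.handwritingClassTest()
print "elapsed: %.1f seconds" % (time.time() - start)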

In the next chapter we will use probabilistic measures to handle classification; that algorithm addresses this last problem.