[Python in Action] The k-Nearest Neighbors Algorithm (2)
2017-08-24 11:58
#ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))
[For scatter usage, see this blog post: http://blog.csdn.net/anneqiqi/article/details/64125186]
After switching to this call, the plot draws each class with its own color and marker size.
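For context, here is a minimal self-contained version of that plotting step; the fig/ax setup comes from the book's surrounding code, so treat it as a sketch:
import matplotlib.pyplot as plt
from numpy import array
import KNN

# Sketch of the full plotting snippet; assumes KNN.py (with file2matrix
# from part (1)) and datingTestSet2.txt are in the working directory.
datingDataMat, datingLabels = KNN.file2matrix('datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
# Marker size and color both scale with the class label (1, 2 or 3),
# so the three classes separate visually.
ax.scatter(datingDataMat[:,1], datingDataMat[:,2],
           15.0*array(datingLabels), 15.0*array(datingLabels))
plt.show()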
Next, the normalization function:
def autoNorm(dataSet):
    minVals = dataSet.min(0)   # column-wise minimum of the 2-D matrix (axis 0 means per column)
    maxVals = dataSet.max(0)   # column-wise maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))   # shape(dataSet) gives (rows, cols); zeros builds a matrix of that shape, initialized to 0
    m = dataSet.shape[0]                  # number of rows in dataSet
    normDataSet = dataSet - tile(minVals, (m,1))     # tile repeats minVals m times, one copy per row
    normDataSet = normDataSet/tile(ranges, (m,1))    # element-wise division, not matrix division
    return normDataSet, ranges, minVals
>>> datingDataMat,datingLabels=KNN.file2matrix('datingTestSet2.txt')
>>> normMat,ranges,minVals=KNN.autoNorm(datingDataMat)
>>> normMat
array([[ 0.44832535, 0.39805139, 0.56233353],
[ 0.15873259, 0.34195467, 0.98724416],
[ 0.28542943, 0.06892523, 0.47449629],
...,
[ 0.29115949, 0.50910294, 0.51079493],
[ 0.52711097, 0.43665451, 0.4290048 ],
[ 0.47940793, 0.3768091 , 0.78571804]])
>>> ranges
array([ 9.12730000e+04, 2.09193490e+01, 1.69436100e+00])
>>> minVals
array([ 0. , 0. , 0.001156])
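As a quick sanity check of what autoNorm computes (newValue = (oldValue - min)/(max - min), column by column), here is a toy run; the 3x2 matrix is made up for illustration:
from numpy import array
from KNN import autoNorm

# Toy matrix, invented for illustration only.
toy = array([[1.0, 10.0],
             [3.0, 30.0],
             [5.0, 50.0]])
normToy, ranges, minVals = autoNorm(toy)
print(normToy)    # each column rescaled to [0,1]: [[0. 0.] [0.5 0.5] [1. 1.]]
print(ranges)     # [ 4. 40.]
print(minVals)    # [ 1. 10.]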
2.2.4 Test code
def datingClassTest():
    hoRatio = 0.10    # fraction of the data held out for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:], datingLabels[numTestVecs:m], 3)
        print 'the classifier came back with: %d, the real answer is: %d' % (classifierResult, datingLabels[i])
        if (classifierResult != datingLabels[i]): errorCount += 1.0
    print "the total error rate is: %f" % (errorCount/float(numTestVecs))
>>> import KNN
>>> KNN.datingClassTest()
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
......
the classifier came back with: 1, the real answer is: 1
the total error rate is: 0.080000
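datingClassTest (and everything below) relies on classify0 from part (1) of this series. For reference, here is a sketch of that function as it appears in Machine Learning in Action; details may differ slightly from the version in part (1):
from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet   # inX minus every training row
    distances = ((diffMat**2).sum(axis=1))**0.5      # Euclidean distances
    sortedDistIndicies = distances.argsort()         # indices, nearest first
    classCount = {}
    for i in range(k):                               # majority vote among the k nearest
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]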
2.2.5 Enter a person's information and get a prediction of how much you will like them
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']   # the three levels of liking
    percentTats = float(raw_input("percentage of time spent playing video games?"))
    ffMiles = float(raw_input("frequent flier miles earned per year?"))
    iceCream = float(raw_input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # training data
    normMat, ranges, minVals = autoNorm(datingDataMat)                # normalized training matrix
    inArr = array([ffMiles, percentTats, iceCream])                   # the person to classify
    classifierResult = classify0((inArr - minVals)/ranges, normMat, datingLabels, 3)   # normalize the input before classifying
    print 'You will probably like this person: ', resultList[classifierResult - 1]
shell:
>>> import KNN
>>> KNN.classifyPerson()
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person: in small doses
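Note that classifyPerson scales the input with the training set's minVals and ranges before handing it to classify0. Redoing that step by hand for the session above, using the ranges/minVals printed earlier in this post:
from numpy import array

inArr   = array([10000.0, 10.0, 0.5])   # ffMiles, percentTats, iceCream
minVals = array([0.0, 0.0, 0.001156])
ranges  = array([9.12730000e+04, 2.09193490e+01, 1.69436100e+00])
# Same normalization classify0 receives: (inArr - minVals)/ranges
print((inArr - minVals) / ranges)   # approx. [0.1096  0.4780  0.2944]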
2.3 A handwritten digit recognition system
Applying kNN to image data stored in binary form.
2.3.1 Preparing the data
#convert a 32x32 binary image matrix into a 1x1024 vector
def img2vector(filename):
    returnVect = zeros((1,1024))   # a 2-D array: 1 row, 1024 columns
    fr = open(filename)
    for i in range(32):            # each row of the image
        lineStr = fr.readline()
        for j in range(32):        # each column
            returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect
Test:
>>> import KNN
>>> testvector=KNN.img2vector('digits/testDigits/0_13.txt')
>>> testvector[0,0:31]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0.])
>>> testvector[0,32:63]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0.])
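Each digit file is 32 text lines of 32 '0'/'1' characters, which is exactly what img2vector flattens. To view one row of the vector as the text line it came from, here is a small helper (illustrative, not from the book):
import KNN

testvector = KNN.img2vector('digits/testDigits/0_13.txt')
row0 = testvector[0, 0:32]                 # first image row, 32 pixels
print(''.join('%d' % p for p in row0))     # prints something like 00000000000000111100000000000000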
2.3.2 Testing the algorithm
from os import listdir   # needed for reading the digits directories

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]   # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('digits/trainingDigits/%s' % fileNameStr)
    testFileList = listdir('digits/testDigits')   # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]   # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)
        if (classifierResult != classNumStr): errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount/float(mTest))
Pros:
k-nearest neighbors is one of the simplest and most effective algorithms for classifying data.
Cons:
1. Execution is inefficient. 2. Holding the whole training set requires about 2 MB of storage. [In short: costly in both time and space.]
3. It reveals nothing about the underlying structure of the data, so you cannot tell what an average or typical instance of each class looks like.
In the next chapter we will handle classification with probability-based methods, which address this shortcoming.