您的位置:首页 > 编程语言 > Python开发

python 机器学习之kNN算法

2015-08-15 21:40 731 查看


1、KNN分类算法

KNN分类算法(K-Nearest-Neighbors Classification),又叫K近邻算法,是一个概念极其简单,而分类效果又很优秀的分类算法。

他的核心思想就是,要确定测试样本属于哪一类,就寻找所有训练样本中与该测试样本“距离”最近的前K个样本,然后看这K个样本大部分属于哪一类,那么就认为这个测试样本也属于哪一类。简单的说就是让最相似的K个样本来投票决定。

这里所说的距离,一般最常用的就是多维空间的欧式距离。这里的维度指特征维度,即样本有几个特征就属于几维。

KNN示意图如下所示。(图片来源:百度百科http://baike.baidu.com/view/1485833.htm?from_id=3479559&type=syn&fromtitle=knn&fr=aladdin



上图中要确定测试样本绿色属于蓝色还是红色。

显然,当K=3时,将以1:2的投票结果分类于红色;而K=5时,将以3:2的投票结果分类于蓝色。
kNN算法的计算步骤如下所示:

1)计算已知类别数据集中的点与当前点之间的距离;

2)按照距离递增次序排序;

3)选取与当前点距离最小的k个点;

4)确定前k个点所在类别的出现频率;

5)返回前k个点出现频率最高的类别作为当前点的预测分类。
2、python实现
python实现kNN有两种方法,一种是自己根据kNN算法的定义编写函数来完成,根据机器学习实战这本书中的思路,以手写字符的识别为例,用python实现kNN算法。除了给出手写字符的识别,我们还给出了一个小的测试程序,方便进行手工验算体会kNN的运算过程。
新建一个名为kNN.py的文件,写入如下代码:
from numpy import *
import operator
import os
import sys
# coding=utf-8
# classify using kNN

#kNN的测试程序
# create a dataset which contains 4 samples with 2 classes
def createDataSet():
# create a matrix: each row as a sample
group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B'] # four samples and two classes
return group, labels

def kNNClassify(newInput, dataSet, labels, k):
numSamples = dataSet.shape[0] # shape[0] stands for the num of row

## step 1: calculate Euclidean distance
# tile(A, reps): Construct an array by repeating A reps times
# the following copy numSamples rows for dataSet
diff = tile(newInput, (numSamples, 1)) - dataSet # Subtract element-wise
squaredDiff = diff ** 2 # squared for the subtract
squaredDist = sum(squaredDiff, axis = 1) # sum is performed by row
distance = squaredDist ** 0.5

## step 2: sort the distance
# argsort() returns the indices that would sort an array in a ascending order
sortedDistIndices = argsort(distance)

classCount = {} # define a dictionary (can be append element)
for i in xrange(k):
## step 3: choose the min k distance
voteLabel = labels[sortedDistIndices[i]]

## step 4: count the times labels occur
# when the key voteLabel is not in dictionary classCount, get()
# will return 0
classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

## step 5: the max voted class will return
maxCount = 0
for key, value in classCount.items():
if value > maxCount:
maxCount = value
maxIndex = key

return maxIndex

# convert image to vector
def  img2vector(filename):
rows = 32
cols = 32
imgVector = zeros((1, rows * cols))
fileIn = open(filename)
for row in xrange(rows):
lineStr = fileIn.readline()
for col in xrange(cols):
imgVector[0, row * 32 + col] = int(lineStr[col]) #the picture in fact consist of a serious of numbers,so just read it wil be ok
return imgVector

# load dataSet
def loadDataSet():
## step 1: Getting training set
print "---Getting training set..."
dataSetDir = 'D:/python/kNN/'
# dataSetDir can also get easily in such way,when move the file location the exe can still works
#dataSetDir=sys.path[0]
trainingFileList = os.listdir(dataSetDir + 'trainingDigits') # load the training set
numSamples = len(trainingFileList)                       #get the number of training samples

train_x = zeros((numSamples, 1024))
train_y = []
for i in xrange(numSamples):
filename = trainingFileList[i]

# get train_x
train_x[i, :] = img2vector(dataSetDir + 'trainingDigits/%s' % filename)

# get label from file name such as "1_18.txt"
label = int(filename.split('_')[0]) # return 1
train_y.append(label)

## step 2: Getting testing set
print "---Getting testing set..."
testingFileList = os.listdir(dataSetDir + 'testDigits') # load the testing set
numSamples = len(testingFileList)
test_x = zeros((numSamples, 1024))
test_y = []
for i in xrange(numSamples):
filename = testingFileList[i]

# get train_x
test_x[i, :] = img2vector(dataSetDir + 'testDigits/%s' % filename)

# get label from file name such as "1_18.txt"
label = int(filename.split('_')[0]) # return 1
test_y.append(label)

return train_x, train_y, test_x, test_y

# test hand writing class
def testHandWritingClass():
## step 1: load data
print "step 1: load data..."
train_x, train_y, test_x, test_y = loadDataSet()

## step 2: training...
print "step 2: training..."
pass

## step 3: testing
print "step 3: testing..."
numTestSamples = test_x.shape[0]
matchCount = 0
for i in xrange(numTestSamples):
predict = kNNClassify(test_x[i], train_x, train_y, 3)
if predict == test_y[i]:
matchCount += 1
accuracy = float(matchCount) / numTestSamples

## step 4: show the result
print "step 4: show the result..."
print 'The classify accuracy is: %.2f%%' % (accuracy * 100)
代码中做了比较详细的注释,然后新建一个名为test_kNN.py的文件,写入如下代码进行测试:
import kNN
from numpy import *
#1.test few numbers
dataSet, labels = kNN.createDataSet()

testX = array([1.2, 1.1])
k = 3
outputLabel = kNN.kNNClassify(testX, dataSet, labels, 3)
print "Your input is:", testX, "and classified to class: ", outputLabel

testX = array([0.1, 0.3])
outputLabel = kNN.kNNClassify(testX, dataSet, labels, 3)
print "Your input is:", testX, "and classified to class: ", outputLabel

#2.test handwriting
kNN.testHandWritingClass()
3.sklearn 机器学习模块kNN算法
以电影分类数据为例

电影名称打斗次数接吻次数电影类型
California Man

3104Romance
He’s Not Really into Dudes

2100Romance
Beautiful Woman

181Romance
Kevin Longblade

10110Action
Robo Slayer 3000

995Action
Amped II

982Action
未知1890Unknown
利用sklearn模块,进行kNN分类的代码如下:
#coding=utf-8
import numpy as np
from sklearn import neighbors
knn = neighbors.classification.KNeighborsClassifier()
data = np.array([[3,104],[2,100],[1,81],[101,10],[99,5],[98,2]]) #create input data into arrays
print data
labels = np.array([1,1,1,2,2,2])
train=knn.fit(data,labels)
print train   #get the informatino of the classifier
result=knn.predict([18,90])
print result
直接运行该文件,可以查看结果。

参考文章;
1. http://blog.csdn.net/zouxy09/article/details/16955347 2. http://blog.csdn.net/zhaoyl03/article/details/8666906 3. http://blog.csdn.net/lsldd/article/details/41357931
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: