Machine Learning in Action Notes (1): the kNN (k-Nearest Neighbor) Algorithm

2017-07-26 14:10, 555 views

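Overview

The kNN algorithm (k-nearest neighbors) is one of the basic classification algorithms in machine learning, and also one of the simplest. Its idea is not complicated, but it is computationally expensive: both its time complexity and its space complexity are high. The book uses a dating site and a handwritten digit recognition system as examples. I work through the same two examples here, with parts of the code changed so that it runs under Python 3.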

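Algorithm description

The k in kNN means that when a new data point is compared against the sample data, only the k most similar samples are kept.

For every point in the data set whose class is unknown, kNN performs the following steps in turn: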

Compute the distance between each point in the labeled data set and the current point (Euclidean distance: $d = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}$);

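Sort the points by increasing distance;

Take the k points closest to the current point;

Count how often each class occurs among these k points;

Return the most frequent class among the k points as the predicted class of the current point.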
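The five steps above can be sketched in plain Python without numpy. This is only a minimal illustration, not the book's implementation: the function name `knn_predict`, the helper `euclidean`, and the sample points are all made up for this sketch.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, samples, labels, k):
    """Classify `query` by majority vote among its k nearest samples."""
    # step 1: distance from every labeled sample to the query point
    distances = [euclidean(query, s) for s in samples]
    # steps 2-3: sort sample indices by increasing distance, keep the first k
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # step 4: count how often each class occurs among the k neighbors
    votes = Counter(labels[i] for i in nearest)
    # step 5: the most frequent class is the prediction
    return votes.most_common(1)[0][0]

# hypothetical toy data: three points near (1, 1.5), two points near (5.5, 8.5)
samples = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (6.0, 9.0), (1.2, 0.9)]
labels = ['A', 'A', 'B', 'B', 'A']

print(knn_predict((1.1, 1.5), samples, labels, k=3))
print(knn_predict((5.5, 8.5), samples, labels, k=3))
```

The first query sits among the 'A' points and the second among the 'B' points, so the two votes come out as 'A' and 'B'. Note that the second query still wins with only two 'B' neighbors out of three, which is the majority vote at work.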

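Characteristics

Pros: high accuracy, insensitive to outliers, no assumptions about the input data.

Cons: high computational complexity, high space complexity.

Works with: numeric and nominal values. The data must carry target values, i.e. labels assigned by a person. A label can come from the file name or from a column inside the file.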

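General workflow of the algorithm

Collect data

Prepare data: a structured data format; any consistent format of your own will do.

Analyze data

Train the algorithm: this step does not apply to kNN, but it is listed anyway to keep the general workflow explicit.

Test the algorithm: compute the error rate.

Use the algorithm: first feed in sample data and obtain structured output, then run kNN to decide which class each data point belongs to, and finally carry out whatever follow-up processing the classification is used for.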

Modules

from numpy import array, shape, tile, zeros  # numpy array creation and processing
import operator         # itemgetter, used as the 'key' parameter of sorted()
from os import listdir  # used to list the files in a folder

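Collecting and parsing data

Taking a text file as an example, extract the matrix data (mostly two-dimensional) and the label information from it.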

def file2matrix(filename):
    """
    read a tab-separated txt file into a matrix and a label list
    @param filename: filename
    @return: the feature matrix (one row per line, 3 columns) and the labels
    """
    with open(filename, mode='r') as fr:
        array_lines = fr.readlines()
    number_of_lines = len(array_lines)
    return_mat = zeros((number_of_lines, 3))
    class_label_vector = []
    for index, line in enumerate(array_lines):
        list_from_line = line.strip().split('\t')
        # the first three columns are features, the last column is the label
        return_mat[index, :] = [float(v) for v in list_from_line[0:3]]
        class_label_vector.append(int(list_from_line[-1]))
    return return_mat, class_label_vector
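To check the parsing logic without the data file at hand, the same splitting can be tried on a couple of lines held in memory. The sketch below is an assumption-laden stand-in, not part of kNN.py: `parse_rows` is a hypothetical helper that returns plain lists instead of a numpy matrix, and it assumes the columns are separated by tabs, as `file2matrix` does.

```python
import io

def parse_rows(lines, n_features=3):
    """Split tab-separated lines into feature rows and integer labels."""
    features, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:                # skip blank lines
            continue
        fields = line.split('\t')
        features.append([float(v) for v in fields[:n_features]])
        labels.append(int(fields[-1]))
    return features, labels

# two rows in the layout of the dating data, wrapped in a file-like object
sample = io.StringIO(
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
)
features, labels = parse_rows(sample)
print(features)
print(labels)
```

Iterating over a `StringIO` yields one line at a time, just like an open file, so the same function would also accept the object returned by `open()`.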

Core implementation of the algorithm

def classify0(inX, dataset, labels, k):
    """
    knn classify
    @param inX: the input vector which is ready to be classified
    @param dataset: the training data set
    @param labels: labels vector
    @param k: the number of nearest neighbors that vote
    @return: the most frequent label among the k nearest neighbors
    """
    dataset_size = dataset.shape[0]     # number of rows in the training set
    # repeat inX once per training row, then take the Euclidean distance row by row
    diff_mat = tile(inX, (dataset_size, 1)) - dataset
    sq_diff_mat = diff_mat**2
    sq_distances = sq_diff_mat.sum(axis=1)
    distances = sq_distances**0.5

    # indices of the training rows, ordered by increasing distance
    sorted_distance_indices = distances.argsort()
    class_count = {}
    for i in range(k):
        vote_i_label = labels[sorted_distance_indices[i]]
        class_count[vote_i_label] = class_count.get(vote_i_label, 0) + 1
    # sort the (label, count) pairs by count, largest first
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
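classify0 builds an explicit copy of inX for every training row with tile(). numpy broadcasting can do the same subtraction without that copy. The variant below is a sketch of this alternative, not the book's code: `classify_broadcast` is a hypothetical name, and `group` and `labels` are made-up stand-ins for a small sample set.

```python
import numpy as np
from collections import Counter

def classify_broadcast(in_x, dataset, labels, k):
    """Same voting rule as classify0, using broadcasting instead of tile()."""
    # (n, d) array minus (d,) vector: the vector is subtracted from every row
    diff = dataset - np.asarray(in_x)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]           # indices of the k closest rows
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# hypothetical sample set: two points near (1, 1), two points near (0, 0)
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(classify_broadcast([0, 0], group, labels, 3))
```

With k=3 the neighbors of the origin are the two 'B' points and one 'A' point, so the vote returns 'B'.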

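Example 1: classifying dating data

Our heroine, Helen, wants to build a model from data she has labeled herself, so that she can predict how much she will like the men she meets in the future.

Let us first look at part of the formatted data she has labeled: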

40920   8.326976    0.953952    3
14488   7.153469    1.673904    2
26052   1.441871    0.805124    1
75136   13.147394   0.428964    1
38344   1.669788    0.134296    1
72993   10.141740   1.032955    1
35948   6.830792    1.213192    3
42666   13.276369   0.543880    3
67497   8.631577    0.749278    1
35483   12.273169   1.508053    3
50242   3.723498    0.831917    1
63275   8.385879    1.669485    1
5569    4.875435    0.728658    2
51052   4.680098    0.625224    1

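From left to right, the columns are: frequent flyer miles earned per year, percentage of time spent playing video games, liters of ice cream consumed per week, and finally the label (1 to 3 mean dislike, like, and like very much, in that order). PS. It seems that playing games for a while is actually quite popular~

The data can be plotted as an "ice cream vs. game time" figure, as follows:

Preparing data: normalization

Normalization rescales the data into a fixed range. Here every feature is mapped into the range [0, 1], which makes later processing easier. Without it, the flyer-miles column, whose values are in the tens of thousands, would dominate the Euclidean distance. The code is as follows: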

def autonorm(dataset):
    """
    dataset normalization (min-max scaling of every column to [0, 1])
    @param dataset: np.array
    @return: dataset after norm, ranges, minimal value
    """
    min_val = dataset.min(0)    # column-wise minimum
    max_val = dataset.max(0)    # column-wise maximum
    ranges = max_val - min_val
    m = dataset.shape[0]
    norm_dataset = dataset - tile(min_val, (m, 1))
    norm_dataset = norm_dataset/tile(ranges, (m, 1))
    return norm_dataset, ranges, min_val
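To see the arithmetic behind autonorm on concrete numbers, the same min-max formula, (x - min) / (max - min), can be applied by hand to the first three rows of the sample data. The helper `min_max_normalize` below is a hypothetical pure-Python sketch for illustration; it assumes no column is constant, since a range of zero would cause a division by zero.

```python
def min_max_normalize(rows):
    """Rescale every column of `rows` to [0, 1]: (x - min) / (max - min)."""
    columns = list(zip(*rows))                   # transpose rows into columns
    mins = [min(c) for c in columns]
    ranges = [max(c) - min(c) for c in columns]
    normalized = [
        [(x - lo) / r for x, lo, r in zip(row, mins, ranges)]
        for row in rows
    ]
    return normalized, ranges, mins

# the first three rows of the dating sample shown earlier (labels dropped)
rows = [
    [40920.0, 8.326976, 0.953952],
    [14488.0, 7.153469, 1.673904],
    [26052.0, 1.441871, 0.805124],
]
normalized, ranges, mins = min_max_normalize(rows)
print([row[0] for row in normalized])   # the flyer-miles column after rescaling
```

For the flyer-miles column the minimum is 14488 and the range is 26432, so 26052 maps to (26052 - 14488) / 26432 = 0.4375, while the largest and smallest values map to 1.0 and 0.0.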

Testing the algorithm

Write the test code for this example:

def dating_class_test():
    """
    dating data test and see the error ratio
    @return: the output on screen which shows the result and the error rate
    """
    ho_ratio = 0.1  # the ratio of test data
    dating_data_mat, dating_labels = file2matrix('datingTestSet2.txt')
    norm_mat, ranges, min_val = autonorm(dating_data_mat)
    m = norm_mat.shape[0]
    num_test_vec = int(m*ho_ratio)
    error_count = 0.0
    for i in range(num_test_vec):
        # the first num_test_vec rows (the small part) are the test set;
        # the remaining rows num_test_vec:m (the large part) are the training set
        classify_result = classify0(norm_mat[i, :], norm_mat[num_test_vec:m, :], dating_labels[num_test_vec:m], 3)
        print("the classifier came back with: %d, the real answer is %d" % (classify_result, dating_labels[i]))
        if classify_result != dating_labels[i]:
            error_count += 1.0
    print("the total error rate is: %f%%" % (error_count/float(num_test_vec)*100.0))
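The hold-out scheme used in dating_class_test (test on the first share of the rows, use the rest as training data) can be isolated from kNN itself. The sketch below is a rough illustration under assumptions: `holdout_error_rate` and `nearest_label` are hypothetical helpers, the one-dimensional data is made up, and the split only makes sense if the rows are not sorted by class.

```python
def holdout_error_rate(predict, rows, labels, ho_ratio=0.1):
    """Test on the first ho_ratio share of the rows, use the rest as training data."""
    n_test = int(len(rows) * ho_ratio)
    train_rows, train_labels = rows[n_test:], labels[n_test:]
    errors = sum(
        predict(rows[i], train_rows, train_labels) != labels[i]
        for i in range(n_test)
    )
    return errors / n_test

def nearest_label(query, train_rows, train_labels):
    """1-nearest-neighbor on one-dimensional rows, only to drive the demo."""
    best = min(range(len(train_rows)), key=lambda i: abs(train_rows[i] - query))
    return train_labels[best]

# made-up data: small values are class 0, large values are class 1,
# except the second row, which carries a label the classifier will miss
rows = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85, 0.3, 0.7, 0.25, 0.75]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 0, 1]

print(holdout_error_rate(nearest_label, rows, labels, ho_ratio=0.2))
```

With ho_ratio=0.2 the first two rows form the test set. The first is classified correctly and the second is not, so the error rate is 1/2.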

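The output of the code looks like this:

Using the algorithm

Now put the algorithm to concrete use: given a person's three feature values, predict how attractive he is to Helen: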

def classify_person():
    """
    Test the charm of a person to you
    @return: print the result
    """
    result_list = ['not at all', 'in small doses', 'in large doses']
    percent_games = float(input('Percentage of time spent playing video games: '))
    length_miles = float(input('Frequent flier miles earned per year: '))
    ice_cream = float(input('Liters of ice cream consumed per week: '))

    dating_data_mat, dating_labels = file2matrix('datingTestSet2.txt')
    norm_mat, ranges, min_val = autonorm(dating_data_mat)
    # the feature order must match the columns of the data file
    in_arr = array([length_miles, percent_games, ice_cream])
    # normalize the input with the training set's minimum and range
    class_fier_result = classify0((in_arr - min_val)/ranges, norm_mat, dating_labels, 3)
    print('You will probably like this person: ', result_list[class_fier_result-1])

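The result of running the algorithm:

So, could you win Helen's heart too? (grin...)

Example 2: handwriting recognition system

Use the kNN algorithm to classify 32*32 data like that shown in the figure below:

Preparing data: converting images into test vectors

Here the data has to be converted from 32*32 to 1*1024. There are two ways to do this. The first is the book's method, which concatenates the rows one after another in a loop. The second is to use numpy's flatten() method directly, as shown in the figure below:

The book's method is used here: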

def img2vector(filename):
    """
    change the 32*32 image matrix to 1*1024 array
    @param filename: the data set filename
    @return: the 1*1024 array
    """
    return_vect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            line_str = fr.readline()
            for j in range(32):
                return_vect[0, 32*i+j] = int(line_str[j])
    return return_vect
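The flatten() route mentioned before the listing might look roughly like the sketch below. It is an illustration under assumptions: the 32 text lines are made up in memory instead of being read from a digit file. One detail worth noting is that flatten() returns a one-dimensional array of shape (1024,), so reshape(1, 1024) is the call that yields the 1*1024 shape produced by img2vector.

```python
import numpy as np

# a hypothetical 32x32 image of 0/1 characters, in the layout of the digit files
text_lines = ['0' * 16 + '1' * 16 + '\n' for _ in range(32)]

# turn each line into a row of 32 integers, ignoring the trailing newline
image = np.array([[int(ch) for ch in line[:32]] for line in text_lines])

flat = image.flatten()                  # shape (1024,), rows joined in order
row_vector = image.reshape(1, 1024)     # shape (1, 1024), like img2vector's result

print(image.shape, flat.shape, row_vector.shape)
```

Both calls keep the row-major order used by the loop version, so element 32*i+j of the flat array is pixel (i, j) of the image.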

Testing the algorithm: recognizing handwritten digits with kNN

def handwriting_class_test():
    """
    handwriting test
    @return: screen output
    """
    hw_labels = []
    training_file_list = listdir('trainingDigits')
    m = len(training_file_list)
    training_mat = zeros((m, 1024))
    for i in range(m):
        file_name_str = training_file_list[i]
        file_str = file_name_str.split('.')[0]
        # the digit is the part of the file name before the underscore
        class_num = int(file_str.split('_')[0])
        hw_labels.append(class_num)
        training_mat[i, :] = img2vector('trainingDigits/%s' % file_name_str)
    test_file_list = listdir('testDigits')
    error_count = 0.0
    m_test = len(test_file_list)
    for i in range(m_test):
        file_name_str = test_file_list[i]
        file_str = file_name_str.split('.')[0]
        class_num = int(file_str.split('_')[0])
        # the test vector must be read from testDigits, not from trainingDigits
        vector_under_test = img2vector('testDigits/%s' % file_name_str)
        classifier_result = classify0(vector_under_test, training_mat, hw_labels, 3)
        print('the classifier came back with: %d, the real answer is: %d' % (classifier_result, class_num))
        if classifier_result != class_num:
            error_count += 1.0
    print('\nthe total number of errors is %d' % error_count)
    print('\nthe total error rate is %f' % (error_count/float(m_test)))
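In this example the label is not stored inside the file but encoded in the file name: the test code takes the part before the underscore as the digit. That step can be tried in isolation. `label_from_filename` is a hypothetical helper and the two file names are made-up examples of the pattern the code expects.

```python
def label_from_filename(file_name):
    """Return the digit encoded before the underscore, e.g. '9_45.txt' -> 9."""
    stem = file_name.split('.')[0]      # drop the extension: '9_45'
    return int(stem.split('_')[0])      # keep the part before '_': 9

print([label_from_filename(name) for name in ['0_13.txt', '9_45.txt']])
```

A file name that does not follow this pattern would raise a ValueError in int(), which is a cheap way to notice stray files in the data folders.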

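The result of running it:

Changing the amount of data and the value of k changes the error rate. In my own test, reducing k brought the error rate down to 0.0%. Note that k=1 yields 0.0% whenever every test vector also appears in the training set, so it is worth checking that the test files are really read from testDigits.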
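How the error rate moves with k can be explored with a small loop. The sketch below uses made-up one-dimensional data, so its numbers say nothing about the digit data set. `knn_label` and `error_rate` are hypothetical helpers written only for this illustration.

```python
from collections import Counter

def knn_label(query, train, k):
    """Majority label among the k training pairs (value, label) nearest to query."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(test, train, k):
    """Share of test pairs whose predicted label differs from the true label."""
    wrong = sum(knn_label(value, train, k) != label for value, label in test)
    return wrong / len(test)

# made-up data: one 'b' point sits close to the 'a' cluster
train = [(0.0, 'a'), (0.1, 'a'), (0.2, 'a'), (0.45, 'b'),
         (1.0, 'b'), (1.1, 'b'), (1.2, 'b')]
test = [(0.4, 'b'), (0.15, 'a'), (1.05, 'b')]

for k in (1, 3, 5):
    print(k, error_rate(test, train, k))

# testing on the training data itself with k=1: every point is its own neighbor
print(error_rate(train, train, 1))
```

On this toy set k=1 classifies all three test points correctly, while k=3 and k=5 each miss the test point at 0.4, because the surrounding 'a' points outvote its single 'b' neighbor. The last line shows why a 0.0% error rate with k=1 is not meaningful when the test data overlaps the training data.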

Test code

# coding=utf-8
"""
knn algorithm test file
"""

import kNN
import numpy as np
import matplotlib.pyplot as plt

# create_dataset() is not listed in this post; it is expected to return
# a small sample matrix and the matching list of labels
group, labels = kNN.create_dataset()
print(kNN.classify0([0, 0], group, labels, 3))

dating_data_mat, dating_labels = kNN.file2matrix('datingTestSet2.txt')
print(dating_data_mat)
print(dating_labels)

fig = plt.figure()
ax = fig.add_subplot(111)
# ax.scatter(dating_data_mat[:, 1], dating_data_mat[:, 2], 10*np.array(dating_labels), 10*np.array(dating_labels))
type1_x = []
type1_y = []
type2_x = []
type2_y = []
type3_x = []
type3_y = []
# split the points by label so that each class gets its own scatter series
for i in range(len(dating_labels)):
    if dating_labels[i] == 1:   # dislike
        type1_x.append(dating_data_mat[i][1])
        type1_y.append(dating_data_mat[i][2])
    if dating_labels[i] == 2:   # like
        type2_x.append(dating_data_mat[i][1])
        type2_y.append(dating_data_mat[i][2])
    if dating_labels[i] == 3:   # like very much
        type3_x.append(dating_data_mat[i][1])
        type3_y.append(dating_data_mat[i][2])
type1 = ax.scatter(type1_x, type1_y, s=20)
type2 = ax.scatter(type2_x, type2_y, s=30)
type3 = ax.scatter(type3_x, type3_y, s=40)
ax.legend((type1, type2, type3), ('dislike', 'like', 'like very much'))

plt.xlabel('Percentage of Time Spent Playing Games')
plt.ylabel('Liters of Ice Cream per Week')
plt.title('Distribution of the Data Set')
# a second plt.legend() call would replace the labeled legend created above
plt.show()

norm_mat, ranges, min_val = kNN.autonorm(dating_data_mat)
print(norm_mat)
print(ranges)
print(min_val)

# kNN.dating_class_test()

# kNN.classify_person()

# kNN.handwriting_class_test()

Updated from time to time; to be continued...
