
Understanding the k-Nearest Neighbor (KNN) Algorithm

2016-03-07 10:25

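I. Overview of the KNN Algorithm

As a supervised classification algorithm, KNN is one of the simplest machine learning algorithms. As the name suggests, its core idea is to decide which class a sample belongs to from the classes of its nearest neighbors. The algorithm requires a training set whose samples are already labeled. The computation has the following three steps:
1. Compute the distance between the test sample and every sample in the training set. Euclidean distance, cosine distance, and others can be used; the simpler Euclidean distance is the most common choice.
2. Select the K training samples with the smallest distances from the previous step as the test sample's neighbors.
3. Find the class that occurs most often among those K neighbors; that class is assigned to the test sample.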
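
The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training sample.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 2: keep the k samples with the smallest distances.
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the most frequent label among those neighbors wins.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters labeled "A" and "B".
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # prints A
print(knn_predict(train_X, train_y, (5.1, 5.0), k=3))  # prints B
```

Note that every prediction scans the whole training set, which is the cost discussed under the disadvantages below.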

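II. Advantages and Disadvantages

1. Advantages

The idea is simple, easy to understand, and easy to implement; there are no model parameters to estimate and no training phase;
It is suitable for classifying rare events;
It is particularly well suited to multi-class problems.

2. Disadvantages

It is a lazy algorithm: classification is computationally expensive because every training sample must be scanned to compute distances, memory overhead is high, and scoring is slow;
When the classes are imbalanced, for example when one class has far more samples than the others, the K nearest neighbors of a new sample may be dominated by the large class, which hurts classification quality;
Interpretability is poor: it cannot produce rules the way a decision tree can.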

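III. Points to Note

1. Choosing K
If K is too small, classification accuracy drops because the prediction becomes sensitive to noise in individual samples; if K is too large and the test sample belongs to a class with few training samples, the neighborhood fills with samples from other classes, which adds noise and degrades the result.
K is usually chosen by cross-validation, with K=1 as the baseline.
Rule of thumb: K is generally kept below the square root of the number of training samples.
2. Optimizations
Condense the training set to reduce the number of samples that must be scanned;
When deciding the final class, use weighted voting instead of a simple majority vote, giving closer neighbors higher weight.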
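
As a rough sketch of choosing K by cross-validation, the snippet below scores each candidate K on the iris data with scikit-learn's `cross_val_score` and compares the best one with the K=1 baseline. The choice of 5 folds and the way the square-root bound is applied per training fold are assumptions made for this example, not something the rule of thumb prescribes.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

cv = 5                                    # number of folds (assumed)
n_train = len(X) * (cv - 1) // cv         # samples available for training in each fold
max_k = int(np.sqrt(n_train))             # rule of thumb: K below sqrt(n_train)

# Mean cross-validated accuracy for every candidate K, starting from the K=1 baseline.
mean_scores = {}
for k in range(1, max_k + 1):
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(knn, X, y, cv=cv).mean()

best_k = max(mean_scores, key=mean_scores.get)
print("baseline K=1 accuracy: %.3f" % mean_scores[1])
print("best K=%d accuracy: %.3f" % (best_k, mean_scores[best_k]))
```

On a dataset this small the differences between neighboring values of K are often tiny, so the selected K should be read as a reasonable choice rather than a uniquely correct one.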
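
To illustrate weighted voting, here is a minimal sketch in which each neighbor's vote is weighted by the inverse of its distance. The function name `weighted_knn_predict`, the small `eps` term that guards against division by zero, and the toy data are all assumptions made for this example.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, query, k=3, eps=1e-9):
    """Classify `query` with inverse-distance weighted voting over k neighbors."""
    # Sort all training samples by Euclidean distance to the query.
    distances = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    # Each of the k nearest neighbors votes with weight 1 / distance.
    scores = defaultdict(float)
    for dist, label in distances[:k]:
        scores[label] += 1.0 / (dist + eps)
    return max(scores, key=scores.get)

# Toy data (hypothetical): one very close "A" sample and two distant "B" samples.
train_X = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
train_y = ["A", "B", "B"]

print(weighted_knn_predict(train_X, train_y, (0.1, 0.1), k=3))  # prints A
```

With k=3 a simple majority vote would pick "B" here (two votes against one), while the weighted vote picks "A" because the single "A" neighbor is much closer to the query.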

IV. Using KNN with scikit-learn in Python

# Using KNN
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
np.unique(iris_y)
# Split iris data in train and test data
# A random permutation, to split the data randomly
np.random.seed(0)
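# permutation returns a randomly shuffled sequence of the integers in the given range
indices = np.random.permutation(len(iris_X))
# Use the shuffled indices to split the data into a training set and a test set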
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test  = iris_X[indices[-10:]]
iris_y_test  = iris_y[indices[-10:]]
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')

# Compare the predicted labels with the true labels (Python 3 print syntax)
print(knn.predict(iris_X_test))
print(iris_y_test)

In the scikit-learn version used here, KNeighborsClassifier takes 8 parameters (the first two below are the most commonly used):
n_neighbors : int, optional (default = 5): the value of K; the default number of neighbors is 5;
weights: how the neighbors are weighted. "uniform" gives every neighbor the same weight, "distance" weights each neighbor by the inverse of its distance. The default is uniform weights. A user-defined function can also be passed to compute the weights;
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional: the method used to compute the nearest neighbors; choose according to your needs;
leaf_size : int, optional (default = 30): leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem;
metric : string or DistanceMetric object (default = 'minkowski'): the distance metric to use for the tree. The default metric is minkowski, and with p=2 it is equivalent to the standard Euclidean metric. See the documentation of the DistanceMetric class for a list of available metrics;
p : integer, optional (default = 2): power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used;
metric_params : dict, optional (default = None): additional keyword arguments for the metric function.
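
As an illustration of how these parameters are passed, the sketch below repeats the same iris split and builds a classifier with non-default settings. The specific values (7 neighbors, distance weighting, a KD-tree, Manhattan distance via p=1) are arbitrary choices for this example, not recommendations.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Same random split as in the walkthrough: the last 10 shuffled samples form the test set.
np.random.seed(0)
indices = np.random.permutation(len(iris.data))
X_train, y_train = iris.data[indices[:-10]], iris.target[indices[:-10]]
X_test, y_test = iris.data[indices[-10:]], iris.target[indices[-10:]]

# Non-default settings, chosen only to show where each parameter goes.
knn = KNeighborsClassifier(
    n_neighbors=7,         # K
    weights='distance',    # closer neighbors get larger votes
    algorithm='kd_tree',   # neighbor search structure
    leaf_size=30,          # leaf size of the KD-tree
    p=1,                   # Minkowski power: 1 means Manhattan distance
)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)
accuracy = np.mean(predictions == y_test)  # fraction of test samples classified correctly
print(predictions)
print(y_test)
print(accuracy)
```

The accuracy computed by hand here is the same quantity that `knn.score(X_test, y_test)` returns.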

Output:

The predicted labels agree with the true labels.
