
K-means Cluster Analysis with Spark MLlib

2017-01-01 00:42

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
The K-means algorithm written from scratch against PySpark. In practice,
one may prefer to use the KMeans algorithm in ML, as shown in
examples/src/main/python/ml/kmeans_example.py.

This example requires NumPy (http://www.numpy.org/).
"""
from __future__ import print_function

import sys

import numpy as np
from pyspark.sql import SparkSession

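def parseVector(line):
    # Parse one space-separated line of numbers into a NumPy vector.
    return np.array([float(x) for x in line.split(' ')])

def closestPoint(p, centers):
    # Return the index of the center with the smallest squared
    # Euclidean distance to the point p.
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex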
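
The explicit loop in closestPoint can also be written with NumPy broadcasting. The following is only a minimal sketch of that idea: the name closest_point_vectorized and the sample point and centers are assumptions made for illustration and are not part of the Spark example.

```python
import numpy as np


def closest_point_vectorized(p, centers):
    """Return the index of the center nearest to p (squared Euclidean distance).

    Hypothetical helper for illustration; not part of the Spark example.
    """
    centers = np.asarray(centers, dtype=float)  # shape (k, d)
    # Broadcasting subtracts p from every row; summing over axis 1
    # yields one squared distance per center.
    sq_dists = np.sum((centers - p) ** 2, axis=1)
    # argmin returns the first minimum, like the strict "<" in the loop version.
    return int(np.argmin(sq_dists))


# Made-up sample data: one point parsed the same way parseVector does it.
point = np.array([float(x) for x in "1.0 2.0".split(" ")])
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.5]), np.array([5.0, 5.0])]
nearest = closest_point_vectorized(point, centers)
```

Both versions compare squared distances and skip the square root, which does not change which center is nearest.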

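Late at night I was talking with my girlfriend about the projects we have each been working on. She works on applying hash-based algorithms to image recognition, and last night my advisor gave me a task: run a cluster analysis with K-means.

At this most important stage of my life, what I need to do is improve my ability to learn and, in the end, actually master something.

The analysis is to be done on subway data.

My plan now is to get really good at Hadoop and Spark, the two big data frameworks. That is my direction, while she works on modeling, and we can each go deeper step by step.

What she said encouraged me enormously: "You have to keep going. Whatever you do, don't give up halfway again."

So, hang in there.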

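First, be clear about which data the K-means algorithm will be applied to and what results are expected.

Test the performance of the K-means algorithm.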

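Training set

In predictive analysis, the data is usually split into two parts: training data, used to build the model, and test data, used to evaluate it. Sometimes, however, the model also needs to be checked while it is being built, so the training data is split again into 1) training data and 2) validation data. The validation data assists model construction. Specifically:

Training data: used to build the model.

Validation data: optional; used to assist model construction, and may be reused.

Test data: used to evaluate the finished model. It is used only at evaluation time, to estimate the model's accuracy. It must never be used during model construction; otherwise the model overfits to it and the measured accuracy becomes unreliable.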
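
As a rough illustration of this three-way split, the sketch below shuffles a dataset once and cuts it into training, validation, and test parts. The function name split_dataset, the 60/20/20 fractions, and the fixed seed are all assumptions chosen for the example, not values from this post.

```python
import numpy as np


def split_dataset(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle data once and split it into (train, validation, test).

    Hypothetical helper; the fractions are illustrative defaults.
    """
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))  # shuffled row indices
    n_train = int(round(len(data) * train_frac))
    n_val = int(round(len(data) * val_frac))
    train = data[order[:n_train]]
    val = data[order[n_train:n_train + n_val]]
    test = data[order[n_train + n_val:]]  # whatever remains is held out
    return train, val, test


samples = np.arange(10)  # stand-in for real records
train, val, test = split_dataset(samples)
```

The three parts are disjoint by construction, so the test part never takes part in building the model.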

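K-Means is an iterative reassignment clustering algorithm based on squared error. Its core idea is very simple:

1. Randomly choose K center points.

2. Compute the distance from every point to each of the K centers, and assign each point to the cluster of its nearest center.

3. Recompute the center of each of the K clusters as the arithmetic mean of its points.

4. Repeat steps 2 and 3 until the cluster assignments no longer change or the maximum number of iterations is reached.

5. Output the result.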
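
The steps above can be sketched on a single machine with NumPy, without Spark. This is a minimal illustration rather than the MLlib implementation: the function name kmeans, the choice to draw the initial centers from the data points, the handling of empty clusters, and the sample points are all assumptions.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on an (n, d) array; returns (centers, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers (assumption).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Step 2: squared distance from every point to every center,
        # then assign each point to its nearest center.
        sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its members;
        # an empty cluster keeps its previous center (assumption).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Step 5: output the result.
    return centers, labels


# Made-up data: two well-separated groups of three points each.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = kmeans(data, k=2)
```

On data this well separated, the two groups should end up in different clusters regardless of which points are drawn as initial centers.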

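The quality of a K-Means result depends on the choice of initial cluster centers, and the algorithm easily gets stuck in a local optimum. There is no fixed rule for choosing K, the algorithm is sensitive to outliers, it can only handle numeric attributes, and the resulting clusters may be unbalanced.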
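
The sensitivity to outliers follows directly from using the arithmetic mean as the cluster center. A small sketch with made-up numbers (the points and variable names are illustrative assumptions):

```python
import numpy as np

# Three nearby points and one far-away outlier (made-up values).
cluster = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
outlier = np.array([[100.0, 100.0]])

# Center of the clean cluster: the mean of its points.
center_clean = cluster.mean(axis=0)
# A single outlier assigned to the same cluster drags the mean far away.
center_with_outlier = np.vstack([cluster, outlier]).mean(axis=0)
```

Here the center moves from (2, 2) to (26.5, 26.5), a position that is close to none of the original three points.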

References

Clustering - RDD-based API - Spark 2.0.0 Documentation
