(Data Mining - Introduction - 9) Clustering
2015-10-06 20:32
Main contents:
1. Motivation
2. Clustering
3. Python implementation
I. Motivation
The classifiers we implemented earlier were all trained on datasets that carry class labels; this kind of learning is called supervised learning. Those labels are generally produced by hand, which makes them costly to obtain. Raw, real-world data, on the other hand, usually comes without labels. Can we still group such data into classes when no labels are available?
The answer is yes, and that is the unsupervised learning method introduced in this post: clustering.
Supervised learning: learn from a dataset with class labels, training a classification model that can predict the class of new samples.
Unsupervised learning: learn from a dataset without class labels, in order to discover which group each sample in the training set belongs to.
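To make the distinction concrete, here is a minimal sketch assuming scikit-learn is available (an assumption on my part; the library is not used elsewhere in this post, and the numbers are made up for illustration). A classifier must be given labels y at training time, while a clusterer sees only the features X and invents its own group labels:

```python
# Supervised vs. unsupervised in code (sketch; assumes scikit-learn).
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[170, 60], [160, 50], [180, 80], [155, 45]]   # feature vectors
y = [1, 0, 1, 0]                                   # hand-made class labels

# Supervised: training needs both X and y.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[175, 70]]))      # predicted class for a new sample

# Unsupervised: training sees only X; group labels are discovered.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                    # cluster id assigned to each sample
```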
II. Clustering
Two clustering methods are covered here.
1. Hierarchical clustering
How hierarchical clustering works: the number of clusters is not specified in advance. Initially, every sample forms its own cluster; on each iteration, the two closest clusters are merged, and this repeats until only a single cluster remains.
The result has a tree structure much like a Huffman tree, usually drawn as a dendrogram.
Details: the distance between two clusters can be computed in three common ways: single-linkage (the distance between the two closest points, one from each cluster), complete-linkage (the distance between the two farthest points), and average-linkage (the average distance over all cross-cluster pairs of points).
For details, see:
http://baike.baidu.com/link?url=OxYi2gA8dsfvyg8EjnxzwNkh3YHpqC8rePFdNeOUDYbnzE1XSKDxAj1-F9P0htnUHnUBx7vBflMLjWZdajcan_
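To make the merge loop concrete, here is a minimal, self-contained sketch of agglomerative clustering (not the book's code; the point coordinates and function names are my own, and it is written for clarity rather than speed). The linkage parameter switches between the three cluster-distance definitions above:

```python
import math

def point_distance(a, b):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def cluster_distance(c1, c2, linkage="single"):
    """Distance between two clusters under the chosen linkage."""
    dists = [point_distance(p, q) for p in c1 for q in c2]
    if linkage == "single":         # closest pair of points
        return min(dists)
    if linkage == "complete":       # farthest pair of points
        return max(dists)
    return sum(dists) / len(dists)  # "average": mean over all pairs

def agglomerate(points, linkage="single"):
    """Merge the two closest clusters until one cluster remains."""
    clusters = [[p] for p in points]   # every sample starts as its own cluster
    merges = []                        # merge history = the dendrogram
    while len(clusters) > 1:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_distance(clusters[ab[0]],
                                                   clusters[ab[1]], linkage))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                # j > i, so index i is still valid
    return merges

for a, b in agglomerate([(1, 1), (1, 2), (5, 5), (6, 5), (9, 9)], "single"):
    print("merge", a, "+", b)
```

Running it prints the merge order, which is exactly the information a dendrogram visualizes.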
2. k-means clustering
Unlike hierarchical clustering, k-means takes the number of clusters k as an input. It repeatedly (a) assigns every point to its nearest centroid and (b) recomputes each centroid as the mean point of its cluster, stopping once fewer than 1% of the points change cluster membership. The k-means++ variant used below improves on purely random seeding: each additional initial centroid is drawn with probability proportional to a point's distance from the centroids chosen so far, so the starting centroids tend to be spread far apart.

III. Python implementation

The following is the k-means++ clusterer from the book's sample code (http://www.guidetodatamining.com):

```python
"""
Implementation of the k-means++ algorithm for the book
"A Programmer's Guide to Data Mining"
http://www.guidetodatamining.com
"""
import math
import random


def getMedian(alist):
    """get median of list"""
    tmp = list(alist)
    tmp.sort()
    alen = len(tmp)
    if (alen % 2) == 1:
        return tmp[alen // 2]
    else:
        return (tmp[alen // 2] + tmp[(alen // 2) - 1]) / 2


def normalizeColumn(column):
    """normalize the values of a column using Modified Standard Score,
    that is (each value - median) / (absolute standard deviation)"""
    median = getMedian(column)
    asd = sum([abs(x - median) for x in column]) / len(column)
    result = [(x - median) / asd for x in column]
    return result


class kClusterer:
    """ Implementation of kMeans Clustering
    This clusterer assumes that the first column of the data is a label
    not used in the clustering. The other columns contain numeric data.
    """

    def __init__(self, filename, k):
        """ k is the number of clusters to make
        This init method:
        1. reads the data from the file named filename
        2. stores that data by column in self.data
        3. normalizes the data using Modified Standard Score
        4. randomly selects the initial centroids
        5. assigns points to clusters associated with those centroids
        """
        file = open(filename)
        self.data = {}
        self.k = k
        self.counter = 0
        self.iterationNumber = 0
        # used to keep track of % of points that change cluster membership
        # in an iteration
        self.pointsChanged = 0
        # Sum of Squared Error
        self.sse = 0
        #
        # read data from file
        #
        lines = file.readlines()
        file.close()
        header = lines[0].split(',')
        self.cols = len(header)
        self.data = [[] for i in range(len(header))]
        # we are storing the data by column.
        # For example, self.data[0] is the data from column 0.
        # self.data[0][10] is the column 0 value of item 10.
        for line in lines[1:]:
            cells = line.split(',')
            toggle = 0
            for cell in range(self.cols):
                if toggle == 0:
                    # first column is the label, kept as a string
                    self.data[cell].append(cells[cell])
                    toggle = 1
                else:
                    self.data[cell].append(float(cells[cell]))

        self.datasize = len(self.data[1])
        self.memberOf = [-1 for x in range(len(self.data[1]))]
        #
        # now normalize number columns
        #
        for i in range(1, self.cols):
            self.data[i] = normalizeColumn(self.data[i])

        # select random centroids from existing points
        random.seed()
        self.selectInitialCentroids()
        self.assignPointsToCluster()

    def showData(self):
        for i in range(len(self.data[0])):
            print("%20s %8.4f %8.4f" %
                  (self.data[0][i], self.data[1][i], self.data[2][i]))

    def distanceToClosestCentroid(self, point, centroidList):
        result = self.eDistance(point, centroidList[0])
        for centroid in centroidList[1:]:
            distance = self.eDistance(point, centroid)
            if distance < result:
                result = distance
        return result

    def selectInitialCentroids(self):
        """implement the k-means++ method of selecting
        the set of initial centroids"""
        centroids = []
        total = 0
        # first step is to select a random first centroid
        current = random.choice(range(len(self.data[0])))
        centroids.append(current)
        # loop to select the rest of the centroids, one at a time
        for i in range(0, self.k - 1):
            # for every point in the data find its distance to
            # the closest centroid
            weights = [self.distanceToClosestCentroid(x, centroids)
                       for x in range(len(self.data[0]))]
            total = sum(weights)
            # instead of raw distances, convert so sum of weights = 1
            weights = [x / total for x in weights]
            #
            # now roll virtual die
            num = random.random()
            total = 0
            x = -1
            # the roulette wheel simulation: points far from existing
            # centroids occupy a bigger slice and are more likely chosen
            while total < num:
                x += 1
                total += weights[x]
            centroids.append(x)
        self.centroids = [[self.data[i][r] for i in range(1, len(self.data))]
                          for r in centroids]

    def updateCentroids(self):
        """Using the points in the clusters, determine the centroid
        (mean point) of each cluster.
        NOTE: this assumes no cluster ever becomes empty; an empty
        cluster would make members[centroid] zero and divide by zero."""
        members = [self.memberOf.count(i) for i in range(len(self.centroids))]
        self.centroids = [
            [sum([self.data[k][i] for i in range(len(self.data[0]))
                  if self.memberOf[i] == centroid]) / members[centroid]
             for k in range(1, len(self.data))]
            for centroid in range(len(self.centroids))]

    def assignPointToCluster(self, i):
        """ assign point to cluster based on distance from centroids"""
        minDist = 999999
        clusterNum = -1
        for centroid in range(self.k):
            dist = self.euclideanDistance(i, centroid)
            if dist < minDist:
                minDist = dist
                clusterNum = centroid
        # here is where I will keep track of changing points
        if clusterNum != self.memberOf[i]:
            self.pointsChanged += 1
        # add square of distance to running sum of squared error
        self.sse += minDist**2
        return clusterNum

    def assignPointsToCluster(self):
        """ assign each data point to a cluster"""
        self.pointsChanged = 0
        self.sse = 0
        self.memberOf = [self.assignPointToCluster(i)
                         for i in range(len(self.data[1]))]

    def eDistance(self, i, j):
        """ compute distance of point i from point j"""
        sumSquares = 0
        for k in range(1, self.cols):
            sumSquares += (self.data[k][i] - self.data[k][j])**2
        return math.sqrt(sumSquares)

    def euclideanDistance(self, i, j):
        """ compute distance of point i from centroid j"""
        sumSquares = 0
        for k in range(1, self.cols):
            sumSquares += (self.data[k][i] - self.centroids[j][k-1])**2
        return math.sqrt(sumSquares)

    def kCluster(self):
        """the method that actually performs the clustering
        As you can see, this method repeatedly
            updates the centroids by computing the mean point of each cluster
            re-assigns the points to clusters based on these new centroids
        until the number of points that change cluster membership
        is less than 1%.
        """
        done = False
        while not done:
            self.iterationNumber += 1
            self.updateCentroids()
            self.assignPointsToCluster()
            #
            # we are done if fewer than 1% of the points change clusters
            #
            if float(self.pointsChanged) / len(self.memberOf) < 0.01:
                done = True
        print("Final SSE: %f" % self.sse)

    def showMembers(self):
        """Display the results"""
        for centroid in range(len(self.centroids)):
            print("\n\nClass %i\n========" % centroid)
            for name in [self.data[0][i] for i in range(len(self.data[0]))
                         if self.memberOf[i] == centroid]:
                print(name)


##
## RUN THE K-MEANS CLUSTERER ON THE DOG DATA USING K = 3
##
km = kClusterer('dogs.csv', 3)
km.kCluster()
km.showMembers()
```
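The last three lines expect the dogs.csv sample file that accompanies the book. If it is not at hand, a tiny file in the same layout (one label column followed by numeric columns) is enough to smoke-test the class; the breeds and numbers below are illustrative stand-ins I made up, not the book's data:

```python
# Hypothetical stand-in for the book's dogs.csv (label, then numeric columns).
import os
import tempfile

csv_text = """breed,height,weight
Chihuahua,8,8
Yorkshire Terrier,6,7
Boston Terrier,16,20
Border Collie,20,45
Brittany Spaniel,18,35
Standard Poodle,19,65
German Shepherd,25,78
Golden Retriever,23,70
Bullmastiff,27,120
Great Dane,32,160
"""

path = os.path.join(tempfile.gettempdir(), "dogs.csv")
with open(path, "w") as f:
    f.write(csv_text)

km = kClusterer(path, 3)   # cluster the ten breeds into k = 3 groups
km.kCluster()
km.showMembers()
```

With data like this, the small, medium, and large breeds should usually land in separate clusters, though k-means results vary run to run because the initial centroids are chosen randomly.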