K-Means Clustering in Python
2017-05-17 16:24
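The idea behind K-Means clustering
1. Randomly choose K points as the initial cluster centers.
2. Assign each remaining point to the cluster of its nearest center, based on its distance to each of the K centers.
3. Recompute each cluster's center as the mean of all the points assigned to it.
4. Repeat steps 2 and 3 until the cluster centers no longer change.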
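The loop described above can be sketched in plain Python without any library. This is only a minimal illustration, not the scikit-learn implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the sample points are all made up for the example, and an empty cluster simply keeps its old center.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Pick K distinct data points as the initial centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Made-up 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = kmeans(points, k=2)
print(labels)
print(centers)
```

Which group gets label 0 depends on the random initial centers, so only the grouping itself is meaningful, not the label numbers.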
Python implementation:
import numpy as np
from sklearn.cluster import KMeans


def loadData(filePath):
    """Read 'name,value1,value2,...' lines into a data list and a name list."""
    retData = []
    retCityName = []
    with open(filePath, 'r') as fr:
        for line in fr:
            # Preprocess the line: split the city name from its expense figures
            items = line.strip().split(',')
            retCityName.append(items[0])
            retData.append([float(items[i]) for i in range(1, len(items))])
    return retData, retCityName


if __name__ == '__main__':
    data, cityName = loadData('city.txt')
    km = KMeans(n_clusters=3)  # group the cities into three clusters
    label = km.fit_predict(data)  # cluster label for each city
    # Sum each row of the centers, i.e. the expense level of each cluster
    expenses = np.sum(km.cluster_centers_, axis=1)
    print(expenses)
    CityCluster = [[], [], []]
    for i in range(len(cityName)):
        # Put each city name into the list that matches its label
        CityCluster[label[i]].append(cityName[i])
    for i in range(len(CityCluster)):
        print('Expenses:%.2f' % expenses[i])  # expense level of this cluster
        print(CityCluster[i])
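The contents of city.txt are not shown here, but the parsing code implies one record per line: a name followed by comma-separated numbers. The sketch below illustrates that inferred format on an in-memory sample. The helper name `parse_rows` and the rows for "CityA" and "CityB" are invented for the example and are not taken from the real data file.

```python
import io


def parse_rows(lines):
    """Split each 'name,v1,v2,...' line into a name and a list of floats."""
    names, rows = [], []
    for line in lines:
        items = line.strip().split(',')
        if len(items) < 2:
            continue  # skip blank or malformed lines
        names.append(items[0])
        rows.append([float(v) for v in items[1:]])
    return rows, names


# Made-up sample, not taken from the original city.txt.
sample = io.StringIO("CityA,10.5,20.0,3.25\nCityB,8.0,15.5,2.0\n\n")
rows, names = parse_rows(sample)
print(names)  # ['CityA', 'CityB']
print(rows)   # [[10.5, 20.0, 3.25], [8.0, 15.5, 2.0]]
```

Skipping blank lines is an addition in this sketch; the loader above would raise an error on a line without any numeric fields.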
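DBSCAN density-based clustering
Key property: the number of clusters does not have to be specified in advance, so the final number of clusters is not known beforehand.
The DBSCAN algorithm divides data points into three types:
1. Core point: has at least MinPts points within radius Eps.
2. Border point: has fewer than MinPts points within radius Eps, but lies in the neighborhood of a core point.
3. Noise point: any point that is neither of the above.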
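The three point types can be checked directly on a small data set. The sketch below is an illustration under stated assumptions: a point's neighborhood counts the point itself, and a point is core when that count is at least `min_pts` (the convention scikit-learn uses for `min_samples`). The function name `classify_points` and the sample points are made up.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' (illustrative sketch)."""
    n = len(points)
    # Eps-neighborhood of every point: indices within distance eps, itself included.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Assumption: "dense enough" means at least min_pts points in the neighborhood.
    is_core = [len(nb) >= min_pts for nb in neighbors]
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append('core')
        elif any(is_core[j] for j in neighbors[i]):
            kinds.append('border')
        else:
            kinds.append('noise')
    return kinds


# Made-up data: a dense square, one straggler next to it, one far outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.4, 1.0), (10.0, 10.0)]
kinds = classify_points(points, eps=1.5, min_pts=4)
print(kinds)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```

The straggler at (2.4, 1.0) has only two points in its own neighborhood, but one of them is the core point (1.0, 1.0), so it counts as a border point rather than noise.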
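DBSCAN algorithm flow:
1. Label every point as a core point, border point, or noise point. (For example, with Eps = 3 and MinPts = 3: for each point, compute the set of points inside its Eps-neighborhood; a point whose set holds at least MinPts points is a core point. Of the remaining points, those inside a core point's neighborhood are border points and the rest are noise points.)
2. Remove the noise points.
3. Add an edge between every pair of core points that lie within Eps of each other.
4. Each connected group of core points forms one cluster.
5. Assign each border point to the cluster of one of its associated core points (a core point whose radius covers it).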
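The five steps above can be followed almost literally in plain Python. This is a minimal O(n^2) sketch for illustration, not how scikit-learn implements DBSCAN. It assumes a neighborhood includes the point itself and that a core point needs at least `min_pts` neighbors. The names `dbscan` and `NOISE` and the sample points are invented for the example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; noise points get -1 (sketch)."""
    n = len(points)
    # Step 1: find Eps-neighborhoods and mark the core points.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    # Step 2 is implicit: every point starts as noise and keeps that label
    # unless a later step assigns it to a cluster.
    labels = [NOISE] * n

    # Steps 3 and 4: core points within eps of each other are linked, and each
    # connected group of core points becomes one cluster (depth-first search).
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != NOISE:
            continue
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbors[i]:
                if is_core[j] and labels[j] == NOISE:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1

    # Step 5: attach each border point to the cluster of a core point in its
    # neighborhood (the first one found, if there are several).
    for i in range(n):
        if is_core[i]:
            continue
        for j in neighbors[i]:
            if is_core[j]:
                labels[i] = labels[j]
                break
    return labels


# Made-up data: two dense squares, one border point, one outlier.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.4, 1.0),
    (10.0, 10.0),
    (20.0, 20.0), (20.0, 21.0), (21.0, 20.0), (21.0, 21.0),
]
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)  # [0, 0, 0, 0, 0, -1, 1, 1, 1, 1]
```

Note that the number of clusters (two here) comes out of the data and the two parameters; it is never passed in.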
Python implementation:
import numpy as np
import sklearn.cluster as skc
from sklearn import metrics
import matplotlib.pyplot as plt

mac2id = dict()
onlinetimes = []
with open('TestData.txt', encoding='utf-8') as f:
    for line in f:
        fields = line.split(',')
        mac = fields[2]  # MAC address
        onlinetime = int(fields[6])  # online duration
        starttime = int(fields[4].split(' ')[1].split(':')[0])  # start time (hour)
        if mac not in mac2id:
            mac2id[mac] = len(onlinetimes)
            onlinetimes.append((starttime, onlinetime))
        else:
            # A MAC address seen before: overwrite its earlier record
            onlinetimes[mac2id[mac]] = (starttime, onlinetime)
# One (start hour, online duration) row per MAC address
real_X = np.array(onlinetimes).reshape((-1, 2))

# First pass: cluster on the start hour
x = real_X[:, 0:1]
db = skc.DBSCAN(eps=0.01, min_samples=20).fit(x)
labels = db.labels_
print('Labels:')
print(labels)
ratio = len(labels[labels[:] == -1]) / len(labels)  # share of noise points (label -1)
print('noise ratio:', format(ratio, '.2%'))
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters:%d' % n_clusters_)
print('silhouette coefficient:%0.3f' % metrics.silhouette_score(x, labels))
for i in range(n_clusters_):
    print('cluster', i, ':')
    print(list(x[labels == i].flatten()))
plt.hist(x, 24)
plt.show()

# Second pass: cluster on the log-transformed online duration
x = np.log(1 + real_X[:, 1:])
db = skc.DBSCAN(eps=0.14, min_samples=10).fit(x)
labels = db.labels_
print('Labels:')
print(labels)
ratio = len(labels[labels[:] == -1]) / len(labels)
print('noise ratio:', format(ratio, '.2%'))
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters:%d' % n_clusters_)
print('silhouette coefficient:%0.3f' % metrics.silhouette_score(x, labels))
for i in range(n_clusters_):
    print('cluster', i, ':')
    count = len(x[labels == i])
    mean = np.mean(real_X[labels == i][:, 1])
    std = np.std(real_X[labels == i][:, 1])
    print('\t number of samples:', count)
    print('\t mean of samples:', format(mean, '.1f'))
    print('\t std of samples:', format(std, '.1f'))