
K-Means Clustering in Python

2017-05-17 16:24

The idea behind K-Means clustering

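1. Randomly pick K points as the initial cluster centers.

2. Assign each remaining point to the cluster of its nearest center, based on its distance to the K chosen centers.

3. Recompute each cluster's center as the mean of all points in that cluster.

4. Repeat steps 2 and 3 until the cluster centers no longer change.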
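To make the loop concrete, here is a minimal NumPy sketch of these steps. The function name `kmeans`, its parameters (`max_iter`, `seed`) and the toy data are assumptions made up for illustration; it is not how scikit-learn implements K-Means internally.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means following the steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points at random as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of the points assigned to it
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Two well-separated toy groups (made-up data).
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centers = kmeans(toy, k=2)
print(toy_labels)
print(toy_centers)
```

Which group gets label 0 depends on the random initial centers, so only the grouping itself is meaningful, not the label numbers.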

Python implementation:

import numpy as np
from sklearn.cluster import KMeans


def loadData(filePath):
    """Read 'name,value1,value2,...' lines; return the values and the city names."""
    retData = []
    retCityName = []
    with open(filePath, 'r') as fr:  # read-only is enough; the file is closed automatically
        for line in fr:
            # Preprocess the line: split the city name from its spending figures
            items = line.strip().split(',')
            retCityName.append(items[0])
            retData.append([float(items[i]) for i in range(1, len(items))])
    return retData, retCityName


if __name__ == '__main__':
    data, cityName = loadData('city.txt')
    km = KMeans(n_clusters=3)  # group the cities into three clusters
    label = km.fit_predict(data)  # cluster label for each city
    # Sum each row of the cluster centers (the spending level of each cluster)
    expenses = np.sum(km.cluster_centers_, axis=1)
    print(expenses)
    CityCluster = [[], [], []]
    for i in range(len(cityName)):
        # Put each city name into the list that matches its label
        CityCluster[label[i]].append(cityName[i])
    for i in range(len(CityCluster)):
        print('Expenses:%.2f' % expenses[i])  # spending level of each cluster
        print(CityCluster[i])
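The contents of city.txt are not shown here; from the parsing code, each line is assumed to be a name followed by comma-separated numbers. The sketch below runs the same pipeline on a small in-memory stand-in so it can be tried without the file. The names and figures are made up, and `n_init` and `random_state` are set only to make the run repeatable.

```python
import io

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for city.txt: "name,value1,value2,..." per line.
sample = io.StringIO(
    "A,100,120,90\n"
    "B,110,115,95\n"
    "C,500,520,480\n"
    "D,510,505,495\n"
    "E,1000,990,1010\n"
    "F,1005,1000,995\n"
)
names, rows = [], []
for line in sample:
    items = line.strip().split(',')
    names.append(items[0])
    rows.append([float(v) for v in items[1:]])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(np.array(rows))
totals = km.cluster_centers_.sum(axis=1)

# Cluster numbers are arbitrary, so print clusters from lowest to highest total.
for c in np.argsort(totals):
    members = [n for n, l in zip(names, labels) if l == c]
    print('Expenses:%.2f' % totals[c], members)
```

Sorting by the summed cluster center is just one way to get a stable reading order, since K-Means does not guarantee which cluster receives which label.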


DBSCAN density-based clustering

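Key property: the number of clusters does not have to be specified in advance, so the final number of clusters is determined by the data.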

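DBSCAN divides the data points into three kinds:

1. Core point: a point with at least MinPts points within radius Eps.

2. Border point: a point with fewer than MinPts points within radius Eps that nevertheless lies in the neighborhood of a core point.

3. Noise point: any point that is neither of the above.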
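The three kinds can be computed directly from a pairwise distance matrix. The sketch below is an illustration under stated assumptions: the function name `classify_points` and the toy data are made up, and a point counts as core when at least `min_pts` points, itself included, lie within `eps`. That matches how scikit-learn documents `min_samples`, but other texts count slightly differently.

```python
import numpy as np


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' (illustrative sketch)."""
    # Pairwise Euclidean distances; a point counts as its own neighbor.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = dists <= eps
    is_core = neighbors.sum(axis=1) >= min_pts
    # Border: not core, but inside the Eps-neighborhood of some core point.
    is_border = ~is_core & (neighbors & is_core[None, :]).any(axis=1)
    kinds = np.full(len(points), 'noise', dtype=object)
    kinds[is_border] = 'border'
    kinds[is_core] = 'core'
    return kinds


# Made-up points: a small square, one point hanging off it, one far away.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                [2.0, 1.0], [8.0, 8.0]])
point_kinds = classify_points(pts, eps=1.0, min_pts=3)
print(list(point_kinds))
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```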

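The DBSCAN procedure:

1. Label every point as a core, border, or noise point. For each point, compute the set of points inside its Eps-neighborhood (say Eps=3); points whose set contains at least MinPts=3 points are core points. Each remaining point is a border point if it lies in the neighborhood of a core point, and a noise point otherwise.

2. Discard the noise points.

3. Connect every pair of core points that are within Eps of each other with an edge.

4. Each connected group of core points forms a cluster.

5. Assign each border point to the cluster of one of its associated core points (a core point whose radius it falls within).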
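The five steps can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the name `dbscan_sketch` and the toy data are assumptions, a point counts itself toward `min_pts`, and a border point that is near several clusters simply joins the cluster of the first core point found.

```python
import numpy as np


def dbscan_sketch(points, eps, min_pts):
    """Return one label per point; -1 marks noise (illustrative sketch)."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = dists <= eps                      # includes the point itself
    is_core = neighbors.sum(axis=1) >= min_pts    # step 1: find the core points
    labels = np.full(n, -1)                       # step 2: noise keeps the label -1
    cluster = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Steps 3-4: flood-fill over core points linked by edges of length <= eps.
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(neighbors[i] & is_core):
                if labels[j] == -1:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    # Step 5: attach each border point to the cluster of a nearby core point.
    for i in np.flatnonzero(~is_core):
        core_nbrs = np.flatnonzero(neighbors[i] & is_core)
        if core_nbrs.size:
            labels[i] = labels[core_nbrs[0]]
    return labels


# Made-up points: two dense groups plus one isolated point at (8, 8).
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0],
                [8.0, 8.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
sketch_labels = dbscan_sketch(pts, eps=1.0, min_pts=3)
print(sketch_labels.tolist())
# → [0, 0, 0, 0, 0, -1, 1, 1, 1]
```

Building the full distance matrix costs O(n²) memory, which is fine for a demonstration but not for large datasets.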

Python implementation:

import numpy as np
import sklearn.cluster as skc
from sklearn import metrics
import matplotlib.pyplot as plt

mac2id = dict()
onlinetimes = []
with open('TestData.txt', encoding='utf-8') as f:
    for line in f:
        fields = line.split(',')
        mac = fields[2]  # MAC address
        onlinetime = int(fields[6])  # online duration
        starttime = int(fields[4].split(' ')[1].split(':')[0])  # hour at which the session started
        if mac not in mac2id:
            mac2id[mac] = len(onlinetimes)
            onlinetimes.append((starttime, onlinetime))
        else:
            # Keep only the latest record per MAC, stored as a tuple like the first one
            onlinetimes[mac2id[mac]] = (starttime, onlinetime)
real_X = np.array(onlinetimes).reshape((-1, 2))  # one row per MAC: (start hour, online duration)

# Cluster on the start hour
x = real_X[:, 0:1]
db = skc.DBSCAN(eps=0.01, min_samples=20).fit(x)
labels = db.labels_
print('Labels:')
print(labels)
ratio = len(labels[labels[:] == -1]) / len(labels)  # share of points labelled as noise (-1)
print('noise ratio:', format(ratio, '.2%'))
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters:%d' % n_clusters_)
print('silhouette coefficient:%0.3f' % metrics.silhouette_score(x, labels))
for i in range(n_clusters_):
    print('cluster', i, ':')
    print(list(x[labels == i].flatten()))
plt.hist(x, 24)
plt.show()

# Cluster on the log-transformed online duration
x = np.log(1 + real_X[:, 1:])
db = skc.DBSCAN(eps=0.14, min_samples=10).fit(x)
labels = db.labels_
print('Labels:')
print(labels)
ratio = len(labels[labels[:] == -1]) / len(labels)
print('noise ratio:', format(ratio, '.2%'))
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters:%d' % n_clusters_)
print('silhouette coefficient:%0.3f' % metrics.silhouette_score(x, labels))
for i in range(n_clusters_):
    print('cluster', i, ':')
    count = len(x[labels == i])
    mean = np.mean(real_X[labels == i][:, 1])
    std = np.std(real_X[labels == i][:, 1])
    print('\t number of samples:', count)
    print('\t mean of sample:', format(mean, '.1f'))
    print('\t std of sample:', format(std, '.1f'))
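TestData.txt is not included here, so the following sketch runs the same evaluation on a handful of made-up 2-D points. One thing to be aware of: `metrics.silhouette_score` takes the labels as given, so passing DBSCAN's output directly treats the noise label -1 as if it were a cluster. Scoring only the clustered points, as below, is one possible variation and not necessarily what the listing above intends.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics

# Made-up 2-D points: two dense groups plus one isolated point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0],
              [8.0, 8.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = DBSCAN(eps=1.1, min_samples=3).fit(X).labels_

noise_mask = labels == -1
noise_ratio = noise_mask.mean()  # fraction of points labelled -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('noise ratio:', format(noise_ratio, '.2%'))
print('Estimated number of clusters: %d' % n_clusters)

# The silhouette score needs at least two clusters; score only the clustered points.
if n_clusters >= 2:
    score = metrics.silhouette_score(X[~noise_mask], labels[~noise_mask])
    print('silhouette coefficient (noise excluded): %0.3f' % score)
```

The guard on `n_clusters` matters in practice, because `silhouette_score` raises an error when fewer than two distinct labels are present.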

