
Programming Collective Intelligence (2): Discovering Groups

2016-03-12 22:16

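Blog: http://andyheart.me. Posts appear on my own blog first and are then published to CSDN.

If you spot an error, please send me a private message or leave a comment. Thanks.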

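Concepts

Data clustering: a method for finding closely related things, people, or ideas and visualizing them. The goal is to collect data and then find the distinct groups within it.

Supervised learning: techniques that use example inputs and expected outputs to learn how to make predictions. Examples include neural networks, decision trees, support vector machines, and Bayesian filtering.

Unsupervised learning: looks for structure within a set of data, where no single piece of data is the answer we are after.

Hierarchical clustering: builds a hierarchy of groups by repeatedly merging the two most similar groups. Each group starts out as a single item.

K-means clustering: first place K centroids at random, then assign every data item to its nearest centroid. Once the assignment is done, each centroid moves to the average position of all the items assigned to it, and the assignment starts over. This repeats until the assignments stop changing.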

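Main topics

Building the data an algorithm needs from a variety of sources;

Two different clustering algorithms (hierarchical clustering and K-means clustering);

More about distance metrics;

Simple visualization code for inspecting the groups that are produced;

Projecting very complex data sets onto two dimensions.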

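Example

Clustering bloggers

Clustering blogs by word frequency shows which people regularly write about similar topics.

(1) Counting the words in a feed

An RSS feed is a simple XML document containing information about a blog and all of its entries. To count words, the feeds first have to be parsed, which can be done with Universal Feed Parser.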

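About the code: this part builds the data set that the later steps work on. It is written in Python, in the file generatefeedvector. The flow is: use Universal Feed Parser to parse, one by one, the RSS feeds at the addresses listed in feedlist.txt, take the title and summary of each entry, split them into words, and count the words.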

The Python code is as follows:

import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
    # Parse the feed
    d=feedparser.parse(url)
    wc={}

    # Loop over all the entries
    for e in d.entries:
        if 'summary' in e: summary=e.summary
        else: summary=e.description

        # Extract a list of words
        words=getwords(e.title+' '+summary)
        for word in words:
            wc.setdefault(word,0)
            wc[word]+=1
    return d.feed.title,wc

def getwords(html):
    # Remove all the HTML tags
    txt=re.compile(r'<[^>]+>').sub('',html)

    # Split words by all non-alpha characters
    words=re.compile(r'[^A-Za-z]+').split(txt)

    # Convert to lowercase
    return [word.lower() for word in words if word!='']

# apcount: number of blogs in which each word appears more than once
# wordcounts: word counts for each blog, keyed by blog title
apcount={}
wordcounts={}
# Strip newlines and skip blank lines so every entry is a clean URL
with open('feedlist.txt') as f:
    feedlist=[line.strip() for line in f if line.strip()]
for feedurl in feedlist:
    try:
        title,wc=getwordcounts(feedurl)
        wordcounts[title]=wc
        for word,count in wc.items():
            apcount.setdefault(word,0)
            if count>1:
                apcount[word]+=1
    except Exception:
        print('Failed to parse feed %s' % feedurl)

# Keep only words counted in more than 10% and fewer than 50% of the blogs
wordlist=[]
for w,bc in apcount.items():
    frac=float(bc)/len(feedlist)
    if frac>0.1 and frac<0.5:
        wordlist.append(w)

# Write the blog-by-word count matrix as tab-separated text
with open('blogdata1.txt','w') as out:
    out.write('Blog')
    for word in wordlist: out.write('\t%s' % word)
    out.write('\n')
    for blog,wc in wordcounts.items():
        print(blog)
        out.write(blog)
        for word in wordlist:
            if word in wc: out.write('\t%d' % wc[word])
            else: out.write('\t0')
        out.write('\n')
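
To see what the word extraction step produces for a single entry, here is a minimal sketch that runs the same tag-stripping and splitting logic on one string. The `sample` text is made up for illustration and stands in for `e.title+' '+summary`; `Counter` is used in place of the `setdefault` loop purely for brevity.

```python
import re
from collections import Counter

def getwords(html):
    # Strip HTML tags, then split on runs of non-letter characters
    txt = re.sub(r'<[^>]+>', '', html)
    return [w.lower() for w in re.split(r'[^A-Za-z]+', txt) if w]

# Hypothetical entry text (not taken from any real feed)
sample = '<p>Python clustering: <b>clustering</b> blogs with Python!</p>'

counts = Counter(getwords(sample))
print(counts['python'], counts['clustering'], counts['blogs'])
```

Tag names such as `p` and `b` never reach the counts because the tags are removed before splitting, and lowercasing makes "Python" and "python" the same word.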

A few of the URLs listed in feedlist.txt:

http://gofugyourself.typepad.com/go_fug_yourself/index.rdf
http://googleblog.blogspot.com/rss.xml
http://feeds.feedburner.com/GoogleOperatingSystem
http://headrush.typepad.com/creating_passionate_users/index.rdf
http://feeds.feedburner.com/instapundit/main
http://jeremy.zawodny.com/blog/rss2.xml
http://joi.ito.com/index.rdf
http://feeds.feedburner.com/Mashable
http://michellemalkin.com/index.rdf
http://moblogsmoproblems.blogspot.com/rss.xml
http://newsbusters.org/node/feed
http://beta.blogger.com/feeds/27154654/posts/full?alt=rss
http://feeds.feedburner.com/paulstamatiou
http://powerlineblog.com/index.rdf

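(2) Hierarchical clustering of the data

This part computes the Pearson correlation coefficient between the rows of the data set, that is, between the blogs' word-count vectors, and uses it as the measure of how closely two blogs are related. Repeatedly merging the closest pair then yields the grouping of blogs. The code for this part is in the file clusters.py.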
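
As an illustration of the measure, the sketch below turns the Pearson correlation into a distance. It assumes the distance is defined as 1 minus the correlation, so that smaller values mean more similar vectors; the name `pearson_distance` and the choice to return 1.0 for a constant vector are assumptions of this sketch, not necessarily what clusters.py does.

```python
from math import sqrt

def pearson_distance(v1, v2):
    # Assumption: distance = 1 - Pearson correlation (0 = identical trend)
    n = len(v1)
    mean1 = sum(v1) / n
    mean2 = sum(v2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(v1, v2))
    var1 = sum((a - mean1) ** 2 for a in v1)
    var2 = sum((b - mean2) ** 2 for b in v2)
    den = sqrt(var1 * var2)
    if den == 0:
        # Correlation is undefined for a constant vector; this sketch
        # treats the pair as uncorrelated
        return 1.0
    return 1.0 - cov / den

a = [1, 2, 3, 4]
b = [2, 4, 6, 8]   # same trend as a, only scaled
c = [4, 3, 2, 1]   # opposite trend

print(pearson_distance(a, b))
print(pearson_distance(a, c))
```

Because the correlation ignores scale, a blog that simply contains more text than another can still come out as close to it, which is one reason to prefer this measure over a plain Euclidean distance for word counts.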

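clusters.py mainly contains readfile(), which loads the data file; pearson(v1,v2), which scores two lists by their Pearson correlation coefficient (it is used as a distance in the K-means code below, so smaller values mean more similar); and hcluster(), the main hierarchical clustering function.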
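
The merging loop behind hierarchical clustering can be sketched in a few lines. This is a simplified illustration, not the contents of clusters.py: the function name `hcluster_sketch`, the nested-tuple tree, the Euclidean distance swapped in for brevity, and the rule that a merged cluster sits at the average of its two parts are all assumptions of the sketch.

```python
from math import sqrt

def euclidean(v1, v2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def hcluster_sketch(rows, distance=euclidean):
    # Each cluster is (vector, tree); a leaf's tree is just its row index
    clusters = [(list(row), i) for i, row in enumerate(rows)]
    while len(clusters) > 1:
        # Find the closest pair among the current clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (v1, t1), (v2, t2) = clusters[i], clusters[j]
        # Assumption: the merged cluster sits at the average of both vectors
        merged = [(a + b) / 2.0 for a, b in zip(v1, v2)]
        # Delete j first (j > i) so that index i stays valid
        del clusters[j]
        del clusters[i]
        clusters.append((merged, (t1, t2)))
    return clusters[0][1]

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
tree = hcluster_sketch(points)
print(tree)  # ((0, 1), (2, 3))
```

Rows 0 and 1 merge first, then rows 2 and 3, and the last step joins the two pairs. Every pass compares all pairs of clusters, so the loop gets slow on large data sets.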

(3) Visualizing the hierarchical clustering (drawing a dendrogram)

The dendrogram of the clusters is drawn with Python's PIL package.
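
Before drawing an image, it can help to check the hierarchy as indented text. The sketch below is a text-only stand-in, not the PIL drawing code: it assumes the hierarchy is available as nested tuples of row indices, and the function name `render_tree` and the blog labels are made up for illustration.

```python
def render_tree(tree, labels, depth=0):
    # Returns the hierarchy as a list of indented text lines
    indent = '  ' * depth
    if isinstance(tree, tuple):
        # Internal node: mark the merge, then show both branches one level deeper
        lines = [indent + '-']
        for branch in tree:
            lines.extend(render_tree(branch, labels, depth + 1))
        return lines
    # Leaf: a row index, shown by its label
    return [indent + labels[tree]]

# Hypothetical labels and tree shape
labels = ['blog A', 'blog B', 'blog C']
tree = ((0, 1), 2)

for line in render_tree(tree, labels):
    print(line)
```

Items that share a parent `-` were merged together, and deeper indentation means the merge happened earlier.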

(4) K-means clustering of the data

The Python code is as follows:

import random

# Assumes pearson() from clusters.py is defined earlier in the same module
def kcluster(rows,distance=pearson,k=4):
    # Determine the minimum and maximum values for each column
    ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
            for i in range(len(rows[0]))]

    # Create k randomly placed centroids
    clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
               for i in range(len(rows[0]))] for j in range(k)]

    lastmatches=None
    for t in range(100):
        print('Iteration %d' % t)
        bestmatches=[[] for i in range(k)]

        # Find which centroid is the closest for each row
        for j in range(len(rows)):
            row=rows[j]
            bestmatch=0
            for i in range(k):
                d=distance(clusters[i],row)
                if d<distance(clusters[bestmatch],row): bestmatch=i
            bestmatches[bestmatch].append(j)

        # If the results are the same as last time, this is complete
        if bestmatches==lastmatches: break
        lastmatches=bestmatches

        # Move the centroids to the average of their members
        for i in range(k):
            avgs=[0.0]*len(rows[0])
            if len(bestmatches[i])>0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m]+=rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j]/=len(bestmatches[i])
                clusters[i]=avgs

    return bestmatches
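
Random starting centroids make the result vary from run to run, so a toy run with fixed starting points is easier to follow. The sketch below is a compact variant written for illustration only: the name `kmeans_fixed_start`, the hand-picked centroids, the toy rows, and the Euclidean distance are assumptions of the sketch rather than part of kcluster.

```python
from math import sqrt

def euclidean(v1, v2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def kmeans_fixed_start(rows, centroids, distance=euclidean, max_iter=100):
    centroids = [list(c) for c in centroids]
    last = None
    for _ in range(max_iter):
        # Assign each row index to its nearest centroid
        groups = [[] for _ in centroids]
        for j, row in enumerate(rows):
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(centroids[i], row))
            groups[nearest].append(j)
        # Stop once the assignment no longer changes
        if groups == last:
            break
        last = groups
        # Move each non-empty centroid to the mean of its members
        for i, members in enumerate(groups):
            if members:
                centroids[i] = [sum(rows[m][d] for m in members) / len(members)
                                for d in range(len(rows[0]))]
    return groups, centroids

# Toy data: two obvious groups of 2-D points
rows = [(1, 1), (1, 2), (8, 8), (9, 8)]
groups, centroids = kmeans_fixed_start(rows, centroids=[(0, 0), (10, 10)])
print(groups)     # [[0, 1], [2, 3]]
print(centroids)  # [[1.0, 1.5], [8.5, 8.0]]
```

As with kcluster, the result is a list of row indices per cluster, so mapping back to blog names is a lookup into the list of row labels.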

The chapter closes with clustering of preferences, which is useful when writing a recommendation engine.
