Web Page Ranking: A Python Implementation of the HITS Algorithm
2016-03-26 19:05
The principles of the algorithm are not repeated here; please refer to:
http://blog.csdn.net/hguisu/article/details/8013489
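Save the code as a .py file. By default, the script reads its source data from the pgr_data.txt file in the data directory under the directory containing the code file. This default can be changed in the source code or overridden with a command-line argument, as in the following invocation: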
python hits.py pgr_data.txt
The trailing argument in the command is the path to the input data.
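The input format is not spelled out here, but the source code reads the first two whitespace-separated tokens of each line as integer node IDs for the source and target of a link, and it indexes nodes starting from 1. A minimal sketch of that parsing step, assuming this inferred format (`parse_edges` and the sample lines are hypothetical, not part of the original script):

```python
def parse_edges(lines):
    """Return (source, target) pairs of integer node IDs, one pair per non-blank line."""
    edges = []
    for line in lines:
        tokens = line.split()  # split() also drops "\n" and "\r\n" line endings
        if len(tokens) < 2:
            continue  # skip blank or incomplete lines
        edges.append((int(tokens[0]), int(tokens[1])))
    return edges


# Hypothetical sample: a three-page cycle, one "source target" pair per line.
sample = ["1 2\n", "2 3\r\n", "\n", "3 1\n"]
edges = parse_edges(sample)
```

Passing an open file object instead of a list works the same way, since iterating over a file yields its lines.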
The code defines three parameters:
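size = 100      ### the size of the network
times = 200     ### the maximum number of iterations
error = 0.0001  ### the error threshold for stopping the iterations
These are, respectively, the maximum number of nodes in the HITS network, the maximum number of iterations, and the maximum allowed error. The last two parameters bound the number of iterations.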
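How the last two parameters interact can be illustrated with a generic loop: keep applying an update step until either the iteration cap is reached or the total absolute change between two successive score vectors drops below the tolerance. The following is only a sketch of that stopping rule; `iterate_until_stable`, `step`, and the halving example are hypothetical names introduced for illustration.

```python
def iterate_until_stable(step, state, max_iters=200, tol=1e-4):
    """Apply `step` until the L1 change falls below `tol` or `max_iters` is reached."""
    iterations = 0
    while iterations < max_iters:
        new_state = step(state)
        iterations += 1
        # Total absolute change between the old and the new vector.
        change = sum(abs(new - old) for new, old in zip(new_state, state))
        state = new_state
        if change < tol:
            break
    return state, iterations


def halve(values):
    """Toy update step: every value shrinks by half on each call."""
    return [v / 2.0 for v in values]


# With a loose cap the tolerance ends the loop; with a tight cap the cap does.
_, stopped_by_tol = iterate_until_stable(halve, [1.0], max_iters=200, tol=1e-4)
_, stopped_by_cap = iterate_until_stable(halve, [1.0], max_iters=5, tol=1e-4)
```

In the script, `times` plays the role of `max_iters` and `error` the role of `tol`.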
The Python source code is as follows:
from __future__ import print_function  # lets the print() calls run on Python 2 and 3

__author__ = 'Administrator'

import sys

size = 100      # maximum number of nodes in the network
times = 200     # maximum number of iterations
error = 0.0001  # stop once the total change between iterations falls below this

# tr_data[i][j] == 1 means node i links to node j.
# Node IDs are assumed to be the integers 1..n, so row and column 0 stay unused.
tr_data = [[0 for i in range(size)] for j in range(size)]
out_degree = [0 for i in range(size)]  # number of link lines read per source node
tr_lg = 0
st = set()


def hits():
    # Copy the 1-based adjacency matrix into the 0-based matrix ha.
    for i in range(tr_lg):
        for j in range(tr_lg):
            ha[i][j] = tr_data[i + 1][j + 1]
    k = 0
    while k < times - 1:
        err = 0
        k += 1
        for i in range(tr_lg):
            for j in range(tr_lg):
                if ha[i][j] != 0:
                    hub[k][i] += aut[k - 1][j]  # a hub gains from the authorities it links to
                    aut[k][j] += hub[k - 1][i]  # an authority gains from the hubs linking to it
        # Normalize both score vectors so that each sums to 1.
        a = b = 0
        for i in range(tr_lg):
            a += hub[k][i]
            b += aut[k][i]
        for i in range(tr_lg):
            hub[k][i] = float(hub[k][i]) / a
            aut[k][i] = float(aut[k][i]) / b
            err += abs(hub[k][i] - hub[k - 1][i]) + abs(aut[k][i] - aut[k - 1][i])
        if err < error:
            break
    return k


if __name__ == '__main__':
    sour = "data/pgr_data.txt"
    if len(sys.argv) > 1:
        sour = sys.argv[1]
    with open(sour, "r") as fp:
        for line in fp:
            ls = line.split()  # split() also discards the trailing line break
            if len(ls) < 2:
                continue  # skip blank or incomplete lines
            for i in range(len(ls)):
                st.add(ls[i])
            tr_data[int(ls[0])][int(ls[1])] = 1
            out_degree[int(ls[0])] += 1
    tr_lg = len(st)
    print("the number of websites:", tr_lg)
    # Row k holds the scores after iteration k. Row 0 is the initial guess of 1 for
    # every node; starting from all zeros would make the first normalization divide by zero.
    hub = [[1 if j == 0 else 0 for i in range(tr_lg)] for j in range(times)]
    aut = [[1 if j == 0 else 0 for i in range(tr_lg)] for j in range(times)]
    ha = [[0 for i in range(tr_lg)] for j in range(tr_lg)]
    n = hits()
    print("iteration times:", n)
    print("the hub:", hub[n])        # final hub scores
    print("the authority:", aut[n])  # final authority scores
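The script above stores the graph in a fixed-size matrix and expects node IDs to run from 1 to n. As a point of comparison, here is a compact sketch of the same update rule over an edge list, using dictionaries so that the IDs need not be contiguous. It is a variant written for illustration, not the author's code: `hits_scores` and its parameter names are hypothetical, and it mirrors the script's choices (initial scores of 1, simultaneous updates from the previous iteration, normalization so each vector sums to 1).

```python
def hits_scores(edges, max_iters=200, tol=1e-4):
    """Return (hub, authority, iterations) for a list of (source, target) links."""
    edges = set(edges)  # repeated links count once, like the 0/1 matrix in the script
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    aut = {node: 1.0 for node in nodes}
    iterations = 0
    while iterations < max_iters:
        iterations += 1
        new_hub = dict.fromkeys(nodes, 0.0)
        new_aut = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hub[src] += aut[dst]  # hubs gain from the authorities they link to
            new_aut[dst] += hub[src]  # authorities gain from the hubs linking to them
        hub_total = sum(new_hub.values()) or 1.0  # avoid dividing by zero on an empty graph
        aut_total = sum(new_aut.values()) or 1.0
        change = 0.0
        for node in nodes:
            new_hub[node] /= hub_total
            new_aut[node] /= aut_total
            change += abs(new_hub[node] - hub[node]) + abs(new_aut[node] - aut[node])
        hub, aut = new_hub, new_aut
        if change < tol:
            break
    return hub, aut, iterations


# Hypothetical three-page graph: page 1 links to 2 and 3, page 2 links to 3.
hub, aut, iterations = hits_scores([(1, 2), (1, 3), (2, 3)])
```

In this small graph, page 1 links out the most and page 3 is linked to the most, so they end up with the highest hub and authority scores respectively.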