
Python: A Web Page Update Monitoring Tool

2016-04-06 13:44


(2012-08-04 17:29:05)


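I wrote this page update monitor because I recently had to keep checking a university website for new announcements about the postgraduate entrance exam. Checking by hand is time-consuming, but Python makes web-related scripting quick, so the task is easy to automate.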

I. How It Works

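The first step is downloading the page. Examples of this are easy to find online, so I won't go into detail. Do pay attention to how Chinese characters in the page are handled, which is a long-standing problem in Python. I use the third-party chardet package here, which inspects the page and reports its actual encoding. After downloading, the page has to be saved in the system's default encoding, otherwise the text will be garbled; the best option is to save it as a binary file.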
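The article's getEncoding module is not listed, so the following is only a minimal sketch of the encoding step, written for Python 3. The names `guess_encoding`, `save_page`, and `META_CHARSET` are hypothetical. It assumes the page's own `<meta>` charset declaration is checked first, and it calls `chardet.detect` only if the chardet package happens to be installed.

```python
import re

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None

# Matches both <meta charset="gbk"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=gbk"> form.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def guess_encoding(raw, default="utf-8"):
    """Guess the encoding of raw HTML bytes (hypothetical helper)."""
    match = META_CHARSET.search(raw[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]
        if guess:
            return guess
    return default


def save_page(raw, path):
    """Write the bytes unchanged, so no encoding conversion can garble them."""
    with open(path, "wb") as f:
        f.write(raw)
```

Saving the raw bytes and decoding only when the text is needed follows the advice above to store the page as a binary file.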

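Next, the HTML file is filtered to extract the text content of the page. There are many examples of this online as well. After comparing them, I used nothing but regular expressions for the filtering and found both the results and the speed to be good.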
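The Html2Text module itself is not shown in this post, so the sketch below is an assumption about what regex-based filtering could look like, not the author's code. It is written for Python 3, and `html_to_text` and the pattern names are hypothetical.

```python
import re
from html import unescape

# Drop <script>/<style> blocks together with their contents.
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
COMMENT = re.compile(r"<!--.*?-->", re.S)
# Closing block-level tags and <br> become line breaks.
BLOCK_END = re.compile(r"<br\s*/?>|</(p|div|li|tr|h[1-6])\s*>", re.I)
TAG = re.compile(r"<[^>]+>")


def html_to_text(html):
    """Reduce an HTML string to its visible text, one block per line."""
    html = SCRIPT_STYLE.sub(" ", html)
    html = COMMENT.sub(" ", html)
    html = BLOCK_END.sub("\n", html)
    # Unescape entities only after the tags are gone, so that an
    # escaped "&lt;b&gt;" is not mistaken for a real tag.
    text = unescape(TAG.sub("", html))
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Regular expressions are not a full HTML parser, so unusual markup may slip through. For a monitor that only needs a stable text snapshot to compare, that is usually acceptable.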

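Finally, the page content captured at different times is compared. This uses the difflib module: whenever the two versions differ, the detailed differences are printed. The tool was tested on Windows 7 with Python 2.7.3; readers can write a Windows batch script to run it on a schedule. The first run produces no output, because it only downloads the page and stores the sample used for later comparisons. Also, when an update is detected, the previous comparison file (a plain .txt file) has to be replaced promptly, so that later runs compare against the latest version.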
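The first-run behaviour and the baseline refresh described above could be sketched as follows, for Python 3. `check_for_update` and `baseline_path` are hypothetical names. Note one difference from the tool in this post: the sketch overwrites the baseline automatically, which the original leaves to the user.

```python
import difflib
import os


def check_for_update(new_text, baseline_path):
    """Return None on the first run, otherwise a list of changed lines."""
    if not os.path.isfile(baseline_path):
        # First run: nothing to compare against, so just store the sample.
        with open(baseline_path, "w", encoding="utf-8") as f:
            f.write(new_text)
        return None

    with open(baseline_path, encoding="utf-8") as f:
        old_text = f.read()

    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # Keep only added ("+ ") and removed ("- ") lines.
    changes = [line for line in diff if line.startswith(("+ ", "- "))]

    if changes:
        # Refresh the baseline so the same change is not reported again.
        with open(baseline_path, "w", encoding="utf-8") as f:
            f.write(new_text)
    return changes
```

An empty list means the page text matches the stored sample; a non-empty list holds the lines that were added or removed since the last check.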

II. Code File Structure

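1) downloadHtml: downloads the page

2) getEncoding: detects the page's encoding

3) Html2Text: extracts the text content

4) differFile: compares files and reports differences

5) monitorHtml: runs the page update monitor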

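III. Test Screenshots

1) The page has not been updated

2) The page has been updated

Note the underlined parts: these are where the page content differs.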

IV. Main Code

－－－－－－－－－－－downloadHtml－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

import sys
import urllib2

import getEncoding


def downloadHtml(websize, savefile):
    '''
    Download an HTML page and save it to savefile. This method
    should not be used if the page contains Chinese characters.
    '''
    # First check the encoding of the page
    encoding = getEncoding.quick_getHtmlEncoding(websize)
    content = urllib2.urlopen(websize).read()
    # Re-encode the page in the file system's default encoding
    fs_encoding = sys.getfilesystemencoding()
    s = content.decode(encoding).encode(fs_encoding)
    # Write in binary mode so the bytes are stored unchanged
    with open(savefile, 'wb') as f:
        f.write(s)
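The listing above targets Python 2.7, where `urllib2` is available. On Python 3 the download step might look like the sketch below. `download_html` and its `timeout` parameter are hypothetical, and the sketch assumes the "save as binary" route recommended earlier: it stores the response bytes as they arrive instead of re-encoding them.

```python
import urllib.request


def download_html(url, savefile, timeout=30):
    """Fetch url and write the response body to savefile unchanged."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
    with open(savefile, "wb") as f:
        f.write(raw)
    return raw
```

Returning the raw bytes lets the caller detect the encoding and decode them later without reading the file again.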

－－－－－－－－－－－differFile－－－－－－－－－－－－－－－－－－－－－－

import difflib


def isDiff(srcfile, tarfile):
    '''
    Compare two files; return True if they are equal.
    '''
    with open(srcfile) as f:
        src = f.read().split(' ')
    with open(tarfile) as f:
        tar = f.read().split(' ')
    # Treat whitespace-only tokens as junk when matching
    temp = difflib.SequenceMatcher(lambda x: len(x.strip()) == 0, src, tar)
    for tag, i1, i2, j1, j2 in temp.get_opcodes():
        if tag != 'equal':
            # Any non-equal opcode means the files differ
            return False
    return True


def getDetails(srcfile, tarfile, flag='all'):
    '''
    Compare two files; if they differ, print the details.
    '''
    with open(srcfile) as f:
        file1_context = f.read().splitlines()
    with open(tarfile) as f:
        file2_context = f.read().splitlines()
    diff = difflib.Differ().compare(file1_context, file2_context)
    if flag == 'all':
        # Output the full diff
        print "\n".join(list(diff))
    else:
        # Output only the lines that differ
        linenum = 1
        for line in diff:
            if line[0] != ' ':
                print 'line:%d %s' % (linenum, line)
            else:
                linenum = linenum + 1
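One detail of `isDiff` is worth illustrating. The junk function passed to `SequenceMatcher` only influences how matching blocks are chosen; `get_opcodes` still reports junk elements as inserts or deletes. The Python 3 sketch below shows this and one possible way to really skip blank lines, by filtering them out before comparing. `opcode_tags` and `same_ignoring_blank_lines` are hypothetical helpers, not part of the original tool.

```python
import difflib


def is_blank(item):
    """Junk test: True for empty or whitespace-only items."""
    return not item.strip()


def opcode_tags(a, b):
    """Return the opcode tags SequenceMatcher reports for two sequences."""
    matcher = difflib.SequenceMatcher(is_blank, a, b)
    return [tag for tag, *_ in matcher.get_opcodes()]


def same_ignoring_blank_lines(old_text, new_text):
    """Compare two texts after dropping blank lines from both."""
    old = [line for line in old_text.splitlines() if line.strip()]
    new = [line for line in new_text.splitlines() if line.strip()]
    return old == new
```

With this approach a page whose only change is an extra empty line is not reported as updated.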

－－－－－－－－monitorHtml－－－－－－－－－－－－－－－－－－－－－－－－－

import os.path

import downloadHtml
import differFile
import Html2Text


def isExists(saveFile):
    '''
    Check whether the file exists.
    '''
    return os.path.isfile(saveFile)


def monitorHtml(websize, savehtml, savetxt, originaltxt):
    '''
    Monitor the given page; if its content has changed, print the details.
    '''
    downloadHtml.downloadHtml(websize, savehtml)
    if isExists(originaltxt):
        Html2Text.Html2Txt(savehtml, savetxt)
        # isDiff returns True when the two files are equal
        if differFile.isDiff(originaltxt, savetxt):
            print 'These two files are equal.'
        else:
            print 'These two files are different:'
            differFile.getDetails(originaltxt, savetxt, 'notall')
    else:
        # First run: store the extracted text as the comparison sample
        Html2Text.Html2Txt(savehtml, originaltxt)


if __name__ == '__main__':
    websize1 = 'http://www.baidu.com'
    # Raw strings keep the Windows backslashes literal
    srcname1 = r'E:\pyproj\differHtml\orginal.txt'
    htmlname1 = r'E:\pyproj\differHtml\src.htm'
    txtname1 = r'E:\pyproj\differHtml\src2txt.txt'
    monitorHtml(websize1, htmlname1, txtname1, srcname1)

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

P.S. Complete code download
