
[Python crawler] A crawler that scrapes Qiushibaike (糗事百科)

2015-07-24 10:29, 501 views


Posting the code first; I'll write up how it works once I'm home for the holidays.

# -*- coding:utf-8 -*-
# Python 2 script: fetches one page of hot posts and prints author, content and time.
import re
import sys
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # print content  # uncomment to inspect the raw HTML
except urllib2.URLError as e:
    # An HTTPError carries a status code; a plain URLError carries only a reason.
    if hasattr(e, 'code'):
        print e.code
    if hasattr(e, 'reason'):
        print e.reason
    sys.exit(1)  # without a response there is nothing to parse

# This shorter expression matches only the author:
# pattern = re.compile('<div class="author".*?>.*?<a.*?>.*?<img.*?/>(.*?)</a>.*?</div>', re.S)

# Groups: (1) author, (2) post content, (3) text of the HTML comment after the content, printed below as the time.
pattern = re.compile('<div class="author".*?>.*?<a.*?>.*?<img.*?/>(.*?)</a>.*?</div>.*?<div class="content">(.*?)<!--(.*?)-->.*?</div>', re.S)

items = re.findall(pattern, content)

for i in items:
    print '<<<' + '-' * 60 + '>>>'
    print 'author:' + i[0].strip()
    print 'content:' + i[1].strip()
    print 'time:' + i[2].strip()
    print '\n'
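
Since the post does not yet explain the regular expression, here is a minimal sketch of how its three capture groups behave, run offline against a small HTML fragment. The fragment in SAMPLE_HTML is made up for illustration and is only assumed to resemble the page structure the pattern targets; the real Qiushibaike markup may differ. The helper name extract_posts is also hypothetical, and the sketch is written for Python 3, unlike the Python 2 listing.

```python
import re

# Same expression as in the listing: group 1 is the author, group 2 the post
# text, group 3 the text inside the HTML comment that follows the post text.
PATTERN = re.compile(
    r'<div class="author".*?>.*?<a.*?>.*?<img.*?/>(.*?)</a>.*?</div>'
    r'.*?<div class="content">(.*?)<!--(.*?)-->.*?</div>',
    re.S,  # let "." match newlines, since each post spans several lines
)

# Hypothetical markup, invented so the pattern has something to match.
SAMPLE_HTML = """
<div class="author">
<a href="/users/1"><img src="a.jpg" alt="alice"/>alice</a>
</div>
<div class="content">
First joke.
<!--1437700000-->
</div>
<div class="author">
<a href="/users/2"><img src="b.jpg" alt="bob"/>bob</a>
</div>
<div class="content">
Second joke.
<!--1437700100-->
</div>
"""


def extract_posts(html):
    """Return a list of (author, content, comment_text) tuples, whitespace-stripped."""
    return [tuple(part.strip() for part in match)
            for match in PATTERN.findall(html)]


if __name__ == "__main__":
    for author, content, stamp in extract_posts(SAMPLE_HTML):
        print("author:", author)
        print("content:", content)
        print("time:", stamp)
```

Because every quantifier is lazy (`.*?`), each match stops at the nearest closing tag, which is what keeps one post from swallowing the next. A pattern like this is tightly coupled to the page layout, so it will silently return an empty list if the site changes its markup.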