Python Web Scraper Study Notes (1)

2017-04-07 10:08
Goal: scrape the jokes from Qiushibaike (糗事百科).
Code:
# -*- coding: utf-8 -*-
__author__ = 'beauty'

import re
import sys
import urllib2

# Encode printed text with the system encoding to avoid garbled output
fs_encoding = sys.getfilesystemencoding()

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # print content.encode(fs_encoding)
    # Capture groups: author (<h2>), joke text (content div), number (<i class="number">)
    pattern = re.compile('<div class="author clearfix">.*?href.*?<img src.*?title=.*?<h2>(.*?)</h2>.*?<div class="content">(.*?)</div>.*?<i class="number">(.*?)</i>', re.S)
    items = re.findall(pattern, content)
    # print items
    for item in items:
        print item[0].encode(fs_encoding), item[1].encode(fs_encoding), item[2].encode(fs_encoding)
except urllib2.URLError as e:
    # HTTPError carries a status code; a plain URLError carries only a reason
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
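The script above targets Python 2 (urllib2, print statements). As a rough sketch only, the same two steps (building a request with a custom User-Agent, then pulling out the fields with the regex) might look like this in Python 3. The helper names build_request and extract_items and the SAMPLE_HTML fragment are hypothetical and not from the original post. The fragment is hand-written to fit the pattern, so the live site's markup may well differ, and no network request is sent here.

```python
import re
import urllib.request

BASE_URL = 'http://www.qiushibaike.com/hot/page/'
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Same pattern as the original script; re.S lets '.' match newlines.
ITEM_PATTERN = re.compile(
    '<div class="author clearfix">.*?href.*?<img src.*?title=.*?'
    '<h2>(.*?)</h2>.*?<div class="content">(.*?)</div>.*?'
    '<i class="number">(.*?)</i>',
    re.S,
)


def build_request(page):
    """Build (but do not send) a request for one page of the hot list."""
    return urllib.request.Request(
        BASE_URL + str(page), headers={'User-Agent': USER_AGENT}
    )


def extract_items(html):
    """Return a list of (author, text, number) tuples found in html."""
    return [
        tuple(field.strip() for field in match)
        for match in ITEM_PATTERN.findall(html)
    ]


# Hypothetical fragment written to mimic the structure the pattern expects.
SAMPLE_HTML = '''
<div class="author clearfix">
<a href="/users/1/"><img src="a.jpg" alt="alice"></a>
<a href="/users/1/" title="alice"><h2>alice</h2></a>
</div>
<div class="content">first joke</div>
<div class="stats"><i class="number">120</i></div>
<div class="author clearfix">
<a href="/users/2/"><img src="b.jpg" alt="bob"></a>
<a href="/users/2/" title="bob"><h2>bob</h2></a>
</div>
<div class="content">second joke</div>
<div class="stats"><i class="number">45</i></div>
'''

request = build_request(1)
items = extract_items(SAMPLE_HTML)
for author, text, number in items:
    print(author, text, number)
```

To fetch a real page, the request could be passed to urllib.request.urlopen inside a try/except for urllib.error.URLError, mirroring the error handling in the original script.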
Output when run in PyCharm: