
Python Web Scraping Study Notes (1)

2017-04-07 10:08

Goal: scrape the jokes from Qiushibaike (糗事百科). Code:

# -*- coding: utf-8 -*-
__author__ = 'beauty'

import sys
type = sys.getfilesystemencoding()  # encoding used when printing, to avoid garbled characters

import urllib2
import re

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
try:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # print content.encode(type)
    pattern = re.compile('<div class="author clearfix">.*?href.*?<img src.*?title=.*?<h2>(.*?)</h2>.*?<div class="content">(.*?)</div>.*?<i class="number">(.*?)</i>',re.S)
    items = re.findall(pattern,content)
    # print items
    for item in items:
        print item[0].encode(type),item[1].encode(type),item[2].encode(type)
except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason
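
The script above is Python 2 (urllib2, print statements). To try the request setup and the regular expression without hitting the site, here is a minimal Python 3 sketch. The URL, User-Agent and pattern are copied from the script. The helper names build_request and extract_items and the SAMPLE_HTML fragment are assumptions made up for illustration: the fragment is hand-written to fit the pattern and is not the real page markup, which may well have changed since these notes were written.

```python
import re
import urllib.request

# Pattern copied from the script above: captures author, joke text and vote count.
PATTERN = re.compile(
    r'<div class="author clearfix">.*?href.*?<img src.*?title=.*?<h2>(.*?)</h2>'
    r'.*?<div class="content">(.*?)</div>.*?<i class="number">(.*?)</i>',
    re.S,  # let "." match newlines, since each item spans several lines
)


def build_request(page):
    """Build (but do not send) the request for one page of the hot list."""
    url = 'http://www.qiushibaike.com/hot/page/' + str(page)
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    return urllib.request.Request(url, headers=headers)


def extract_items(html):
    """Return a list of (author, content, votes) tuples with whitespace stripped."""
    return [tuple(field.strip() for field in match)
            for match in PATTERN.findall(html)]


# Hypothetical fragment, hand-written to fit the pattern; not real page markup.
SAMPLE_HTML = '''
<div class="author clearfix">
<a href="/users/1/"><img src="a.jpg" alt="alice"></a>
<a href="/users/1/" title="alice"><h2>
alice
</h2></a>
</div>
<div class="content">
first joke
</div>
<span class="stats-vote"><i class="number">123</i> votes</span>

<div class="author clearfix">
<a href="/users/2/"><img src="b.jpg" alt="bob"></a>
<a href="/users/2/" title="bob"><h2>
bob
</h2></a>
</div>
<div class="content">
second joke
</div>
<span class="stats-vote"><i class="number">45</i> votes</span>
'''

if __name__ == '__main__':
    for author, content, votes in extract_items(SAMPLE_HTML):
        print(author, content, votes)
```

To fetch a live page, the request from build_request(1) can be passed to urllib.request.urlopen inside a try/except urllib.error.URLError block, mirroring the error handling in the original script. In Python 3, print writes str directly, so the encode(type) calls are not needed.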

Output when run in PyCharm:
