
"Web Scraping with Python": Writing Your First Web Crawler

2017-03-30 21:18

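Writing your first Python web crawler

To scrape a website, we first need to download the pages that contain the data of interest, a process generally known as crawling.

This article covers three ways to fetch page content: crawling a sitemap file, iterating over page IDs, and following links.

Downloading a web page

To crawl a page, we first have to download it. The following script does that: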

import urllib2

def download(url):
    # Fetch the URL and return the response body (the page HTML)
    return urllib2.urlopen(url).read()

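Given a URL, this function downloads the page and returns its HTML.

This snippet has a problem, though: if the URL does not exist, urllib2 raises an exception. An improved version catches it: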

import urllib2

def download(url):
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        # Report the failure and return None instead of raising
        print 'Downloading error:', e.reason
        html = None
    return html

print download('http://www.sse.com.cn')

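Download errors are often temporary, such as the 503 error a server returns when it is overloaded. For errors like these, simply retrying the download is enough. Only 5xx (server-side) errors are worth retrying, since a 4xx error points to a problem with the request itself. The code below adds retry support: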

import urllib2

def download(url, num_reload=5):
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        # Retry only on 5xx server errors, while attempts remain
        if num_reload > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, num_reload - 1)
    return html

download('http://httpstat.us/500')
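The retry logic can be tried without touching the network. The following is a minimal sketch, assuming Python 3 (where urllib2 was split into urllib.request and urllib.error); the opener parameter is a hypothetical addition, not part of the article's function, that lets a fake fetcher stand in for urlopen.

```python
import urllib.error
import urllib.request


def download(url, num_reload=5, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the download fails."""
    try:
        return opener(url).read()
    except urllib.error.URLError as e:
        print('Downloading error:', e.reason)
        # HTTPError (a URLError subclass) carries the HTTP status code
        code = getattr(e, 'code', None)
        if num_reload > 0 and code is not None and 500 <= code < 600:
            # Retry 5xx errors until the retry budget is used up
            return download(url, num_reload - 1, opener)
        return None
```

Passing a function that raises urllib.error.HTTPError with code 503 a couple of times before succeeding shows the retries; a 404 returns None after a single attempt.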

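The following step is optional:

Setting a user agent

By default, urllib2 downloads pages with the user agent Python-urllib/2.7, where 2.7 is the Python version. The version below accepts a custom user agent, defaulting to 'wswp':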

import urllib2

def download(url, user_agent='wswp', num_reload=5):
    # Send the custom user agent with the request
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        # Retry only on 5xx server errors, while attempts remain
        if num_reload > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_reload - 1)
    return html

download('http://httpstat.us/500')

We now have a flexible download function whose user agent and number of retries are both configurable.
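To confirm that the header is actually attached, the request object can be inspected before it is sent. This is a small sketch, assuming Python 3's urllib.request in place of urllib2; build_request is a hypothetical helper name used only for illustration.

```python
import urllib.request


def build_request(url, user_agent='wswp'):
    """Build a Request object that carries a custom User-agent header."""
    headers = {'User-agent': user_agent}
    return urllib.request.Request(url, headers=headers)


request = build_request('http://example.webscraping.com/')
print(request.get_header('User-agent'))  # prints: wswp
```

No network access happens here; the request is only sent once it is passed to urlopen.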

Sitemap crawler

This crawler uses the sitemap (the Sitemap file) listed in robots.txt to download all of a site's pages.

robots.txt

# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
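The rules in this file can also be checked programmatically. The sketch below assumes Python 3.8 or later and uses the standard library's urllib.robotparser, which the article itself does not use; the robots.txt text is parsed from a string, so nothing is downloaded.

```python
import urllib.robotparser

ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

base = 'http://example.webscraping.com'
print(rp.can_fetch('BadCrawler', base + '/'))  # False: section 1 bans this agent
print(rp.can_fetch('wswp', base + '/'))        # True: allowed by section 2
print(rp.can_fetch('wswp', base + '/trap'))    # False: /trap is disallowed
print(rp.crawl_delay('wswp'))                  # 5
print(rp.site_maps())                          # list holding the Sitemap URL
```

Section 2 applies to every agent other than BadCrawler, so the 'wswp' user agent may crawl everything except /trap and is asked to wait 5 seconds between requests.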

Below is the code for a crawler that uses the sitemap file:

#!/usr/bin/python
#coding:utf-8
import urllib2
import re

def download(url, user_agent='wswp', num_reload=5):
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        # Retry only on 5xx server errors, while attempts remain
        if num_reload > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_reload - 1)
    return html

def crawl_sitemap(url):
    sitemap = download(url)  # download the sitemap file
    if sitemap is None:
        return  # the download failed, so there is nothing to parse
    links = re.findall('<loc>(.*?)</loc>', sitemap)  # extract the links listed in the sitemap
    for link in links:
        print link
        html = download(link)
        # print html

crawl_sitemap('http://example.webscraping.com/sitemap.xml')

Running the code prints each link found in the sitemap and then downloads it.
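The extraction step can be exercised on its own with an in-memory sitemap. This is a minimal sketch in Python 3; SAMPLE_SITEMAP is made-up content for illustration, and extract_links is a hypothetical helper name that isolates the regular expression used in crawl_sitemap.

```python
import re

# Made-up sitemap content for illustration; a real file would be downloaded
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
</urlset>
"""


def extract_links(sitemap):
    """Return the URLs inside <loc> tags, or an empty list for a failed download."""
    if sitemap is None:
        return []
    return re.findall(r'<loc>(.*?)</loc>', sitemap)


for link in extract_links(SAMPLE_SITEMAP):
    print(link)
```

In Python 3, urlopen(...).read() returns bytes, so a downloaded sitemap would need to be decoded (for example with .decode('utf-8')) before a str pattern like this one can be applied.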

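ID iteration crawler

We can use the URL pattern

http://example.webscraping.com/view/-%d

to match all pages, substituting each page ID for %d. Below is the sample code: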

#!/usr/bin/python
#coding:utf-8
import urllib2
import itertools

def download(url, user_agent='wswp', num_reload=5):
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        # Retry only on 5xx server errors, while attempts remain
        if num_reload > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_reload - 1)
    return html

# Try page IDs 1, 2, 3, ... and stop at the first failed download
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/%d' % page
    html = download(url)
    print url
    if html is None:
        break
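The loop above stops at the first failed download, so a single missing ID in the middle of the sequence would end the crawl early. One possible refinement, sketched here in Python 3, is to stop only after several consecutive failures. The names crawl_ids and max_errors are assumptions introduced for this example, and the download function is passed in as a parameter so the logic can be tried with a fake.

```python
import itertools


def crawl_ids(download, url_template, max_errors=5):
    """Download pages by ID, stopping after max_errors consecutive failures."""
    pages = {}
    num_errors = 0
    for page in itertools.count(1):
        url = url_template % page
        html = download(url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                break  # assume we have run past the last ID
        else:
            num_errors = 0  # a success resets the failure streak
            pages[url] = html
    return pages
```

With max_errors=1 this behaves like the original loop; larger values tolerate gaps at the cost of a few extra requests at the end.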

Link crawler

This time we will use a regular expression to decide which pages to download.
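The article stops before showing this crawler, so the following is only a sketch of the idea in Python 3, not the author's code. The function names get_links and link_crawler, the link_regex argument, and the injected download function are all assumptions made for illustration.

```python
import re
from urllib.parse import urljoin


def get_links(html):
    """Return the href value of every <a> tag found in the page."""
    webpage_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)


def link_crawler(seed_url, link_regex, download):
    """Crawl from seed_url, following only links that match link_regex."""
    crawl_queue = [seed_url]
    seen = {seed_url}  # avoid downloading the same page twice
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue  # skip pages that failed to download
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)  # make relative links absolute
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
    return seen
```

A call such as link_crawler('http://example.webscraping.com', '/(index|view)', download) would then follow only links whose path starts with /index or /view; the seen set keeps pages that link to each other from being downloaded in an endless loop.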
