
Recursively scraping web pages with Scrapy

2014-06-26 14:49

A Scrapy spider's parse method can return two kinds of values: a BaseItem or a Request. Returning a Request is what makes recursive scraping possible.

If the data you want is on the current page, parse it and return the item directly (in the code, change the line marked with ** to yield item).

If the data is on a page that the current page links to, return a Request and set parse_item as its callback.

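If part of the data is on the current page and the rest is on the linked page (a blog or forum, for example, where the list page has the title, summary and url and the detail page has the full content), use the meta argument of Request to pass the data parsed so far to parse_item, which then fills in the remaining fields of the item.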

To move on to other pages after finishing the current one (the next page, for example), return a Request whose callback is parse.
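The cases above can be modelled without Scrapy or a network connection. The sketch below is a toy request queue, not Scrapy's actual engine: the Request class, the PAGES dictionary and the crawl function are all made up for illustration. It only shows how a callback can yield either finished items or further requests, and how meta carries a half-filled item to the detail-page callback.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> page data. Stands in for real HTTP responses.
PAGES = {
    "/page/1": {"articles": [("First post", "/post/1")], "next": "/page/2"},
    "/page/2": {"articles": [("Second post", "/post/2")], "next": None},
    "/post/1": {"content": "Full text of the first post"},
    "/post/2": {"content": "Full text of the second post"},
}


class Request:
    """Toy stand-in for scrapy.http.Request: a URL, a callback, and a meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


def parse(page, meta):
    """List page: start one item per article, then follow the next-page link."""
    for title, url in page["articles"]:
        item = {"title": title, "url": url}
        # Hand the half-filled item to the detail-page callback through meta.
        yield Request(url, callback=parse_item, meta={"item": item})
    if page["next"]:
        # Recursion: the same callback handles the next list page.
        yield Request(page["next"], callback=parse)


def parse_item(page, meta):
    """Detail page: finish the item that the list page started."""
    item = meta["item"]
    item["content"] = page["content"]
    yield item


def crawl(start_url):
    """Drain a queue of requests; collect items, enqueue any further requests."""
    queue = deque([Request(start_url, callback=parse)])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(PAGES[request.url], request.meta):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl("/page/1")
```

Each item ends up with a title and url from the list page and a content field from the detail page. In a real spider the callbacks receive a response object and read the carried data from response.meta instead of a separate argument.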

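One oddity: parse cannot return a list of items, while parse_item, used as a callback, can. The likely reason is that parse in this spider contains yield, which makes it a generator function, and Python 2 does not allow return with a value inside a generator. parse_item has no yield, so it is an ordinary function and may return a list.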
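The generator behaviour can be checked in plain Python, independent of Scrapy. The sketch below is written for Python 3 and uses made-up function names and placeholder strings instead of real items and requests. It assumes the difference comes down to whether the function body contains yield.

```python
def parse_with_yield():
    """Contains a yield, so calling it returns a generator object."""
    yield "request-for-next-page"
    # Python 3 accepts this line, but the list becomes the StopIteration value
    # and is never seen by a caller that simply iterates. Python 2 rejects it
    # at compile time.
    return ["item-1", "item-2"]


def parse_item_with_return():
    """No yield: an ordinary function, so the list reaches the caller intact."""
    return ["item-1", "item-2"]


def parse_yielding_each():
    """Workaround: yield the items one by one instead of returning the list."""
    yield "request-for-next-page"
    for item in ["item-1", "item-2"]:
        yield item


from_generator = list(parse_with_yield())
from_function = list(parse_item_with_return())
from_workaround = list(parse_yielding_each())
```

Iterating over the first function produces only the yielded value; the returned list is lost. A callback that needs to emit both requests and several items can therefore yield each item in a loop.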

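Also, the text extracted with d.xpath('text()').extract() does not include the content of child tags such as <a>. Use d.xpath('node()').extract() instead: it returns text that still contains HTML, and stripping the tags from it leaves the plain text.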

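I did not find a method that returns the inner HTML directly; d.extract() returns the whole element, including its own tag.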
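Stripping the tags from the node() fragments can be done with the standard library. The following is a minimal Python 3 sketch, assuming the fragments arrive as a list of strings; TextCollector, strip_tags and the sample fragments are hypothetical names and data, not part of Scrapy.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only the text nodes of the fragments fed to it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragments):
    """Join HTML fragments and return their text with all tags removed."""
    collector = TextCollector()
    collector.feed("".join(fragments))
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


# Hypothetical sample of what d.xpath('node()').extract() might return.
fragments = ["Read the ", '<a href="/docs">docs</a>', " first."]
text = strip_tags(fragments)
```

Using a parser rather than a regular expression keeps character references such as &amp; decoded correctly. Within XPath itself, an expression such as .//text() selects the text nodes of all descendants, which may avoid the stripping step altogether.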

from scrapy import Request, Spider
from scrapy.selector import Selector

from dirbot.items import Article


class YouyousuiyueSpider(Spider):
    name = "youyousuiyue2"
    allowed_domains = ["youyousuiyue.sinaapp.com"]

    start_urls = [
        'http://youyousuiyue.sinaapp.com',
    ]

    def load_item(self, d):
        # Fill in the fields that are available on the list page.
        item = Article()
        title = d.xpath('header/h1/a')
        item['title'] = title.xpath('text()').extract()
        item['url'] = title.xpath('@href').extract()
        self.log('title: %s' % item['title'])
        return item

    def parse_item(self, response):
        # Detail page: pick up the item started in parse() and add the content.
        item = response.meta['item']

        sel = Selector(response)
        d = sel.xpath('//div[@class="entry-content"]/div')
        item['content'] = d.xpath('text()').extract()
        return item

    def parse(self, response):
        """
        The lines below are spider contracts. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://youyousuiyue.sinaapp.com
        @scrapes title url
        """
        self.log('parsing %s' % response.url)
        sel = Selector(response)
        articles = sel.xpath('//div[@id="content"]/article')
        for d in articles:
            item = self.load_item(d)
            if not item['url']:
                continue  # no detail link to follow
            yield Request(item['url'][0], meta={'item': item}, callback=self.parse_item)  # ** or yield item

        # Follow the "previous posts" link so the next list page is parsed too.
        links = sel.xpath('//div[@class="nav-previous"]/a/@href').extract()
        if not links:
            return  # no further page
        link = links[0]
        if link[-1] == '4':
            return  # stop once the link ends in '4' (presumably to cap the crawl)
        self.log('yielding %s' % link)
        yield Request(link, callback=self.parse)

Full code: https://github.com/junglezax/dirbot

References:

http://doc.scrapy.org/en/latest/intro/tutorial.html

http://www.icultivator.com/p/3166.html


