
Notes from Debugging Scrapy

2016-10-17 17:51
1. When crawling a large number of pages, a "Memory Error" occurred. Workaround: keep the pending URLs in your own queue (database/file) and only feed requests to the crawler when the spider goes idle, for example:

from scrapy import signals, Spider
from scrapy.xlib.pydispatch import dispatcher

class ExampleSpider(Spider):
    name = "example"
    start_urls = ['http://www.example.com/']

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        # connect the handler to the spider_idle signal
        dispatcher.connect(self.queue_more_requests, signals.spider_idle)

    def queue_more_requests(self, spider):
        # this runs every time the spider is done processing
        # all requests/items (i.e. idle)

        # get the next urls from your database/file
        urls = self.get_urls_from_somewhere()

        # if there are no more urls to process, do nothing and
        # the spider will finally close
        if not urls:
            return

        # iterate through the urls, create a request for each, and send
        # them back to the crawler; this takes the spider out of its idle state
        for url in urls:
            req = self.make_requests_from_url(url)
            self.crawler.engine.crawl(req, spider)

    def parse(self, response):
        pass


More info on the spider_idle signal: http://doc.scrapy.org/en/latest/topics/signals.html#spider-idle
More info on debugging memory leaks: http://doc.scrapy.org/en/latest/topics/leaks.html
P.S. There is also a way to limit the crawl depth (the DEPTH_LIMIT setting in settings.py); still to be investigated, but see the sketch below.
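
A minimal settings.py sketch of that depth limit (DEPTH_LIMIT is a standard Scrapy setting, where 0 means no limit; the value 3 here is only illustrative):

# settings.py
# Do not follow links more than 3 hops away from the start URLs.
DEPTH_LIMIT = 3
# Optionally collect per-depth request counts in the crawl stats.
DEPTH_STATS_VERBOSE = True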

2. If a requested URL does not exist (404), the callback never receives a response object and the spider simply does nothing (Scrapy's HttpErrorMiddleware filters out non-2xx responses by default).
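
If you do want the callback to see 404s, Scrapy lets a spider whitelist status codes via handle_httpstatus_list (standard Scrapy behavior; the spider below is a hypothetical sketch, not from the original post):

from scrapy import Spider

class NotFoundAwareSpider(Spider):
    name = "notfound_aware"
    start_urls = ['http://www.example.com/']
    # let 404 responses pass HttpErrorMiddleware through to the callback
    handle_httpstatus_list = [404]

    def parse(self, response):
        if response.status == 404:
            self.logger.warning("Page not found: %s", response.url)
            return
        # ... normal parsing for 2xx responses ...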

3. Encoding issues

In pubmed_spider.py:

import sys
reload(sys)
# Python 2's default encoding is ASCII; reload(sys) restores the
# setdefaultencoding function that the site module deletes at startup
sys.setdefaultencoding("utf-8")


This makes sure the scraped data is handled as UTF-8. (Note: this trick is Python 2 only; setdefaultencoding does not exist in Python 3, where str is Unicode by default.)

In pipeline.py, open the output file with an explicit UTF-8 encoding so the data is stored as UTF-8:

file = codecs.open('/%s.txt' % (item['name']), mode='w', encoding='utf-8')
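
For context, a minimal pipeline sketch built around that line (the class name and the one-file-per-item layout are assumptions, as is serializing the whole item; only the codecs.open call is from the original):

import codecs

class TxtExportPipeline(object):
    # write each item to its own UTF-8 text file named after item['name']
    def process_item(self, item, spider):
        f = codecs.open('/%s.txt' % (item['name']), mode='w', encoding='utf-8')
        try:
            # hypothetical serialization; write whatever fields you need
            f.write(u"%s\n" % (dict(item),))
        finally:
            f.close()
        return item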