
Scrapy (Python Crawler Framework) Study Notes 5 ------- Filtering Sensitive Words with Pipelines

2015-01-06 17:59
We keep working with the same site as in the previous post, this time adding pipeline.py.
items.py

from scrapy.item import Item, Field

class Website(Item):
    name = Field()
    description = Field()
    url = Field()
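
A Website item behaves like a dict restricted to the three declared fields. A quick sketch (not from the original post, with made-up values) shows how the spider and the pipeline below will use it:

from dirbot.items import Website

item = Website()
item['name'] = [u'Example Site']
item['url'] = [u'http://example.com/']
item['description'] = [u'- A short description.\r']

print(dict(item))   # plain dict view of the populated fields
# Assigning to an undeclared field, e.g. item['author'], raises KeyError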


dmoz.py

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below are a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)

        return items
Note that the extraction of description is slightly different from last time: the regular expression now strips the surrounding whitespace and line breaks.
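
To see what that expression keeps and what it throws away, here is a small standalone sketch; the sample text below is made up, not real page output, and only the pattern is the one used in the spider:

import re

# Hypothetical fragment of raw node text, similar in shape to the page output
raw = '\r\n\t\t- A short book description.\r\n\t\t'

# Same pattern as in the spider: the match starts at the dash, so the
# leading newline/tabs are dropped; only the text up to the next \r is kept
print(re.findall(r'-\s[^\n]*\r', raw))
# ['- A short book description.\r']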

pipeline.py

from scrapy.exceptions import DropItem

class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in their
    description"""

    # put all words in lowercase
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            if word in unicode(item['description']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            # for/else: this branch runs only when the loop finishes without
            # raising, i.e. no filtered word was found, so the item is kept
            return item


Effect: any item whose description contains 'politics' or 'religion' is dropped from the output.
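
One step the post leaves out: the pipeline only runs if it is enabled in the project settings. A sketch of the settings.py entry for a Scrapy version of that era is below; the module path 'dirbot.pipelines' is an assumption based on the dirbot project layout, so adjust it if your file is really named pipeline.py:

# settings.py (sketch, assuming the class lives in dirbot/pipelines.py)
ITEM_PIPELINES = {
    'dirbot.pipelines.FilterWordsPipeline': 300,  # lower number = earlier in the chain
}

After that, running scrapy crawl dmoz as before will drop every item whose description contains a filtered word, and Scrapy logs a "Dropped" warning carrying the DropItem message for each one.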