Python Crawler Framework Scrapy Study Notes 5 ------- Filtering Sensitive Words with pipelines
2015-01-06 17:59
Building on the same site as the previous post, this time we add pipeline.py. Three files are involved:

- items.py
- dmoz.py
- pipeline.py

Purpose: drop any item whose description contains 'politics' or 'religion'.
items.py
```python
from scrapy.item import Item, Field


class Website(Item):
    name = Field()
    description = Field()
    url = Field()
```
dmoz.py
```python
from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)

        return items
```

Note that the extraction of description differs from last time: the regex passed to `.re()` strips the surrounding whitespace and newlines.
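To see exactly what that `.re()` pattern keeps, you can try it with the standard `re` module on a made-up sample of the raw `text()` node (the sample string below is hypothetical; on DMOZ the description sits between a `\r\n` + whitespace prefix and a `\r\n` suffix):

```python
import re

# Hypothetical raw text of a DMOZ listing's text() node
raw = "\r\n\t- By Mark Lutz; introduces core Python.\r\n"

# Same pattern the spider passes to .re(): match from "- " up to the
# trailing \r, which drops the leading whitespace and final newline.
pattern = r'-\s[^\n]*\r'
print(re.findall(pattern, raw))
# → ['- By Mark Lutz; introduces core Python.\r']
```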
pipeline.py
```python
from scrapy.exceptions import DropItem


class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain
    certain words in their description"""

    # put all words in lowercase
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        # unicode() is Python 2; use str() on Python 3
        for word in self.words_to_filter:
            if word in unicode(item['description']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            # the loop finished without raising, so the item is clean
            return item
```
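A pipeline does nothing until it is registered in the project's settings.py. A sketch, assuming the project module is named `dirbot` (as the spider's import suggests) and the pipeline file is on the project path; the number sets the execution order among pipelines:

```python
# settings.py -- register the pipeline so Scrapy calls its
# process_item() for every scraped item (lower number = runs earlier)
ITEM_PIPELINES = {
    'dirbot.pipelines.FilterWordsPipeline': 300,
}
```

Older Scrapy releases used a plain list (`ITEM_PIPELINES = [...]`) instead of a dict with order values; check the version you are running.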