
Scrapy crawlers: saving data with item pipelines

2018-02-01 08:25

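Introduction

In the earlier posts we always passed the "-o *.json" option to write the extracted item data to a JSON file; without that option the extracted data is not saved anywhere. In fact, once an Item has been collected in a Spider, it is handed to the Item Pipeline, whose components process the Item in a defined order. When a project is created, Scrapy generates a default pipelines.py, for example: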

vim pipelines.py
class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item

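However, this default pipeline only returns the item unchanged, so running the crawler still produces no saved output.

Below we define our own pipelines so that the extracted items are written to a JSON file and to a MongoDB database.

The crawler used here comes from the earlier post "scrapy爬虫之crawlspide爬取豆瓣近一周同城活动" (using CrawlSpider to crawl this week's same-city events on Douban); only the item and the item pipeline need to be updated.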

Exporting to a JSON file

1. Define the item

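vim items.py
import scrapy
# Scrapy 1.x import path; newer releases provide Join in itemloaders.processors
from scrapy.loader.processors import Join

def filter_string(x):
    # Split on ':' and return the second piece with surrounding whitespace removed
    parts = x.split(':')
    return parts[1].strip()

class tongcheng(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Time
    time = scrapy.Field()
    # Address (the extracted fragments are joined into one string)
    address = scrapy.Field(output_processor=Join())
    # Ticket price
    money = scrapy.Field()
    # Number of people interested
    intrest = scrapy.Field()
    # Number of people attending
    join = scrapy.Field()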
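
To make the behaviour of the filter_string helper above concrete, here is a small stand-alone sketch. The sample label strings are made up for illustration (they are not taken from the crawl), and filter_string_rest is a hypothetical variant, not part of the original items.py, that keeps everything after the first colon.

```python
def filter_string(x):
    # Same logic as the helper in items.py: keep the second ':'-separated piece
    parts = x.split(':')
    return parts[1].strip()


def filter_string_rest(x):
    # Hypothetical variant: split only once, so later colons survive
    return x.split(':', 1)[1].strip()


# Made-up sample labels, for illustration only
fee = filter_string("fee: 263 yuan")
clock_cut = filter_string("time: 19:30")
clock_full = filter_string_rest("time: 19:30")

print(fee)         # 263 yuan
print(clock_cut)   # 19
print(clock_full)  # 19:30
```

If the text after the label can itself contain colons, as a clock time does, the single-split variant may be the safer choice.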

2. Define the item pipeline

vim pipelines.py
# Export in JSON format
from scrapy.exporters import JsonItemExporter
# Export in JSON Lines (jl) format
#from scrapy.exporters import JsonLinesItemExporter
# Export in CSV format
#from scrapy.exporters import CsvItemExporter

class tongcheng_pipeline_json(object):
    def open_spider(self, spider):
        # Optional hook, called when the spider is opened.
        # Write to the file tongcheng_pipeline.json (exporters expect a binary file)
        self.file = open('tongcheng_pipeline.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()

    def close_spider(self, spider):
        # Optional hook, called when the spider is closed.
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
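
A Scrapy pipeline is a plain class with the methods shown above, so the same open/process/close life cycle can also be sketched with only the standard library. The class below is an illustrative alternative, not the author's pipeline: the name JsonLinesSketchPipeline, the path argument and the hand-made item dict are assumptions, and it writes one JSON object per line instead of using JsonItemExporter.

```python
import json
import os
import tempfile


class JsonLinesSketchPipeline(object):
    """Hypothetical pipeline: writes each item as one JSON line."""

    def __init__(self, path='tongcheng_sketch.jl'):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) also works for scrapy.Item instances
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item


# Drive the pipeline by hand, the way Scrapy would for a single item
out_path = os.path.join(tempfile.mkdtemp(), 'tongcheng_sketch.jl')
pipeline = JsonLinesSketchPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'title': ['demo event'], 'join': ['69 ']}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding='utf-8') as f:
    written = [json.loads(line) for line in f]
print(written)
```

Calling the three hooks by hand like this is also a convenient way to check a pipeline before wiring it into a crawl.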

3. Activate the item pipeline

A pipeline has no effect until it is enabled in the configuration file, so we need to edit settings.py.

vim settings.py
ITEM_PIPELINES = {
    # The generated default; we have not customized it, so it is commented out.
    #'douban.pipelines.DoubanPipeline': 300,
    # Add our newly defined pipeline here
    'douban.pipelines.tongcheng_pipeline_json': 300,
}

4. Run the crawler

scrapy crawl tongcheng
# The log output includes:
2018-01-20 10:48:10 [scrapy.middleware] INFO: Enabled item pipelines:
['douban.pipelines.tongcheng_pipeline_json']
....

# Inspect tongcheng_pipeline.json
cat tongcheng_pipeline.json
[{"money": ["263元"], "address": "深圳  深圳市少年宫剧场 深圳市福田区福中一路市少年宫", "join": ["69 "], "intrest": ["174 "], "title": ["孟京辉戏剧作品《一个陌生女人的来信》深圳站"]},{"money": ["93 - 281元"], "address": "深圳  南山文体中心剧院 小剧场 深圳市南山区南山大道南山文体中心", "join": ["4 "], "intrest": ["11 "],"title": ["2018第五届城市戏剧节 诗·歌·舞变奏三幕剧《木心·人曲》-深圳"]}.....]

This output shows that the crawler invoked the pipeline enabled in the configuration file and wrote the extracted items to tongcheng_pipeline.json.
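
In the output above, address is a single string while the other fields are lists. That difference comes from output_processor=Join() on the address field. The sketch below imitates the effect in plain Python; join_values is a hypothetical stand-in that assumes Join's default separator of a single space, and the input list is made up.

```python
def join_values(values, separator=' '):
    # Stand-in for the Join output processor: concatenate the collected values
    return separator.join(values)


# Made-up extracted fragments for one address
fragments = ["Shenzhen", "Children's Palace Theatre", "Fuzhong 1st Road"]
address = join_values(fragments)
print(address)
```

Fields without an output processor keep the list of extracted values, which is why title, money, join and intrest appear as lists.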

Notes

1. Pipelines enabled in settings.py are invoked by default, in priority order, by every spider in the project. For example:

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
    'douban.pipelines.movieTop250_crawlspider_json': 200,
    'douban.pipelines.tongcheng_pipeline_json': 100,
}

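When we run "scrapy crawl tongcheng", the pipelines are invoked in ascending order of their values, that is 100, then 200, then 300 (a lower value runs earlier), as the log output shows: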

2018-01-20 10:48:10 [scrapy.middleware] INFO: Enabled item pipelines:
['douban.pipelines.tongcheng_pipeline_json',
'douban.pipelines.movieTop250_crawlspider_json',
'douban.pipelines.DoubanPipeline'
]
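
The ordering rule can be reproduced with a few lines of plain Python. This is only a sketch of the rule, not Scrapy's own implementation; call_order is a name introduced here for illustration.

```python
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
    'douban.pipelines.movieTop250_crawlspider_json': 200,
    'douban.pipelines.tongcheng_pipeline_json': 100,
}

# Sort by the integer value: the lowest value is invoked first
call_order = [
    name
    for name, value in sorted(ITEM_PIPELINES.items(), key=lambda pair: pair[1])
]

for name in call_order:
    print(name)
```

The order in which the entries are written in the dict does not matter; only the integer values do.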

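2. Binding pipelines to specific spiders

A project often contains several spiders with different purposes, and each should be bound to its own pipeline so that its extracted content is saved in the right place. How can this be done?

Scrapy reads settings from several sources. In order of precedence, highest first, they are: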

1. Command line options (most precedence)
2. Settings per-spider
3. Project settings module
4. Default settings per-command
5. Default global settings (less precedence)

The settings.py we have been using is the "Project settings module", so binding a pipeline to a spider only requires a settings source with higher precedence, such as "Settings per-spider".

vim tongcheng.py
# Only custom_settings is new; the rest of the spider is unchanged
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule

# Assumes the project package is named douban, as in ITEM_PIPELINES
from douban.items import tongcheng

class TongchengSpider(CrawlSpider):
    name = 'tongcheng'
    allowed_domains = ['douban.com']
    start_urls = ['https://www.douban.com/location/shenzhen/events/week-all']
    custom_settings = {
        'ITEM_PIPELINES': {
            'douban.pipelines.tongcheng_pipeline_json': 300,
        }
    }
    rules = (
        Rule(LinkExtractor(allow=r'start=10')),
        Rule(LinkExtractor(allow=r'https://www.douban.com/event/\d+/'), callback='parse_item'),
    )

    def parse_item(self, response):
        loader = ItemLoader(item=tongcheng(), selector=response)
        info = loader.nested_xpath('//div[@class="event-info"]')
        info.add_xpath('title', 'h1[@itemprop="summary"]/text()')
        info.add_xpath('time', 'div[@class="event-detail"]/ul[@class="calendar-strs"]/li/text()')
        info.add_xpath('address', 'div[@itemprop="location"]/span[@class="micro-address"]/span[@class="micro-address"]/text()')
        info.add_xpath('money', 'div[@class="event-detail"]/span[@itemprop="ticketAggregate"]/text()')
        info.add_xpath('intrest', 'div[@class="interest-attend pl"]/span[1]/text()')
        info.add_xpath('join', 'div[@class="interest-attend pl"]/span[3]/text()')

        yield loader.load_item()

With custom_settings we bind tongcheng_pipeline_json to this spider, so the other pipelines listed in settings.py are not invoked for it.
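
The override can be pictured as layers of settings applied from lowest to highest precedence. The sketch below is a simplified model under that assumption, not Scrapy's actual settings machinery; resolve and the layer names are hypothetical. It treats ITEM_PIPELINES as one value that the higher layer replaces as a whole, which matches the behaviour described in this post.

```python
# Lowest precedence first; each later layer overrides earlier ones key by key
layers = [
    ('project', {
        'ITEM_PIPELINES': {
            'douban.pipelines.DoubanPipeline': 300,
            'douban.pipelines.movieTop250_crawlspider_json': 200,
            'douban.pipelines.tongcheng_pipeline_json': 100,
        },
        'MONGO_DB': 'douban',
    }),
    ('spider', {
        'ITEM_PIPELINES': {
            'douban.pipelines.tongcheng_pipeline_json': 300,
        },
    }),
]


def resolve(layers):
    # Build the effective settings by applying each layer in turn
    effective = {}
    for _layer_name, values in layers:
        effective.update(values)
    return effective


effective = resolve(layers)
print(effective['ITEM_PIPELINES'])
print(effective['MONGO_DB'])
```

Keys that the spider layer does not mention, such as MONGO_DB here, still come from the project settings.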

Exporting to MongoDB

Since this is only a test, we install and run MongoDB with Docker.

1. Install MongoDB with Docker

# Search for the image
sudo docker search mongo
# Pull the image
sudo docker pull mongo
# Start MongoDB: map container port 27017 to host port 27017 and mount a local data directory at /data/db inside the container
sudo docker run --name scrapy-mongodb -p 27017:27017 -v /home/yanggd/docker/mongodb:/data/db -d mongo
# Connect to MongoDB from the local machine
sudo docker run -it mongo mongo --host 10.11.2.102

2. Add the database connection parameters to the configuration file

vim ../settings.py
# Append at the end of the file
MONGO_HOST = '10.11.2.102'
MONGO_PORT = 27017
MONGO_DB = 'douban'

3. Define the pipeline

vim pipelines.py
import pymongo

class tongcheng_pipeline_mongodb(object):
    # Name of the collection the items are stored in
    mongo_collection = "tongcheng"

    def __init__(self, mongo_host, mongo_port, mongo_db):
        self.mongo_host = mongo_host
        self.mongo_port = mongo_port
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawl):
        # Read the connection parameters defined in settings.py
        return cls(
            mongo_host=crawl.settings.get("MONGO_HOST"),
            mongo_port=crawl.settings.get("MONGO_PORT"),
            mongo_db=crawl.settings.get("MONGO_DB")
        )

    def open_spider(self, spider):
        # Connect when the spider is opened
        self.client = pymongo.MongoClient(self.mongo_host, self.mongo_port)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Disconnect when the spider is closed
        self.client.close()

    def process_item(self, item, spider):
        # Store a plain-dict copy of the item
        tongchenginfo = dict(item)
        self.db[self.mongo_collection].insert_one(tongchenginfo)
        return item
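
The from_crawler class method is how the pipeline receives the values from settings.py. The sketch below exercises that pattern without Scrapy or a MongoDB server: FakeCrawler and SettingsSketchPipeline are hypothetical stand-ins, and a plain dict plays the role of the settings object because only its get method is used.

```python
class FakeCrawler(object):
    """Hypothetical stand-in for the crawler object Scrapy passes in."""

    def __init__(self, settings):
        # A plain dict is enough here because only .get() is called on it
        self.settings = settings


class SettingsSketchPipeline(object):
    """Hypothetical pipeline that only stores its connection parameters."""

    def __init__(self, mongo_host, mongo_port, mongo_db):
        self.mongo_host = mongo_host
        self.mongo_port = mongo_port
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Build the pipeline from values looked up in the settings
        return cls(
            mongo_host=crawler.settings.get('MONGO_HOST'),
            mongo_port=crawler.settings.get('MONGO_PORT'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )


crawler = FakeCrawler({
    'MONGO_HOST': '10.11.2.102',
    'MONGO_PORT': 27017,
    'MONGO_DB': 'douban',
})
pipeline = SettingsSketchPipeline.from_crawler(crawler)
print(pipeline.mongo_host, pipeline.mongo_port, pipeline.mongo_db)
```

Keeping the connection parameters in settings.py rather than in the pipeline itself means the same class can be pointed at another server by changing configuration only.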

4. Bind the pipeline

Because the project contains several spiders, we bind the pipeline through custom_settings.

vim tongcheng.py
# Only custom_settings is new; the rest of the spider is unchanged
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule

# Assumes the project package is named douban, as in ITEM_PIPELINES
from douban.items import tongcheng

class TongchengSpider(CrawlSpider):
    name = 'tongcheng'
    allowed_domains = ['douban.com']
    start_urls = ['https://www.douban.com/location/shenzhen/events/week-all']
    custom_settings = {
        'ITEM_PIPELINES': {
            'douban.pipelines.tongcheng_pipeline_mongodb': 300,
        }
    }
    rules = (
        Rule(LinkExtractor(allow=r'start=10')),
        Rule(LinkExtractor(allow=r'https://www.douban.com/event/\d+/'), callback='parse_item'),
    )

    def parse_item(self, response):
        loader = ItemLoader(item=tongcheng(), selector=response)
        info = loader.nested_xpath('//div[@class="event-info"]')
        info.add_xpath('title', 'h1[@itemprop="summary"]/text()')
        info.add_xpath('time', 'div[@class="event-detail"]/ul[@class="calendar-strs"]/li/text()')
        info.add_xpath('address', 'div[@itemprop="location"]/span[@class="micro-address"]/span[@class="micro-address"]/text()')
        info.add_xpath('money', 'div[@class="event-detail"]/span[@itemprop="ticketAggregate"]/text()')
        info.add_xpath('intrest', 'div[@class="interest-attend pl"]/span[1]/text()')
        info.add_xpath('join', 'div[@class="interest-attend pl"]/span[3]/text()')

        yield loader.load_item()

5. Check the database

# Connect to MongoDB from the local machine
sudo docker run -it mongo mongo --host 10.11.2.102
> show dbs
admin   0.000GB
config  0.000GB
douban  0.000GB
local   0.000GB
> use douban
switched to db douban
> show collections
movietop250
tongcheng
> db.tongcheng.find()
{ "_id" : ObjectId("5a6319a76e85dc5a777131d2"), "join" : [ "69 " ], "intrest" : [ "175 " ], "title" : [ "孟京辉戏剧作品《一个陌生女人的来信》深圳站" ], "money" : [ "263元" ], "address" : "深圳  深圳市少年宫剧场 深圳市福田区福中一路市少年宫" }
{ "_id" : ObjectId("5a6319a96e85dc5a777131d3"), "join" : [ "4 " ], "intrest" : [ "11 " ], "title" : [ "2018第五届城市戏剧节 诗·歌·舞变奏三幕剧《木心·人曲》-深圳" ], "money" : [ "93 - 281元" ], "address" : "深圳  南山文体中心剧院 小剧场 深圳市南山区南山大道南山文体中心" }
{ "_id" : ObjectId("5a6319ab6e85dc5a777131d4"), "join" : [ "7 " ], "intrest" : [ "16 " ], "title" : [ "2018第五届城市戏剧节·焦媛X王安忆X茅盾文学奖《长恨歌》-深圳" ], "money" : [ "93 - 469元" ], "address" : "深圳  南山文体中心剧院大剧院 南山大道与南头街交汇处南山文体中心" }
......
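
The stored documents keep join and intrest as one-element lists with trailing spaces, for example [ "69 " ]. If plain integers are preferred, a small normalisation step could be applied to the dict before it is inserted. The sketch below is one possible approach, not part of the original pipeline; normalize_counts and its count_fields parameter are hypothetical names.

```python
def normalize_counts(doc, count_fields=('join', 'intrest')):
    # Return a copy in which one-element lists like ['69 '] become the int 69
    cleaned = dict(doc)
    for field in count_fields:
        values = cleaned.get(field)
        if values:
            cleaned[field] = int(values[0].strip())
    return cleaned


# Sample shaped like the documents shown above
sample = {'join': ['69 '], 'intrest': ['175 '], 'money': ['263元']}
cleaned = normalize_counts(sample)
print(cleaned)
```

The function copies the dict first, so the item that process_item returns to later pipelines is left unchanged.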
