
Saving scraped Douban movie data to MongoDB with Python: a walkthrough

2019-08-14 10:51

Create the Scrapy project douban

scrapy startproject douban

Set up items.py, which declares the fields (names and types) of the data to be saved

# -*- coding: utf-8 -*-
import scrapy


class DoubanItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Content
    content = scrapy.Field()
    # Rating
    rating_num = scrapy.Field()
    # Short description (one-line quote)
    quote = scrapy.Field()

Set up the spider file doubanmovies.py

# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanmoviesSpider(scrapy.Spider):
    name = 'doubanmovies'
    allowed_domains = ['movie.douban.com']
    offset = 0
    url = 'https://movie.douban.com/top250?start='
    start_urls = [url + str(offset)]

    def parse(self, response):
        # print('*'*60)
        # print(response.url)
        # print('*'*60)
        info = response.xpath("//div[@class='info']")
        for each in info:
            # Create a fresh item per movie so yielded items do not share state
            item = DoubanItem()
            item['title'] = each.xpath(".//span[@class='title'][1]/text()").extract()
            item['content'] = each.xpath(".//div[@class='bd']/p[1]/text()").extract()
            item['rating_num'] = each.xpath(".//span[@class='rating_num']/text()").extract()
            item['quote'] = each.xpath(".//span[@class='inq']/text()").extract()
            yield item
            # print(item)
        # Move on to the next page of 25 movies; the last page starts at 225
        self.offset += 25
        if self.offset < 250:
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
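The pagination arithmetic can be checked on its own, without running a crawl. The following is a minimal sketch, assuming the list holds 250 entries at 25 per page (as the step of 25 in the spider suggests); `page_urls` is a hypothetical helper and not part of the project.

```python
# Same URL prefix as the spider's `url` attribute.
BASE_URL = 'https://movie.douban.com/top250?start='


def page_urls(base_url=BASE_URL, page_size=25, total=250):
    """Hypothetical helper: one listing URL per page, start=0, 25, ..., below total."""
    return [base_url + str(offset) for offset in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # 10
print(urls[0])     # https://movie.douban.com/top250?start=0
print(urls[-1])    # https://movie.douban.com/top250?start=225
```

Ten pages cover all 250 entries, so a request with start=250 would only fetch a page with no movies on it.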

Set up the pipeline file (pipelines.py), which uses MongoDB to store the scraped data. This is the key part.

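# -*- coding: utf-8 -*-
import pymongo


class DoubanPipeline(object):
    def __init__(self, host, port):
        self.host = host
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB host and port from settings.py (they could also be
        # written here directly). `from scrapy.conf import settings` is not
        # available in recent Scrapy releases, so use the crawler settings.
        return cls(crawler.settings.get('MONGODB_HOST'),
                   crawler.settings.getint('MONGODB_PORT'))

    def open_spider(self, spider):
        # Create the MongoDB client once per crawl instead of once per item
        self.client = pymongo.MongoClient(self.host, self.port)
        # Select the database douban (MongoDB creates it on the first write)
        self.mydb = self.client['douban']
        # Select the collection doubanmovies inside the database douban
        self.mysheetname = self.mydb['doubanmovies']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Convert the dict-like item into a plain Python dict
        content = dict(item)
        # Insert the document into the collection
        self.mysheetname.insert_one(content)
        return item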
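The pipeline stores dict(item) as-is, and because the spider fills every field with .extract(), each stored value is a list of raw text fragments. If flat values are preferred, the dict could be cleaned before it is inserted. The following is only a sketch: `normalize_movie` is a hypothetical helper that is not part of the original project, and the sample values are made up for illustration.

```python
def normalize_movie(raw):
    """Hypothetical helper: flatten the lists produced by .extract() into plain values."""
    def joined(key):
        # Trim every text fragment of the field, drop empty ones, join the rest.
        parts = (part.strip() for part in raw.get(key, []))
        return ' '.join(part for part in parts if part)

    rating = joined('rating_num')
    return {
        'title': joined('title'),
        'content': joined('content'),
        # Store the rating as a number so it can be sorted and compared.
        'rating_num': float(rating) if rating else None,
        # extract() returns an empty list when nothing matched; keep '' then.
        'quote': joined('quote'),
    }


# Made-up sample in the shape the spider produces (lists of text fragments).
sample = {
    'title': ['Example Title'],
    'content': ['\n    Director: A\n    ', '\n    1994 / Drama\n    '],
    'rating_num': ['9.7'],
    'quote': [],
}
doc = normalize_movie(sample)
print(doc['content'])     # Director: A 1994 / Drama
print(doc['rating_num'])  # 9.7
```

Inside process_item this would be called as normalize_movie(dict(item)) before the insert.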

Set up settings.py

# -*- coding: utf-8 -*-
BOT_NAME = 'douban'
SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
# MongoDB connection settings
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
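The host and port above are hard-coded for a local MongoDB instance. One possible variation, shown as a sketch rather than something the original project does, is to let environment variables override them; the environment variable names used here are an assumption.

```python
import os

# Assumed variable names; fall back to the tutorial's local defaults when unset.
MONGODB_HOST = os.environ.get('MONGODB_HOST', '127.0.0.1')
# Environment values are strings, so convert the port to an integer.
MONGODB_PORT = int(os.environ.get('MONGODB_PORT', '27017'))
```

With neither variable set, this yields the same values as the plain assignments.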

Test from the terminal

scrapy crawl doubanmovies

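A note on formatting: do cnblogs code snippets really need four-space indentation? In my experience, four spaces were the only thing that fixed the indentation of the code blocks above.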

That is all for this article; I hope it helps with your learning.

