
Study Notes: Python Web Crawling with Scrapy (Part 2)

2018-06-01 10:38

The previous post gave a brief introduction to the components of a Scrapy crawler and how it works.

This post builds on it by creating a simple Scrapy crawler that scrapes the Douban drama movie ranking list.

The demo is completed in a few steps below (environment: Python 2.7, PyCharm, Windows 10).

Step 1: Find the URL to crawl:

https://movie.douban.com/j/chart/top_list?type=11&interval_id=100:90&start=0&limit=20

Then create the project.

1. Win+R to open a command prompt, change into the directory that will hold the project, and run: scrapy startproject <project name>

The project layout after creation:
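As a sketch, scrapy startproject douban generates roughly the following layout (this is the standard Scrapy template; the project name douban is taken from the code later in this post):

douban/
    scrapy.cfg            # project configuration file
    douban/
        __init__.py
        items.py          # item definitions (DoubanItem below)
        middlewares.py
        pipelines.py      # item pipelines (DoubanPipeline below)
        settings.py       # project settings
        spiders/
            __init__.py   # spider modules live in this package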
2. With the project created, generate the spider file.


PyCharm has a built-in terminal that can stand in for the system console, so the command can be run directly from the IDE.

Change into the spiders directory and run: scrapy genspider doubanspider movie.douban.com (genspider takes the spider name followed by the domain it will crawl).

The result is shown below:


Step 1 is complete.

Step 2: Analyze the page. Use scrapy shell <URL> to fetch the page content and inspect it.
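A minimal sketch of that inspection, assuming the JSON endpoint found in the analysis below (the field names are the ones the spider reads later):

scrapy shell "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start=0&limit=20"

# Inside the shell the response body is JSON rather than HTML, so instead of
# CSS/XPath selectors it can be loaded with the json module:
>>> import json
>>> data = json.loads(response.text)   # expect a list of dicts, one per movie
>>> data[0].keys()                     # peek at which fields are available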

Step 3: Write the Spider class.


name is the spider's name. We can change it, but it must be unique within the project, because it is what scrapy crawl uses to select the spider.

allowed_domains limits the domains the spider is allowed to crawl.

start_urls holds the links the crawl starts from. It is just a list, so we can set several starting links and they will be crawled in turn.

parse overrides Spider's default parsing callback and is used to process the downloaded responses.
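For reference, the file that scrapy genspider doubanspider movie.douban.com generates looks roughly like this; it is only the standard template, which the full listing later in this post then fills in:

import scrapy


class DoubanspiderSpider(scrapy.Spider):
    name = 'doubanspider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        pass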

Analysis:

  https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85&type=11&interval_id=100:90&action=

    This is the page's URL. As we drag the scrollbar the URL does not change, yet the data keeps refreshing, so press F12 and watch the network panel.

When the scrollbar moves and more data is loaded:


we discover the real request address, which is exactly the URL given at the beginning of this post.
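Each element of the JSON array returned by that endpoint carries the fields the spider reads below; an abridged, illustrative sample (values are placeholders, key names are taken from the spider code):

# one element of the returned JSON list (illustrative values only)
{
    "rank": 1,
    "title": "...",
    "score": "...",
    "is_playable": True,
    "release_date": "...",
    "types": ["..."],
    "regions": ["..."],
    "cover_url": "https://..."
}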

The code is attached below.

The Spider class:
import scrapy
from douban.items import DoubanItem
import json


class DoubanspiderSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    offset = 0
    url = 'https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start='
    start_urls = ["https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start=0&limit=20"]

    def parse(self, response):
        # the endpoint returns a JSON array, one dict per movie
        data = json.loads(response.text)
        for each in data:
            doubanitem = DoubanItem()
            doubanitem['title'] = each['title']
            doubanitem['score'] = each['score']
            doubanitem['is_playable'] = each['is_playable']
            doubanitem['release_date'] = each['release_date']
            doubanitem['rank'] = each['rank']
            doubanitem['types'] = each['types']
            doubanitem['regions'] = each['regions']
            doubanitem['detail_url'] = each['cover_url']
            yield doubanitem
        # page through the ranking 20 entries at a time, up to offset 550
        if self.offset < 550:
            self.offset += 20
            yield scrapy.Request(self.url + str(self.offset) + '&limit=20',
                                 callback=self.parse)
The Item class (items.py):
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()          # movie title
    score = scrapy.Field()          # rating
    is_playable = scrapy.Field()    # whether it can be watched online
    release_date = scrapy.Field()   # release date
    rank = scrapy.Field()           # ranking
    types = scrapy.Field()          # genres
    regions = scrapy.Field()        # countries/regions
    detail_url = scrapy.Field()     # detail URL (filled with cover_url above)
The settings file (settings.py):
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douban.middlewares.DoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douban.middlewares.DoubanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

The pipeline class (pipelines.py):
import json


class DoubanPipeline(object):
    def __init__(self):
        self.filename = open('douban.json', 'w')

    def process_item(self, item, spider):
        # serialize each item as one JSON object per line
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        # the hook must be named close_spider for Scrapy to call it
        self.filename.close()
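With the project and spider name above, the crawl can then be started from the project root. A sketch of the commands (the -o flag is Scrapy's built-in feed export, an alternative to the hand-written pipeline; the output filename is just an example):

scrapy crawl douban                      # uses the pipeline above, writes douban.json
scrapy crawl douban -o douban_feed.json  # or let Scrapy export the items itself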




