Learning Notes: Python Crawlers with Scrapy, Part 2
The previous post briefly introduced the structure and operating principles of a Scrapy crawler.
This post builds on it by creating a simple Scrapy crawler that scrapes Douban's drama film chart.
The demo is done in four steps (environment: Python 2.7, PyCharm, Windows 10).
Step 1: Find the URL to scrape and create the project
The target URL:
https://movie.douban.com/j/chart/top_list?type=11&interval_id=100:90&start=0&limit=20
1. Press Win+R, open a command prompt, cd to the directory that should hold the project, and run: scrapy startproject <project name>
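After creation, a project named douban has the standard Scrapy layout, roughly:

douban/
    scrapy.cfg            # deploy configuration file
    douban/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spiders live
            __init__.py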
2. With the project created, create the spider file.
PyCharm's built-in terminal can be used here in place of the system console.
Go into the project's spiders directory and run: scrapy genspider doubanspider movie.douban.com (the command needs both a spider name and the domain it is allowed to crawl).
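The command fills in a stock skeleton, roughly the following, with the spider name and allowed domain taken from the command line:

# -*- coding: utf-8 -*-
import scrapy


class DoubanspiderSpider(scrapy.Spider):
    name = 'doubanspider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        pass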
That completes step 1.
Step 2: Analyze the page. Use scrapy shell plus the URL to fetch the page content and inspect it.
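Before writing any spider code, a quick shell session shows what the endpoint returns. A sketch (output omitted); the field names inspected here are the ones the spider below relies on:

scrapy shell "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start=0&limit=20"
>>> import json
>>> data = json.loads(response.text)   # the body is a JSON array, one dict per movie
>>> sorted(data[0].keys())             # inspect the available fields
>>> data[0]['title'], data[0]['rank'], data[0]['score']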
三、开始编写Spider类
name 是爬虫的名字,我们可以随便改。
allowed_domains是爬虫运行爬取的范围
start_urls是开始爬取的链接。只是一个列表,所以我们可以设置多个开始链接,依次爬取
parse是重写的Spider的解析方法,用来处理下载下来的响应
Analysis:
https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85&type=11&interval_id=100:90&action=
This is the page's URL. When we drag the scroll bar, the URL does not change, yet the data keeps refreshing, so press F12 to open the developer tools.
When scrolling triggers a new load, the network panel reveals the real request address, which is exactly the endpoint given at the start of this post.
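The start and limit query parameters drive the paging: start is the offset of the first record and limit is the page size. A minimal sketch of the follow-up URLs the spider below generates, stepping the offset by 20:

# Sketch: the paginated URLs the spider builds while offset < 550.
base = 'https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start='
for offset in range(0, 60, 20):
    print(base + str(offset) + '&limit=20')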
The full code follows.
The spider class (doubanspider.py):
import scrapy
import json
from douban.items import DoubanItem


class DoubanspiderSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    offset = 0
    url = 'https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start='
    start_urls = ["https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start=0&limit=20"]

    def parse(self, response):
        # The endpoint returns a JSON array; each element describes one movie.
        data = json.loads(response.text)
        for each in data:
            doubanitem = DoubanItem()
            doubanitem['title'] = each['title']
            doubanitem['score'] = each['score']
            doubanitem['is_playable'] = each['is_playable']
            doubanitem['release_date'] = each['release_date']
            doubanitem['rank'] = each['rank']
            doubanitem['types'] = each['types']
            doubanitem['regions'] = each['regions']
            doubanitem['detail_url'] = each['cover_url']  # note: the source field is cover_url (the poster image)
            yield doubanitem

        # Request the next page until the offset cap is reached.
        if self.offset < 550:
            self.offset += 20
            yield scrapy.Request(self.url + str(self.offset) + '&limit=20', callback=self.parse)
The item class (items.py):

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()         # movie title
    score = scrapy.Field()         # rating
    is_playable = scrapy.Field()   # whether it can be watched online
    release_date = scrapy.Field()  # release date
    rank = scrapy.Field()          # rank on the chart
    types = scrapy.Field()         # genres
    regions = scrapy.Field()       # countries/regions
    detail_url = scrapy.Field()    # detail URL

The settings file (settings.py); only the settings actually changed are shown, the rest of the file keeps the auto-generated defaults:
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure a delay for requests for the same website (default: 0)
DOWNLOAD_DELAY = 1

# Override the default request headers so requests look like a normal browser:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
The pipeline class (pipelines.py):

import json


class DoubanPipeline(object):
    def __init__(self):
        # One output file for the whole crawl.
        self.filename = open('douban.json', 'w')

    def process_item(self, item, spider):
        # Serialize each item as one JSON object per line.
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text.encode("utf-8"))  # Python 2: encode unicode before writing
        return item

    def close_spider(self, spider):
        self.filename.close()
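Step 4: Run the crawler
From the project root (the directory containing scrapy.cfg), start the spider by the name defined in its class; the pipeline writes the results to douban.json:

cd douban
scrapy crawl douban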