Learning image downloads with the Python scraping framework Scrapy
2016-07-26 00:00
Abstract: I used to scrape page data with requests and bs4. It turns out Python has a very convenient scraping framework, Scrapy, so I am recording what I learned here.
Documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/images.html
A worked example
Goal: scrape the product images from http://www.hlhua.com/
1. As the documentation describes, first create an item to hold the image data. For ImagesPipeline to take effect, this item needs a field named image_urls:
items.py

```python
import scrapy


class MyItem(scrapy.Item):
    image_urls = scrapy.Field()
    image_paths = scrapy.Field()
    images = scrapy.Field()
```
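Scrapy items behave like dictionaries keyed by their declared fields. As a plain-dict stand-in (the URL is made up for illustration), this is the shape MyItem carries through the pipeline:

```python
# Plain-dict sketch of the item's shape; the URL is illustrative only.
item = {
    'image_urls': ['http://www.hlhua.com/img/example.jpg'],  # set by the spider
    'images': [],        # filled in by ImagesPipeline after download
    'image_paths': [],   # filled in by the custom item_completed below
}
print(item['image_urls'][0])
```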
2. Subclass ImagesPipeline to write your own pipeline:
pipelines.py

```python
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MyImageDownloadPipeLine(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule one download request per image URL on the item.
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Keep only the storage paths of successfully downloaded images.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
```
The overridden item_completed stores the image_paths field on the item once the downloads finish.
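The `results` argument that ImagesPipeline passes to item_completed is a list of (success, info) two-tuples. A minimal offline sketch of the filtering comprehension above, with made-up entries standing in for real download results:

```python
# Made-up stand-in for the `results` list ImagesPipeline passes to
# item_completed: (success, info) tuples, where info is a dict on success.
results = [
    (True,  {'url': 'http://www.hlhua.com/a.jpg', 'path': 'full/a.jpg'}),
    (False, Exception('download failed')),   # failed downloads are skipped
    (True,  {'url': 'http://www.hlhua.com/b.jpg', 'path': 'full/b.jpg'}),
]

# The same comprehension as in item_completed: keep paths of successes only.
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # → ['full/a.jpg', 'full/b.jpg']
```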
3. Edit settings.py to enable MyImageDownloadPipeLine:
settings.py

```python
# coding=utf-8
BOT_NAME = 'imagedemo'

SPIDER_MODULES = ['imagedemo.spiders']
NEWSPIDER_MODULE = 'imagedemo.spiders'

# Enable the image download pipeline
ITEM_PIPELINES = {'imagedemo.pipelines.MyImageDownloadPipeLine': 1}

# Directory where the downloaded image files are stored
IMAGES_STORE = 'image'

ROBOTSTXT_OBEY = True
```
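The images pipeline also supports a few optional settings that this demo does not use; a sketch of the ones documented by Scrapy, with illustrative values rather than anything from the demo project:

```python
# Optional ImagesPipeline settings (values are illustrative, not from the demo).
IMAGES_EXPIRES = 90        # skip re-downloading images fetched within 90 days
IMAGES_MIN_HEIGHT = 110    # silently drop images smaller than 110x110 pixels
IMAGES_MIN_WIDTH = 110
IMAGES_THUMBS = {          # also generate thumbnails under thumbs/<name>/
    'small': (50, 50),
    'big': (270, 270),
}
```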
4. Write the spider that implements the crawl logic:
spider.py

```python
# coding=utf-8
from scrapy.spiders import Spider

from imagedemo.items import MyItem


class ImageSpider(Spider):
    name = 'hlhua'
    start_urls = ['http://www.hlhua.com/']

    def parse(self, response):
        images = []
        # Each product image on the page becomes one item.
        for each in response.xpath("//img[@class='goodsimg']/@src").extract():
            m = MyItem()
            m['image_urls'] = [each]
            images.append(m)
        return images
```
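To show what the XPath selector picks out without hitting the live site, here is an offline sketch using only the standard library; the HTML snippet and URLs are made up to mirror the product-image markup the spider expects:

```python
from xml.etree import ElementTree

# Made-up markup mirroring the page the spider parses: only img tags with
# class="goodsimg" should be selected.
html = """
<html><body>
  <img class="goodsimg" src="http://www.hlhua.com/img/1.jpg"/>
  <img class="banner" src="http://www.hlhua.com/img/banner.jpg"/>
  <img class="goodsimg" src="http://www.hlhua.com/img/2.jpg"/>
</body></html>
"""

root = ElementTree.fromstring(html)
# Equivalent of response.xpath("//img[@class='goodsimg']/@src").extract()
srcs = [img.get('src') for img in root.iter('img')
        if img.get('class') == 'goodsimg']
print(srcs)  # → ['http://www.hlhua.com/img/1.jpg', 'http://www.hlhua.com/img/2.jpg']
```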
Run scrapy crawl hlhua -o images.json: the images are downloaded under image/full/, and an images.json file is generated recording the image information.
github: https://github.com/chenglp1215/scrapy_demo/tree/master/imagedemo