
Learning the Python crawler framework Scrapy: downloading images

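2016-07-26
Summary: I used to scrape page data with requests together with bs4. I then found that Python's Scrapy crawler framework makes this very convenient, so I am recording what I learned here.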

Documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/images.html

Worked example:
Goal: download the product images on the page http://www.hlhua.com/

Following the documentation, first create an item to hold the image data. For ImagesPipeline to take effect, the item needs a field named image_urls:
items.py

import scrapy


class MyItem(scrapy.Item):
    image_urls = scrapy.Field()
    image_paths = scrapy.Field()
    images = scrapy.Field()


Subclass ImagesPipeline to write your own image pipeline
pipelines.py

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MyImageDownloadPipeLine(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Schedule one download request per URL listed on the item.
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results holds one (success, info) tuple per request;
        # keep the storage path of each successful download.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

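The overridden item_completed stores the image_paths field on the item once the downloads have finished; an item with no successfully downloaded image is dropped.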
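
To make the list comprehension in item_completed concrete, here is a minimal sketch of the data it receives. According to the Scrapy documentation, results is a list of (success, info) tuples, one per request, where info is a dict with url, path and checksum keys on success and a failure object otherwise. The URLs, paths and checksums below are made-up sample values, and a plain Exception stands in for the Twisted Failure that Scrapy would actually pass.

```python
# Mock of the `results` argument passed to item_completed.
# All values are invented for illustration; Scrapy passes a Twisted Failure
# (not a plain Exception) as the second element when a download fails.
results = [
    (True, {"url": "http://example.com/a.jpg",
            "path": "full/aaa.jpg",
            "checksum": "0123abcd"}),
    (False, Exception("download failed")),
    (True, {"url": "http://example.com/b.jpg",
            "path": "full/bbb.jpg",
            "checksum": "4567ef01"}),
]


def collect_image_paths(results):
    """Return the stored paths of the downloads that succeeded."""
    return [info["path"] for ok, info in results if ok]


image_paths = collect_image_paths(results)
print(image_paths)
```

The failed entry is skipped because its first element is False, so only the two successful paths remain. With no successful entries the list is empty, which is the case the pipeline turns into a DropItem.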
Edit settings.py to enable MyImageDownloadPipeLine
settings.py

# coding=utf-8
BOT_NAME = 'imagedemo'

SPIDER_MODULES = ['imagedemo.spiders']
NEWSPIDER_MODULE = 'imagedemo.spiders'

# Enable the custom image pipeline
ITEM_PIPELINES = {'imagedemo.pipelines.MyImageDownloadPipeLine': 1}
# Directory where the downloaded image files are saved
IMAGES_STORE = 'image'

ROBOTSTXT_OBEY = True
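
The images pipeline documentation linked above also describes a few optional settings that are not used in this demo. The fragment below is only a sketch of how they might be added to settings.py; the setting names come from the Scrapy documentation, while the values are illustrative assumptions rather than anything from the original project. Note that ImagesPipeline needs the Pillow library installed to process images.

```python
# Optional ImagesPipeline settings (illustrative values, not part of the demo).

# Skip re-downloading images fetched within the last 90 days.
IMAGES_EXPIRES = 90

# Also generate thumbnails; each key becomes a sub-directory name.
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Ignore images smaller than these dimensions, in pixels.
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
```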


Write the spider that implements the crawl logic
spider.py

# coding=utf-8
from scrapy.spiders import Spider
from imagedemo.items import MyItem


class ImageSpider(Spider):
    name = 'hlhua'
    start_urls = ['http://www.hlhua.com/']

    def parse(self, response):
        # Debugging aid: uncomment to inspect the response in a Scrapy shell
        # (requires `from scrapy.shell import inspect_response`).
        # inspect_response(response, self)
        images = []
        # Build one item per product image found on the page.
        for each in response.xpath("//img[@class='goodsimg']/@src").extract():
            m = MyItem()
            m['image_urls'] = [each]
            images.append(m)
        return images
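
The spider passes each src value to the pipeline unchanged, which works when the page uses absolute image URLs. If a page used relative paths instead, they would need to be resolved against the page URL first; inside a spider, response.urljoin(src) does this. The sketch below shows the same resolution with the standard library only. The src values are hypothetical and are not taken from the hlhua.com page.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.hlhua.com/"

# Hypothetical src attribute values, as an XPath like //img/@src might return.
srcs = [
    "/images/goods/1.jpg",                 # root-relative path
    "images/goods/2.jpg",                  # path relative to the page
    "http://img.example.com/goods/3.jpg",  # already absolute: left unchanged
]

absolute_urls = [urljoin(PAGE_URL, src) for src in srcs]
for url in absolute_urls:
    print(url)
```

Resolving every URL this way is harmless for absolute URLs, so it can be applied unconditionally before filling image_urls.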


Run scrapy crawl hlhua -o images.json to download the images into image/full/ and to generate images.json, which records the image information for each item.
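
The file names under image/full/ are not taken from the original URLs. By default, ImagesPipeline names each file after the SHA-1 hash of its request URL and saves it with a .jpg extension. The sketch below reproduces that naming with the standard library, as an approximation of the default behaviour rather than Scrapy's actual code; the helper name default_image_path and the URL are hypothetical.

```python
import hashlib


def default_image_path(url):
    """Approximate the relative path ImagesPipeline uses by default."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{}.jpg".format(image_guid)


# Hypothetical image URL, used only to show the shape of the result.
path = default_image_path("http://www.hlhua.com/images/goods/1.jpg")
print(path)
```

Because the name depends only on the URL, the same image URL always maps to the same file, and the path stored in image_paths can be matched back to its source.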

github: https://github.com/chenglp1215/scrapy_demo/tree/master/imagedemo