[Study Notes] Scraping Forum Images with Scrapy

2017-04-03 10:02

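I am a programming beginner, not even at entry level yet, so this post is only a record of my learning Scrapy. The code draws on many articles by more experienced people online, and I am grateful to them for sharing so generously.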

This post relies most heavily on http://blog.csdn.net/zhu_free/article/details/49176777. Thanks to @zhu_free.

I will skip installing and configuring Scrapy. My environment is:

@debian:~/yesky$ scrapy version -v

Scrapy : 1.3.3

lxml : 3.7.3.0

libxml2 : 2.9.3

cssselect : 1.0.1

parsel : 1.1.0

w3lib : 1.17.0

Twisted : 17.1.0

Python : 2.7.9 (default, Jun 29 2016, 13:08:31) - [GCC 4.9.2]

pyOpenSSL : 16.2.0 (OpenSSL 1.0.1t 3 May 2016)

Platform : Linux-3.16.0-4-amd64-x86_64-with-debian-8.7

On to the main topic: using Scrapy to scrape images from a forum, taking http://pic.yesky.com/bbs/forum-22151-1.html as the example.

First, create the project: scrapy startproject yesky

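Enter the project folder and create a spider with the genspider command: scrapy genspider yesky_spider pic.yesky.com (the last argument is a domain, not a full page URL; the forum page URLs are filled into start_urls by hand later)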

For an explanation of the command, see the genspider entry in the Scrapy user manual (page 26 of the PDF), quoted here in the original English:

Syntax: scrapy genspider [-t template] <name> <domain>

Create a new spider in the current folder or in the current project’s spiders folder, if called from inside a project.

The <name> parameter is set as the spider’s name, while <domain> is used to generate the allowed_domains and start_urls spider’s attributes.

[-t template] selects the template. It is optional; if omitted, the spider is created from the default template.
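Since <domain> is meant to be a bare domain, here is a minimal sketch of pulling it out of the forum URL with the Python 3 standard library (on Python 2.7 the import would be from urlparse import urlparse). The helper name domain_of is made up for illustration.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of a URL, suitable as genspider's <domain>."""
    return urlparse(url).netloc


forum_url = "http://pic.yesky.com/bbs/forum-22151-1.html"
print(domain_of(forum_url))  # pic.yesky.com
```

The printed value is what would go into allowed_domains.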

Once this is done, a series of files is generated automatically in the project folder. Start by editing items.py:

import scrapy


class YeskyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

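image_urls holds the URLs of the images to download. I could not explain images when I wrote this; according to the Scrapy documentation, ImagesPipeline fills it in after the download with information about each image (its original URL, storage path, and checksum).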
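To make the two fields more concrete, here is a rough sketch of the shape an item takes once the images have been downloaded. It only mimics the layout with plain dicts and is not Scrapy's own code; the function name, the example URL, and the fake bytes are assumptions. The path layout (full/ plus the SHA-1 of the URL) follows the default ImagesPipeline behaviour as I understand it, and the checksum here is simply an MD5 of the bytes passed in.

```python
import hashlib


def sketch_item_after_download(image_url, image_bytes):
    """Illustrative only: mimic the shape of an item after the image pipeline ran."""
    url_hash = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return {
        "image_urls": [image_url],  # filled in by the spider
        "images": [{                # filled in by the pipeline
            "url": image_url,
            "path": "full/%s.jpg" % url_hash,
            "checksum": hashlib.md5(image_bytes).hexdigest(),
        }],
    }


# Hypothetical URL and content, for illustration only.
item = sketch_item_after_download("http://example.com/photo.jpg", b"fake image data")
print(item["images"][0]["path"])
```

The path is relative to IMAGES_STORE, so the spider only ever needs to set image_urls.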

Edit settings.py. Around line 70, on a new line below #ITEM_PIPELINES =, add:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline':1}
IMAGES_STORE = '/home/tang/yesky/pic'
IMAGES_EXPIRES = 15

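For the ITEM_PIPELINES setting, see the "Downloading and processing files and images" chapter of the Scrapy user manual (page 154 of the PDF). Since we are scraping images, ImagesPipeline is used. The number 1 after it is the priority: values are customarily in the range 0-1000, and pipelines with lower values run first. IMAGES_STORE sets where the images are saved; it must be set, otherwise the pipeline stays disabled. IMAGES_EXPIRES is set here so that files downloaded within the last 15 days are not downloaded again.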
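The priority and expiry rules can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not Scrapy's implementation; needs_download and run_order are hypothetical helper names, and the second pipeline entry with priority 300 is only there to show the ordering.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def needs_download(last_modified, expires_days=15, now=None):
    """True if a stored file is older than expires_days and should be fetched again."""
    now = time.time() if now is None else now
    age_days = (now - last_modified) / SECONDS_PER_DAY
    return age_days > expires_days


def run_order(item_pipelines):
    """Return pipeline names in execution order: lower priority value runs first."""
    return sorted(item_pipelines, key=item_pipelines.get)


now = 1000000000.0  # fixed timestamp so the example is reproducible
print(needs_download(now - 3 * SECONDS_PER_DAY, now=now))   # False
print(needs_download(now - 20 * SECONDS_PER_DAY, now=now))  # True

pipelines = {
    "yesky.pipelines.YeskyPipeline": 300,  # illustrative second entry
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
print(run_order(pipelines))
```

With the settings from this post only ImagesPipeline is enabled, so the ordering matters only once more pipelines are added.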

Write yesky_spider.py:

The thread links on the list page are relative URLs, so they have to be joined with the page URL.
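As a small illustration of that joining step, the standard library's urljoin resolves a relative href against the URL of the page it was found on, which is essentially what response.urljoin does in the spider. The thread href below is a made-up example, not a real thread on the forum; on Python 2.7 the import would be from urlparse import urljoin.

```python
from urllib.parse import urljoin

list_page = "http://pic.yesky.com/bbs/forum-22151-1.html"
relative_href = "thread-123456-1-1.html"  # hypothetical href taken from a list page

full_url = urljoin(list_page, relative_href)
print(full_url)  # http://pic.yesky.com/bbs/thread-123456-1-1.html
```

The last path segment of the list page is replaced by the relative href, so no manual string splitting is needed.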

# -*- coding: utf-8 -*-
import scrapy
from yesky.items import YeskyItem  # the item class defined in items.py


class YeskySpiderSpider(scrapy.Spider):
    name = "yesky_spider"  # spider name, used by "scrapy crawl"
    allowed_domains = ["pic.yesky.com"]  # domains the spider may crawl
    # The forum has 140 list pages, so generate every list-page URL up front
    # (range(1, 141) covers pages 1 to 140). Narrow the range while testing,
    # otherwise the crawl takes a very long time.
    start_urls = ['http://pic.yesky.com/bbs/forum-22151-%s.html' % page
                  for page in range(1, 141)]

    def parse(self, response):
        """Handle a list page: extract the link of every thread on it."""
        # Select the href attribute of every <a> whose href contains "thread"
        # (see the w3cschool XPath tutorial).
        for sel in response.xpath("//a[contains(@href, 'thread')]/@href"):
            detaillink = sel.extract()
            # The href is relative; urljoin resolves it against response.url.
            url = response.urljoin(detaillink)
            # Request the thread page and let parse_item handle the response.
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        """Handle a thread page: extract the image URLs in the post body."""
        # 't_fsz' must be quoted; unquoted, XPath treats it as an element name.
        for link in response.xpath("//div[contains(@class, 't_fsz')]//img/@file"):
            item = YeskyItem()
            item['image_urls'] = [link.extract()]
            yield item

My explanation of this part is surely flawed and unclear, mainly because my own understanding is still shallow. Apologies.
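To show what parse_item is after, here is a rough stand-in that uses the Python 3 standard library's HTMLParser instead of Scrapy's XPath selectors: it collects the file attribute of every img nested inside a div whose class contains t_fsz. This is a swapped-in technique for illustration only; the class name PostImageParser and the sample HTML are assumptions, not the forum's real markup.

```python
from html.parser import HTMLParser


class PostImageParser(HTMLParser):
    """Collect the 'file' attribute of <img> tags inside a post-body <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # <div> nesting depth inside a matching div; 0 means outside
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0 or "t_fsz" in (attrs.get("class") or ""):
                self.depth += 1
        elif tag == "img" and self.depth > 0 and attrs.get("file"):
            self.image_urls.append(attrs["file"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


# Hypothetical markup: the real image URL sits in 'file', 'src' is a placeholder.
sample_html = """
<div class="t_fsz"><div><img src="loading.gif" file="http://example.com/a.jpg"></div></div>
<div class="sidebar"><img file="http://example.com/ad.jpg"></div>
"""

parser = PostImageParser()
parser.feed(sample_html)
print(parser.image_urls)  # ['http://example.com/a.jpg']
```

The image outside the t_fsz div is skipped, which is the reason the spider restricts the selection to the post body.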
For pipelines.py, the auto-generated file is fine and needs no changes:

class YeskyPipeline(object):
    def process_item(self, item, spider):
        return item

Once everything is saved, run scrapy crawl yesky_spider in the project folder, and after a while the downloaded images will appear.
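A quick way to check the result after the crawl is to count the files in the storage folder. The sketch below assumes the default layout, where images land in a full subfolder of IMAGES_STORE as .jpg files; count_downloaded_images is a hypothetical helper name.

```python
import os


def count_downloaded_images(images_store):
    """Count the .jpg files under <images_store>/full; 0 if the folder is missing."""
    full_dir = os.path.join(images_store, "full")
    if not os.path.isdir(full_dir):
        return 0
    return sum(1 for name in os.listdir(full_dir) if name.lower().endswith(".jpg"))


# The path is the IMAGES_STORE value used in settings.py above.
print(count_downloaded_images("/home/tang/yesky/pic"))
```

If the count stays at 0, the IMAGES_STORE path and the pipeline setting are the first things worth checking.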

This is not well written and criticism is welcome. Even so, please credit the source if you repost. Thanks.

Tags: python scrapy debian
