
A simple Scrapy spider that collects the image URLs of Liu Yifei's photos on Douban

2014-10-04 13:40
I. The first step is to create a Scrapy project

sh-3.2# scrapy startproject liuyifeiImage

sh-3.2# chmod -R 777 liuyifeiImage/


II. Analyze the characteristics of the image pages

1. Handling the pagination URLs:

The spider's start URL is "http://movie.douban.com/celebrity/1049732/photos/?type=C&start=0&sortby=vote&size=a&subtype=a",

the second page is "http://movie.douban.com/celebrity/1049732/photos/?type=C&start=40&sortby=vote&size=a&subtype=a",

and the third page is "http://movie.douban.com/celebrity/1049732/photos/?type=C&start=80&sortby=vote&size=a&subtype=a". The pagination rule for Douban photos is easy to see: the start parameter grows by 40 from one page to the next. We can therefore use a for loop to put the URL of every page into start_urls.

# One URL per page; start advances in steps of 40.
start_urls = []

for i in range(0, 1120, 40):
    start_urls.append('http://movie.douban.com/celebrity/1049732/photos/?type=C&start=%d&sortby=vote&size=a&subtype=a' % i)
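The pagination rule above can also be wrapped in a small helper, which makes it easy to check how many pages the spider will request. This is only a sketch: the names BASE_URL, build_start_urls, total and page_size are hypothetical and do not come from the original spider; the values 1120 and 40 are the ones used in the loop above.

```python
# Sketch only: BASE_URL, build_start_urls, total and page_size are
# hypothetical names introduced for illustration.
BASE_URL = ('http://movie.douban.com/celebrity/1049732/photos/'
            '?type=C&start=%d&sortby=vote&size=a&subtype=a')


def build_start_urls(total=1120, page_size=40):
    """Return one URL per page; `start` advances by page_size each time."""
    return [BASE_URL % offset for offset in range(0, total, page_size)]


start_urls = build_start_urls()
print(len(start_urls))   # number of pages the spider will request
print(start_urls[0])     # first page, start=0
print(start_urls[-1])    # last page
```

With these values the list holds 28 URLs, which agrees with the 'downloader/request_count': 28 reported in the crawl statistics.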


2. Extracting the image URLs on each page:

We use the page "http://movie.douban.com/celebrity/1049732/photos/?type=C&start=0&sortby=vote&size=a&subtype=a" for the analysis and inspect the page element of the first image.
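The crawl log shows that the spider selects the images with the XPath //ul/li/div/a/img/@src and stores each result in an item field named ImageAddress. The sketch below illustrates that selection using only the standard library, as a stand-in for the Scrapy selector. It rests on several assumptions: SAMPLE_PAGE is a hypothetical, simplified fragment (the real Douban markup is not reproduced in this article), the example.com URLs are placeholders, and extract_image_addresses is an illustrative name. ElementTree supports only a subset of XPath and cannot select attributes directly, so the src value is read with .get().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one photo-list page.
# The URLs are placeholders, not real Douban addresses.
SAMPLE_PAGE = """
<html><body>
  <ul>
    <li><div><a href="#"><img src="http://example.com/photo/p1.jpg" /></a></div></li>
    <li><div><a href="#"><img src="http://example.com/photo/p2.jpg" /></a></div></li>
  </ul>
</body></html>
"""


def extract_image_addresses(page_source):
    """Collect the src attribute of every <img> found under ul/li/div/a."""
    root = ET.fromstring(page_source.strip())
    # ElementTree has no '/@src' step, so read the attribute from each element.
    return [img.get('src') for img in root.findall('.//ul/li/div/a/img')]


# Mirror the item shape seen in the crawl log: one dict per image address.
items = [{'ImageAddress': url} for url in extract_image_addresses(SAMPLE_PAGE)]
for item in items:
    print(item['ImageAddress'])
```

ElementTree only parses well-formed XML, so this works on the tidy sample above but would not cope with arbitrary real-world HTML; inside a spider the Scrapy selector handles that case.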

sh-3.2# scrapy crawl liuyifei -o image.json -t json

/Users/lsf/PycharmProjects/liuyifeiImage/liuyifeiImage/spiders/liuyifei.py:8: ScrapyDeprecationWarning: liuyifeiImage.spiders.liuyifei.liuyifeiImage inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class liuyifeiImage(BaseSpider):
2014-10-04 12:57:37+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: liuyifeiImage)
2014-10-04 12:57:37+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-10-04 12:57:37+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'liuyifeiImage.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['liuyifeiImage.spiders'], 'FEED_URI': 'image.json', 'BOT_NAME': 'liuyifeiImage'}
2014-10-04 12:57:37+0800 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-10-04 12:57:37+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-10-04 12:57:37+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-10-04 12:57:37+0800 [scrapy] INFO: Enabled item pipelines:
2014-10-04 12:57:37+0800 [liuyifei] INFO: Spider opened
2014-10-04 12:57:37+0800 [liuyifei] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-10-04 12:57:37+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6026
2014-10-04 12:57:37+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6083
2014-10-04 12:57:38+0800 [liuyifei] DEBUG: Crawled (200) <GET http://movie.douban.com/celebrity/1049732/photos/?type=C&start=240&sortby=vote&size=a&subtype=a> (referer: None)
/Users/lsf/PycharmProjects/liuyifeiImage/liuyifeiImage/spiders/liuyifei.py:17: ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, instantiate scrapy.Selector instead.
hxs=HtmlXPathSelector(response)
/Users/lsf/PycharmProjects/liuyifeiImage/liuyifeiImage/spiders/liuyifei.py:18: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
sites=hxs.select('//ul/li/div/a/img/@src').extract()
/Library/Python/2.7/site-packages/Scrapy-0.24.4-py2.7.egg/scrapy/selector/unified.py:106: ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, instantiate scrapy.Selector instead.
for x in result]
2014-10-04 12:57:38+0800 [liuyifei] DEBUG: Scraped from <200 http://movie.douban.com/celebrity/1049732/photos/?type=C&start=240&sortby=vote&size=a&subtype=a> {'ImageAddress': u'http://img3.douban.com/view/photo/raw/public/p2179423125.jpg'}
2014-10-04 12:57:38+0800 [liuyifei] DEBUG: Scraped from <200 http://movie.douban.com/celebrity/1049732/photos/?type=C&start=240&sortby=vote&size=a&subtype=a> {'ImageAddress': u'http://img3.douban.com/view/photo/raw/public/p2179423105.jpg'}
2014-10-04 12:57:38+0800 [liuyifei] DEBUG: Scraped from <200 http://movie.douban.com/celebrity/1049732/photos/?type=C&start=240&sortby=vote&size=a&subtype=a> {'ImageAddress': u'http://img3.douban.com/view/photo/raw/public/p2179423084.jpg'}

...

2014-10-04 13:34:17+0800 [liuyifei] DEBUG: Scraped from <200 http://movie.douban.com/celebrity/1049732/photos/?type=C&start=1040&sortby=vote&size=a&subtype=a> {'ImageAddress': u'http://img3.douban.com/view/photo/raw/public/p958573512.jpg'}
2014-10-04 13:34:17+0800 [liuyifei] DEBUG: Scraped from <200 http://movie.douban.com/celebrity/1049732/photos/?type=C&start=1040&sortby=vote&size=a&subtype=a> {'ImageAddress': u'http://img5.douban.com/view/photo/raw/public/p958572938.jpg'}
2014-10-04 13:34:17+0800 [liuyifei] INFO: Closing spider (finished)
2014-10-04 13:34:17+0800 [liuyifei] INFO: Stored json feed (1120 items) in: image.json
2014-10-04 13:34:17+0800 [liuyifei] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8331,
'downloader/request_count': 28,
'downloader/request_method_count/GET': 28,
'downloader/response_bytes': 221405,
'downloader/response_count': 28,
'downloader/response_status_count/200': 28,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 10, 4, 5, 34, 17, 736723),
'item_scraped_count': 1120,
'log_count/DEBUG': 1150,
'log_count/INFO': 8,
'response_received_count': 28,
'scheduler/dequeued': 28,
'scheduler/dequeued/memory': 28,
'scheduler/enqueued': 28,
'scheduler/enqueued/memory': 28,
'start_time': datetime.datetime(2014, 10, 4, 5, 34, 14, 681268)}
2014-10-04 13:34:17+0800 [liuyifei] INFO: Spider closed (finished)


Below are the JSON file and the txt file:

image.json:



liuyifei_pic_address.txt
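The article does not show how the txt file was produced from the JSON feed. One possible way is sketched below, assuming that image.json is the JSON array written by the -o image.json option and that each item carries the ImageAddress field seen in the crawl log. The function names json_feed_to_lines and write_address_file are hypothetical.

```python
import json


def json_feed_to_lines(feed_text):
    """Turn the text of a JSON feed (an array of items) into a list of addresses."""
    return [item['ImageAddress'] for item in json.loads(feed_text)]


def write_address_file(json_path, txt_path):
    """Read the feed at json_path and write one address per line to txt_path."""
    with open(json_path) as src:
        addresses = json_feed_to_lines(src.read())
    with open(txt_path, 'w') as dst:
        dst.write('\n'.join(addresses) + '\n')
    return len(addresses)


# Small in-memory feed built from two addresses that appear in the crawl log.
sample_feed = json.dumps([
    {'ImageAddress': 'http://img3.douban.com/view/photo/raw/public/p2179423125.jpg'},
    {'ImageAddress': 'http://img3.douban.com/view/photo/raw/public/p2179423105.jpg'},
])
for address in json_feed_to_lines(sample_feed):
    print(address)
```

On the real output, a call such as write_address_file('image.json', 'liuyifei_pic_address.txt') would write one address per line and return the number of addresses written.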
