About the Scrapy crawler framework

2017-07-02 09:48

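I. Pick a website

Suppose we want to extract the url, name, description, and size of every file added today on the Mininova site.

The URL is http://www.mininova.org/today

II. Define the data

Define the data to be scraped with Scrapy Items.

Example (for BT files; BT stands for BitTorrent):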

[Python]

from scrapy.item import Item, Field


class TorrentItem(Item):
    # One Field per value we want to extract for each torrent
    url = Field()
    name = Field()
    description = Field()
    size = Field()

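III. Write the spider

1. View the source of the start URL.

2. Find the pattern in the URLs. (Example: every file URL has the form http://www.mininova.org/tor/ followed by a number, so the regular expression "/tor/\d+" can extract them all.)

3. Build XPath expressions to select the data we need: name, description, and size.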
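
The URL pattern from step 2 can be tried out with Python's standard `re` module before wiring it into a spider. This is only a minimal sketch: the listing fragment and the torrent numbers in it are made up for illustration, and a real spider would leave link extraction to Scrapy.

```python
import re

# Pattern from step 2: "/tor/" followed by one or more digits.
TORRENT_LINK = re.compile(r"/tor/\d+")

# Hypothetical listing-page fragment; the hrefs are invented for illustration.
sample_html = """
<a href="/tor/2657665">Darwin - The Evolution Of An Exhibition</a>
<a href="/cat/4">Movies</a>
<a href="/tor/2657666">Another file</a>
"""

# Collect every href value, then keep only those matching the pattern.
hrefs = re.findall(r'href="([^"]+)"', sample_html)
torrent_paths = [h for h in hrefs if TORRENT_LINK.search(h)]

# Turn the relative paths into absolute URLs on the site.
torrent_urls = ["http://www.mininova.org" + p for p in torrent_paths]
print(torrent_urls)
```

The category link `/cat/4` is dropped because it does not match the pattern, which is the same filtering the spider's rule relies on.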

[HTML source]

<h1>Darwin - The Evolution Of An Exhibition</h1>
<h2>Description:</h2>

<div id="description">
Short documentary made for Plymouth City Museum and Art Gallery regarding the setup of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.

...
<div id="specifications">
<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> > <a href="/sub/35">Documentary</a>
</p>
<p>
<strong>Total size:</strong>
150.62 megabyte</p>

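From the source above, the name is inside the <h1> tag.

Its XPath expression is: //h1/text()

The description is inside the div tag with id="description".

Its XPath expression is: //div[@id='description']

The size is in the second p tag inside the div tag with id="specifications", as the second text node of that p (the first is the whitespace before <strong>).

Its XPath expression is: //div[@id='specifications']/p[2]/text()[2]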
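
To see why the size needs the second text node, the same three lookups can be approximated with the standard library's `xml.etree.ElementTree`. This is a sketch under assumptions: the fragment below is a shortened, well-formed copy of the page source (ElementTree needs valid XML), and because ElementTree's limited XPath has no `text()` step, the text nodes are read through `.text` and `.tail` instead.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the page fragment (closing tags added by hand).
PAGE = """
<html><body>
<h1>Darwin - The Evolution Of An Exhibition</h1>
<h2>Description:</h2>
<div id="description">
Short documentary made for Plymouth City Museum and Art Gallery.
</div>
<div id="specifications">
<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> &gt; <a href="/sub/35">Documentary</a>
</p>
<p>
<strong>Total size:</strong>
150.62 megabyte</p>
</div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# //h1/text()
name = root.find(".//h1").text

# //div[@id='description'] (here only its text, with whitespace trimmed)
description = root.find(".//div[@id='description']").text.strip()

# //div[@id='specifications']/p[2]/text()[2]
# The first text node of the p is the whitespace before <strong>;
# the second is the text that follows </strong>, exposed as strong.tail.
size_p = root.find(".//div[@id='specifications']/p[2]")
size = size_p.find("strong").tail.strip()

print(name, description, size, sep="\n")
```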

Finally, the spider code is as follows (Python):

# Import paths for Scrapy 1.0 and later
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# TorrentItem is the Item class defined earlier; import it from
# wherever your project keeps its items.


class MininovaSpider(CrawlSpider):

    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # Follow every link matching /tor/<digits> and pass the page to parse_torrent
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        # .extract() returns a list of strings, one per matching node
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent

IV. Run the spider to extract the data

Save the scraped data to the file scraped_data.json in JSON format:

scrapy crawl mininova -o scraped_data.json -t json

This uses feed exports to generate the JSON file.

[Scrapy ships with feed exports, which support multiple serialization formats and storage backends.]
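
The same feed can also be declared in the project settings instead of on the command line. The fragment below is a sketch that assumes Scrapy 2.1 or later, where the `FEEDS` setting is available; it is not part of the original walkthrough, and the file name simply mirrors the command above.

```python
# settings.py (sketch): a feed equivalent to "-o scraped_data.json", assuming Scrapy 2.1+
FEEDS = {
    "scraped_data.json": {
        "format": "json",    # serialization format
        "encoding": "utf8",  # write non-ASCII text as-is instead of \u escapes
    },
}
```

With this in place, running `scrapy crawl mininova` alone writes the feed, and other formats or storage backends can be added as further keys.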

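V. Review the scraped data

Note that selectors return lists, so every field filled with .extract() is stored as a list of strings, even when there is only one match.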
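
Because of this, a consumer of the feed usually has to unwrap each field. The sketch below shows one way to do that with the standard `json` module. The feed text is illustrative only: its values come from the HTML sample above, the torrent number in the URL is made up, and `first` is a hypothetical helper, not a Scrapy function.

```python
import json

# Illustrative feed contents shaped like the spider's output (not a real crawl).
feed_text = """
[
  {
    "url": "http://www.mininova.org/tor/2657665",
    "name": ["Darwin - The Evolution Of An Exhibition"],
    "size": ["150.62 megabyte"]
  }
]
"""


def first(values, default=None):
    """Return the first extracted string, stripped, or default if the list is empty."""
    return values[0].strip() if values else default


items = json.loads(feed_text)

# url is already a plain string (response.url); the other fields are lists.
flat = [
    {"url": it["url"], "name": first(it["name"]), "size": first(it["size"])}
    for it in items
]
print(flat)
```

Inside the spider itself, Scrapy's `extract_first()` can be used in place of `extract()` to store a single string per field from the start.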
