
Scrapy: A Web Crawling Framework for Python

2015-06-29 00:29
Scrapy makes it easy to collect data from the web: the framework does most of the heavy lifting for you, so you don't have to build a crawler from scratch.

Items are containers that will be loaded with the scraped data.

Spiders are classes that you define and that Scrapy uses to scrape information from a domain. They define an initial list of URLs to download.

Scrapy Engine

The engine is responsible for controlling the data flow between all components of the system, and for triggering events when certain actions occur.

Scheduler

The Scheduler receives requests from the engine and enqueues them, returning them later when the engine asks for them.

Downloader

The Downloader is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to the spiders.


Spiders

Spiders are custom classes written by Scrapy users to parse responses and extract items from them, or additional URLs to follow. Each spider is able to handle a specific domain.

Item Pipeline

The Item Pipeline is responsible for processing the items once they have been extracted by the spiders. Typical tasks include cleansing, validation and persistence.

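As a sketch of the cleansing step, a pipeline is just a class with a `process_item` method; Scrapy calls it for every item the spiders yield. The price field, VAT factor, and class name below are made up for illustration:

```python
class PriceConverterPipeline:
    """Cleans and validates items after the spiders extract them."""

    vat_factor = 1.15  # hypothetical VAT multiplier, for illustration

    def process_item(self, item, spider):
        # Cleansing: normalise the raw price string to a float
        item["price"] = float(str(item["price"]).lstrip("$"))
        item["price"] *= self.vat_factor
        # Validation would raise scrapy.exceptions.DropItem here to
        # discard a bad item; persistence would write it to storage.
        return item
```

The pipeline is enabled by listing it in the `ITEM_PIPELINES` setting; returning the item passes it on to the next pipeline in the chain.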

Downloader middlewares

Downloader middlewares are specific hooks that sit between the Engine and the Downloader and process requests when they pass from the Engine to the Downloader, and responses that pass from Downloader to the Engine. They provide a convenient mechanism for extending Scrapy functionality by plugging custom code.

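A downloader middleware is likewise a plain class, plugged in via the `DOWNLOADER_MIDDLEWARES` setting. This hypothetical example rewrites each outgoing request's User-Agent header before it reaches the downloader (the pool of strings is invented):

```python
import random

class RandomUserAgentMiddleware:
    """Sits between the engine and the downloader and rewrites
    each request on its way out."""

    # A small hypothetical pool of User-Agent strings
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        # Pick a User-Agent per request; returning None lets the
        # request continue through the remaining middlewares.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```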

Spider middlewares

Spider middlewares are specific hooks that sit between the Engine and the Spiders and are able to process spider input and output. They provide a convenient mechanism for extending Scrapy functionality by plugging custom code.

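A spider middleware follows the same plain-class pattern, enabled via the `SPIDER_MIDDLEWARES` setting; `process_spider_output` sees everything a spider yields before it reaches the engine. The class name, field name, and threshold below are hypothetical:

```python
class DropShortItemsMiddleware:
    """Sits between the engine and the spiders and filters
    what the spiders yield."""

    min_text_length = 5  # hypothetical threshold, for illustration

    def process_spider_output(self, response, result, spider):
        # `result` is the iterable of items and requests the spider
        # yielded; pass through only items with enough text.
        for element in result:
            if isinstance(element, dict):
                if len(element.get("text", "")) >= self.min_text_length:
                    yield element
            else:
                # Requests and non-dict objects pass through untouched
                yield element
```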