
[Scrapy] Requests and Responses: Study Notes, Part 5

2017-01-01 00:00

Scrapy uses Request and Response objects to crawl websites.

Request objects

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

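A Request object usually represents an HTTP request. It is generated in a spider and executed by the Downloader, which produces a Response.

Parameters:

url (string): the URL of this request.

callback (callable): the function that will be called with the response of this request (once it is downloaded) as its first parameter.

method (string): the HTTP method of this request. Defaults to 'GET'.

meta (dict): the initial values for the Request.meta attribute. If given, the dict passed in this parameter is shallow copied.

body (str or unicode; bytes or str on Python 3): the request body.

headers (dict): the headers of this request. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers). If None is passed as a value, that HTTP header is not sent at all.

cookies (dict or list): the request cookies. They can be given in two forms.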

**Using a dict:**

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})

**Using a list of dicts:**

request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])

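The latter form lets you customize the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.

When a site returns cookies (in a response), they are stored for that domain and sent again in future requests. This is the typical behaviour of a regular web browser. If you want to avoid merging with the existing cookies, set the dont_merge_cookies key to True in Request.meta. Cookie handling can also be switched off entirely through the COOKIES_ENABLED setting.

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})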

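encoding (string): the encoding of this request (defaults to 'utf-8').

priority (int): the priority of this request (defaults to 0). The scheduler uses it to decide the processing order: requests with a higher priority value are executed earlier.

dont_filter (boolean): indicates that this request should not be filtered by the scheduler. Use it when you want to perform an identical request multiple times and bypass the duplicates filter. Use it with care, or you may end up in a crawling loop. Defaults to False.

errback (callable): the function that will be called if an exception is raised while processing the request.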
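To make the priority and dont_filter parameters concrete, here is a toy scheduler built only on the Python standard library. It is a sketch, not Scrapy's scheduler: the TinyScheduler class, its method names, the use of the raw URL as the duplicate key, and the first-in-first-out order among equal priorities are all assumptions made for illustration.

```python
import heapq
import itertools


class TinyScheduler:
    """Toy request queue (hypothetical; not Scrapy's real scheduler)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker for equal priorities

    def enqueue(self, url, priority=0, dont_filter=False):
        """Queue a URL; return False if the duplicate filter dropped it."""
        if not dont_filter and url in self._seen:
            return False
        self._seen.add(url)
        # heapq pops the smallest tuple, so negate: higher priority pops first.
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def next_url(self):
        """Pop the next URL to download, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = TinyScheduler()
results = [
    scheduler.enqueue("http://www.example.com/a"),
    scheduler.enqueue("http://www.example.com/b", priority=10),
    scheduler.enqueue("http://www.example.com/a"),                    # duplicate, dropped
    scheduler.enqueue("http://www.example.com/a", dont_filter=True),  # bypasses the filter
]
order = [scheduler.next_url() for _ in range(3)]
```

In this sketch the second enqueue of the same URL is rejected unless dont_filter=True is passed, and the URL queued with priority=10 is handed out before the ones with the default priority.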

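Passing additional data to callback functions

The callback of a request is the function that is called when the response to that request has been downloaded. It receives the downloaded Response object as its first argument.
For example:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

In some cases you may want to pass data to the callback so that it is available later, in the second callback. You can use the Request.meta attribute for that.

For example: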

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
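The reason this works is that response.meta is a shortcut to response.request.meta, so whatever the first callback stores on the request is visible in the second. The sketch below imitates that hand-off with plain Python classes. ToyRequest, ToyResponse and toy_download are hypothetical stand-ins, not Scrapy classes, and the "download" is simulated rather than performed.

```python
class ToyRequest:
    """Hypothetical stand-in for a request carrying a meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = dict(meta) if meta else {}  # shallow copy of the given dict


class ToyResponse:
    """Hypothetical stand-in for a response that remembers its request."""

    def __init__(self, request):
        self.url = request.url
        self.request = request

    @property
    def meta(self):
        # Same idea as in Scrapy: a shortcut to the originating request's meta.
        return self.request.meta


def toy_download(request):
    """Pretend to download the request, then invoke its callback."""
    return request.callback(ToyResponse(request))


def parse_page1(url):
    item = {"main_url": url}
    return ToyRequest("http://www.example.com/some_page.html",
                      callback=parse_page2, meta={"item": item})


def parse_page2(response):
    item = response.meta["item"]
    item["other_url"] = response.url
    return item


item = toy_download(parse_page1("http://www.example.com/"))
```

Here item ends up holding both URLs, mirroring the two-step spider above.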

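Using errbacks to catch exceptions in request processing

The errback of a request is a function that is called when an exception is raised while processing that request.

It receives a Twisted Failure instance as its first parameter and can be used to track timeouts, DNS errors and similar problems.

For example: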

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
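The errback above relies on failure.check(), which takes one or more exception types and returns the matching type, or None when none matches. The following standard-library sketch imitates that dispatch so it can be tried without Twisted. ToyFailure and describe are hypothetical names, and the standard-library exceptions used here only stand in for the Twisted ones.

```python
import socket


class ToyFailure:
    """Hypothetical, much-reduced imitation of a Twisted Failure."""

    def __init__(self, exception):
        self.value = exception
        self.type = type(exception)

    def check(self, *error_types):
        """Return the first given type the wrapped exception matches, else None."""
        for error_type in error_types:
            if issubclass(self.type, error_type):
                return error_type
        return None


def describe(failure):
    """Map a failure to a short label, most specific checks first."""
    if failure.check(socket.gaierror):
        return "dns"
    if failure.check(TimeoutError, ConnectionRefusedError):
        return "connection"
    return "other"


labels = [
    describe(ToyFailure(socket.gaierror("name lookup failed"))),
    describe(ToyFailure(TimeoutError("timed out"))),
    describe(ToyFailure(ValueError("unexpected"))),
]
```

As in the spider, the order of the checks matters: put the most specific exception types first and fall through to a generic handler.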

Request.meta special keys

The Request.meta attribute can contain arbitrary data, but some special keys are recognized by Scrapy and its built-in extensions.

They are:

dont_redirect

dont_retry

handle_httpstatus_list

handle_httpstatus_all

dont_merge_cookies (see cookies parameter of Request constructor)

cookiejar

dont_cache

redirect_urls

bindaddress

dont_obey_robotstxt

download_timeout

download_maxsize

download_latency

proxy
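Special keys and your own keys live side by side in the same meta dict. The sketch below, which uses no Scrapy code, separates the two groups using the key names listed above. SPECIAL_META_KEYS and split_meta are hypothetical helper names, and item_id is an invented custom key.

```python
# Key names copied from the list above.
SPECIAL_META_KEYS = frozenset({
    "dont_redirect", "dont_retry", "handle_httpstatus_list",
    "handle_httpstatus_all", "dont_merge_cookies", "cookiejar",
    "dont_cache", "redirect_urls", "bindaddress", "dont_obey_robotstxt",
    "download_timeout", "download_maxsize", "download_latency", "proxy",
})


def split_meta(meta):
    """Split a meta dict into (special keys, custom keys)."""
    special = {k: v for k, v in meta.items() if k in SPECIAL_META_KEYS}
    custom = {k: v for k, v in meta.items() if k not in SPECIAL_META_KEYS}
    return special, custom


meta = {
    "item_id": 42,                         # custom data for the callback
    "dont_redirect": True,                 # do not follow redirects
    "handle_httpstatus_list": [301, 302],  # let these statuses reach the callback
    "download_timeout": 10,                # seconds
}
special, custom = split_meta(meta)
```

A dict like meta could be passed as the meta argument of a Request; the helper only illustrates which entries Scrapy itself would interpret.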

FormRequest objects

The FormRequest class inherits from Request.

class scrapy.http.FormRequest(url[, formdata, ...])

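The FormRequest class adds one new argument to the constructor. The remaining arguments are the same as for the Request class and are not listed again here.

Parameters:
**formdata** (dict or iterable of tuples): a dictionary (or an iterable of (key, value) tuples) containing HTML form data, which will be url-encoded and assigned to the body of the request.

In addition to the standard Request methods, FormRequest objects support the following class method: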

classmethod from_response(response[, formname=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

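Returns a new FormRequest object whose form field values are pre-populated with those found in the HTML <form> element contained in the given response. For an example, see the section on using FormRequest.from_response() to simulate a user login.

By default, the policy is to automatically simulate a click on any form control that looks clickable, such as an <input type="submit">. This is convenient and often the desired behaviour, but it can sometimes cause problems that are hard to debug. For example, when a form is filled in or submitted using JavaScript, the default click may not be appropriate. To disable the click, set the dont_click argument to **True**. If you want to change which control is clicked instead of disabling the click, use the clickdata argument.

Parameters:

response (Response object): the response containing the HTML form that will be used to pre-populate the form fields.

formname (string): if given, the form whose name attribute is set to this value will be used.

formxpath (string): if given, the first form that matches the XPath will be used.

formcss (string): if given, the first form that matches the CSS selector will be used.

formnumber (integer): the index of the form to use when the response contains multiple forms. The first one (and the default) is 0.

formdata (dict): fields to override in the form data. If a field is already present in the response's <form> element, its value is overridden by the one passed in this parameter.

clickdata (dict): attributes used to look up the control to click. If it is not given, the form data is submitted simulating a click on the first clickable element.

dont_click (boolean): if True, the form data is submitted without clicking any element.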
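The interplay of formnumber and formdata can be sketched with the standard library alone: collect the fields already present in the chosen form, then let formdata override them. This is not how Scrapy implements from_response(). FormFieldCollector, prefill_form and the PAGE markup are hypothetical, only <input> elements are handled, and submit controls are skipped, whereas from_response() simulates a click by default.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect <input> name/value pairs from one <form> (hypothetical helper)."""

    def __init__(self, formnumber=0):
        super().__init__()
        self.formnumber = formnumber
        self.fields = {}
        self._seen_forms = 0
        self._in_target_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Forms are counted from 0, like the formnumber parameter.
            self._in_target_form = self._seen_forms == self.formnumber
            self._seen_forms += 1
        elif tag == "input" and self._in_target_form:
            # Submit controls are skipped in this sketch for simplicity.
            if attrs.get("name") and attrs.get("type") != "submit":
                self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_target_form = False


def prefill_form(html, formdata=None, formnumber=0):
    """Return the form's existing fields, overridden by formdata."""
    collector = FormFieldCollector(formnumber)
    collector.feed(html)
    fields = dict(collector.fields)
    fields.update(formdata or {})
    return fields


PAGE = """
<form action="/search"><input type="text" name="q" value=""></form>
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username" value="">
  <input type="password" name="password">
  <input type="submit" name="go" value="Log in">
</form>
"""

login_fields = prefill_form(
    PAGE, {"username": "john", "password": "secret"}, formnumber=1)
search_fields = prefill_form(PAGE)
```

The hidden csrf_token field survives untouched while username and password are overridden, which is the main convenience of pre-populating from the response.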

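Request usage examples

Using FormRequest to send data via HTTP POST

If you want to simulate an HTML form POST in your spider and send a couple of key-value fields, you can return a FormRequest object like this: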

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
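To see roughly what "url-encoded" means for the formdata above, the standard library's urlencode can be applied to the same values. This only approximates the request body; Scrapy's own encoding code may differ in details such as byte handling. The tag fields in the second call are invented to show why an iterable of tuples is also accepted.

```python
from urllib.parse import urlencode

# Same fields as in the FormRequest example above.
formdata = {"name": "John Doe", "age": "27"}
body = urlencode(formdata)

# An iterable of (key, value) tuples allows the same key to appear twice,
# which a dict cannot express. These field names are hypothetical.
pairs = [("tag", "python"), ("tag", "scrapy")]
multi_body = urlencode(pairs)

print(body)        # name=John+Doe&age=27
print(multi_body)  # tag=python&tag=scrapy
```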

Using FormRequest.from_response() to simulate a user login

Example:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check that the login succeeded before going on
        # (response.body is bytes on Python 3, so compare against a bytes literal)
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...

Response objects

class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])

Parameters:

url (string) – the URL of this response.

status (integer) – the HTTP status code of the response. Defaults to 200.

headers (dict) – the headers of this response. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers).

body (bytes on Python 3, str on Python 2) – the response body. It must be a byte string, not unicode text, unless you are using an encoding-aware Response subclass such as TextResponse.

flags (list) – a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.

request (Request object) – the initial value of the **Response.request** attribute. This represents the Request that generated this response.

headers

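A dictionary-like object that contains the response headers. Use get() to obtain the first value of a header, or getlist() to obtain all values of a header. For example:

response.headers.getlist('Set-Cookie')  # all Set-Cookie values in the headers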

request

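The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all downloader middlewares.
In particular, this means that:

HTTP redirections cause the original request (to the URL before redirection) to be assigned to the redirected response.

Response.request.url does not always equal Response.url.

This attribute is only available in spider code and in spider middlewares. It is not available in downloader middlewares or in handlers of the **response_downloaded** signal.

urljoin(url)

Constructs an absolute URL by combining the Response's url with a possibly relative URL. It is equivalent to the call below (on Python 3 the function lives in urllib.parse):

urlparse.urljoin(response.url, url)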
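Since response.urljoin(url) delegates to the standard urljoin, its behaviour can be explored without Scrapy. In the sketch below, base plays the role of response.url; all URLs are made-up examples.

```python
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin

# Stands in for response.url (hypothetical page).
base = "http://www.example.com/products/list.html"

sibling = urljoin(base, "item.html")      # relative to the current directory
rooted = urljoin(base, "/about")          # relative to the site root
parent = urljoin(base, "../img/a.png")    # walks up one directory
absolute = urljoin(base, "http://other.example.org/x")  # already absolute

print(sibling)   # http://www.example.com/products/item.html
print(rooted)    # http://www.example.com/about
print(parent)    # http://www.example.com/img/a.png
print(absolute)  # http://other.example.org/x
```

This is why links extracted from a page can be passed through urljoin unconditionally: absolute links come back unchanged and relative ones are resolved against the page URL.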

Response subclasses

TextResponse objects

class scrapy.http.TextResponse(url[, encoding[, ...]])

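TextResponse adds encoding capabilities to the base **Response** class. The base class itself is meant to be used only for binary data, such as images, sounds or other media files.

TextResponse supports a new constructor argument (encoding) in addition to those of the base Response. The remaining functionality is the same as for the Response class.

**TextResponse** supports the following attributes:

text
Response body, as unicode.
The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text many times without extra overhead.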

unicode(response.body) is not a correct way to convert response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.
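The "decode once, then cache" behaviour of text can be sketched with functools.cached_property (Python 3.8+). ToyTextResponse is a hypothetical class, not Scrapy's TextResponse, and the decode_calls counter exists only to make the caching visible.

```python
from functools import cached_property


class ToyTextResponse:
    """Hypothetical stand-in holding raw bytes plus a known encoding."""

    def __init__(self, body, encoding):
        self.body = body          # raw bytes, as received
        self.encoding = encoding  # encoding used to decode the body
        self.decode_calls = 0

    @cached_property
    def text(self):
        # Runs only on first access; the result is then stored on the instance.
        self.decode_calls += 1
        return self.body.decode(self.encoding)


response = ToyTextResponse("café".encode("latin-1"), "latin-1")
first = response.text
second = response.text
```

Decoding with the response's own encoding matters here: the same bytes are not valid UTF-8, so guessing a default encoding would fail or produce garbage.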

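encoding
A string with the encoding of this response. The encoding is resolved by trying the following mechanisms, in order:

1. The encoding passed in the encoding constructor argument.
2. The encoding declared in the Content-Type HTTP header. If this encoding is invalid or unknown, it is ignored and the next resolution mechanism is tried.
3. The encoding declared in the response body. The TextResponse class does not provide any special functionality for this, but the HtmlResponse and XmlResponse classes do.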
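The resolution order above can be sketched as a small standard-library function. It is an approximation for illustration, not Scrapy's implementation: resolve_encoding, its parameters, the regular expressions and the 'utf-8' fallback are all assumptions, and the body check only looks for a charset inside an HTML <meta> tag.

```python
import codecs
import re


def _is_known(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def resolve_encoding(explicit=None, content_type=None, body=b"", default="utf-8"):
    """Try the explicit argument, then the header, then the body, in that order."""
    candidates = [explicit]

    # 2. charset declared in the Content-Type header, if any
    header_match = re.search(r"charset=([\w-]+)", content_type or "", re.I)
    candidates.append(header_match.group(1) if header_match else None)

    # 3. charset declared in the body via an HTML <meta> tag, if any
    body_match = re.search(rb"<meta[^>]+charset=[\"']?([\w-]+)", body, re.I)
    candidates.append(body_match.group(1).decode("ascii") if body_match else None)

    for name in candidates:
        if name and _is_known(name):  # unknown encodings are skipped
            return name
    return default


from_argument = resolve_encoding(
    explicit="latin-1", content_type="text/html; charset=utf-8")
from_body = resolve_encoding(
    content_type="text/html; charset=bogus-enc", body=b'<meta charset="gbk">')
fallback = resolve_encoding()
```

In the second call the header names an unknown encoding, so it is ignored and the declaration in the body wins, matching step 2 of the list above.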

HtmlResponse objects

class scrapy.http.HtmlResponse(url[, ...])

The HtmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the HTML meta http-equiv attribute. See TextResponse.encoding.

XmlResponse objects

class scrapy.http.XmlResponse(url[, ...])

The XmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the XML declaration line. See TextResponse.encoding.
