Scrapy downloader middleware extension: retrying through a proxy on specific response status codes

2018-07-18 18:47 423 views

0. References

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy

1. Main approach

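In practice, when a crawler sends requests too frequently, it is often temporarily redirected to a login page (302) or even refused outright (403). For these responses, the request can be retried once through a proxy: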

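(1) Following the built-in redirect.py module, if dont_redirect, handle_httpstatus_list, or a similar condition applies, pass the response through unchanged.

(2) If condition (1) does not apply and the response status code is 302 or 403, reissue the request through the proxy.

(3) If the status code is still 302 or 403 after going through the proxy, drop the request.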
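The three rules above can be sketched as a small standalone function that does not depend on Scrapy. The function name `decide` and its return labels are hypothetical, chosen only for illustration; the `auto_proxy` meta key is the one the middleware in this post uses to mark a request that has already been retried.

```python
def decide(status, meta, spider_handled=(), proxy_status=(302, 403)):
    """Return 'pass', 'retry_with_proxy', or 'drop' for one response.

    status         -- HTTP status code of the response
    meta           -- the request's meta dict
    spider_handled -- the spider's handle_httpstatus_list, if any
    """
    # Rule (1): the caller asked to handle this response itself.
    if (meta.get('dont_redirect', False)
            or status in spider_handled
            or status in meta.get('handle_httpstatus_list', [])
            or meta.get('handle_httpstatus_all', False)):
        return 'pass'

    # Any other status is not our concern.
    if status not in proxy_status:
        return 'pass'

    # Rule (2): first offence, retry once through the proxy.
    if not meta.get('auto_proxy'):
        return 'retry_with_proxy'

    # Rule (3): still blocked after the proxy retry.
    return 'drop'


print(decide(302, {}))                       # retry_with_proxy
print(decide(403, {'auto_proxy': True}))     # drop
print(decide(302, {'dont_redirect': True}))  # pass
print(decide(200, {}))                       # pass
```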

2. Code

Save as /site-packages/my_middlewares.py

from urllib.parse import urljoin

from w3lib.url import safe_url_string

from scrapy.exceptions import IgnoreRequest


class MyAutoProxyDownloaderMiddleware(object):

    def __init__(self, settings):
        # Status codes that trigger one retry through the proxy
        self.proxy_status = settings.get('PROXY_STATUS', [302, 403])
        # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=proxy#module-scrapy.downloadermiddlewares.httpproxy
        self.proxy_config = settings.get('PROXY_CONFIG', 'http://username:password@some_proxy_server:port')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(settings=crawler.settings)

    # See /site-packages/scrapy/downloadermiddlewares/redirect.py
    def process_response(self, request, response, spider):
        # Same opt-out conditions as the built-in RedirectMiddleware
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        if response.status in self.proxy_status:
            # Resolve the redirect target only for the log message
            if 'Location' in response.headers:
                location = safe_url_string(response.headers['location'])
                redirected_url = urljoin(request.url, location)
            else:
                redirected_url = ''

            # AutoProxy for first time
            if not request.meta.get('auto_proxy'):
                request.meta.update({'auto_proxy': True, 'proxy': self.proxy_config})
                # dont_filter=True so the dupefilter does not drop the retry
                new_request = request.replace(meta=request.meta, dont_filter=True)
                new_request.priority = request.priority + 2

                spider.logger.debug('Will AutoProxy for <{} {}> {}'.format(
                    response.status, request.url, redirected_url))
                return new_request

            # IgnoreRequest for second time
            else:
                spider.logger.warning('Ignoring response <{} {}>: HTTP status code still in {} after AutoProxy'.format(
                    response.status, request.url, self.proxy_status))
                raise IgnoreRequest

        return response
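For reference, a proxy URL of the form used for PROXY_CONFIG carries its credentials inline. The sketch below uses only the standard library to split such a URL into a proxy address and a Basic `Proxy-Authorization` value. It is meant to illustrate the idea only: Scrapy's real handling of the `proxy` meta key lives in HttpProxyMiddleware and may differ in detail. The helper name `split_proxy_url` and the host `proxy.example.com:3128` are made up for this example.

```python
import base64
from urllib.parse import urlsplit, unquote


def split_proxy_url(proxy_url):
    """Split 'scheme://user:pass@host:port' into (address, auth_header_value)."""
    parts = urlsplit(proxy_url)

    # Rebuild the address without the credentials.
    address = '{}://{}'.format(parts.scheme, parts.hostname)
    if parts.port:
        address += ':{}'.format(parts.port)

    # Encode the credentials, if present, as an HTTP Basic value.
    auth = None
    if parts.username:
        user_pass = '{}:{}'.format(unquote(parts.username),
                                   unquote(parts.password or ''))
        auth = 'Basic ' + base64.b64encode(user_pass.encode('latin-1')).decode('ascii')

    return address, auth


address, auth = split_proxy_url('http://username:password@proxy.example.com:3128')
print(address)  # http://proxy.example.com:3128
print(auth)     # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```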

3. Usage

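(1) Add the following to the project's settings.py. Note that the middleware's priority must fall between the default RedirectMiddleware (600) and HttpProxyMiddleware (750), so that its process_response sees a 302 before RedirectMiddleware follows it.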

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'my_middlewares.MyAutoProxyDownloaderMiddleware': 601,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
PROXY_STATUS = [302, 403]
PROXY_CONFIG = 'http://username:password@some_proxy_server:port'
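To see why the priority value matters, recall that Scrapy calls process_request in increasing priority order and process_response in decreasing order. The standalone sketch below simply sorts the three priorities from the settings above. It does not import Scrapy, and the helper names `request_order` and `response_order` are hypothetical.

```python
MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'my_middlewares.MyAutoProxyDownloaderMiddleware': 601,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}


def request_order(middlewares):
    """Class names in the order process_request is called (ascending priority)."""
    ordered = sorted(middlewares.items(), key=lambda item: item[1])
    return [path.rsplit('.', 1)[-1] for path, _ in ordered]


def response_order(middlewares):
    """Class names in the order process_response is called (descending priority)."""
    return list(reversed(request_order(middlewares)))


print(request_order(MIDDLEWARES))
# ['RedirectMiddleware', 'MyAutoProxyDownloaderMiddleware', 'HttpProxyMiddleware']
print(response_order(MIDDLEWARES))
# ['HttpProxyMiddleware', 'MyAutoProxyDownloaderMiddleware', 'RedirectMiddleware']
```

With priority 601, the custom middleware handles a 302 response one step before RedirectMiddleware would, which is what lets it retry through the proxy instead of following the redirect.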

4. Results

2018-07-18 18:42:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)
2018-07-18 18:42:38 [test] DEBUG: Will AutoProxy for <302 http://httpbin.org/status/302> http://httpbin.org/redirect/1
2018-07-18 18:42:43 [test] DEBUG: Will AutoProxy for <403 https://httpbin.org/status/403>
2018-07-18 18:42:51 [test] WARNING: Ignoring response <302 http://httpbin.org/status/302>: HTTP status code still in [302, 403] after AutoProxy
2018-07-18 18:42:52 [test] WARNING: Ignoring response <403 https://httpbin.org/status/403>: HTTP status code still in [302, 403] after AutoProxy

Proxy server log:

squid [18/Jul/2018:18:42:53 +0800] "GET http://httpbin.org/status/302 HTTP/1.1" 302 310 "-" "Mozilla/5.0" TCP_MISS:HIER_DIRECT
squid [18/Jul/2018:18:42:54 +0800] "CONNECT httpbin.org:443 HTTP/1.1" 200 3560 "-" "-" TCP_TUNNEL:HIER_DIRECT
