
How to use a proxy for web scraping with Python Scrapy

2016-05-01 00:22
http://www.sharejs.com/codes/python/8309

1. Create a file named "middlewares.py" in your Scrapy project:

# Importing the base64 library because we'll need it ONLY if the proxy
# we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up HTTP Basic authentication for the proxy
        # (base64.b64encode replaces the deprecated base64.encodestring,
        # which appended a trailing newline that corrupts the header)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

# This snippet comes from: http://www.sharejs.com/codes/python/8309
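If you have more than one proxy, the same hook can rotate them per request. Below is a minimal sketch, assuming you maintain your own pool; the PROXY_LIST name and the addresses in it are placeholders, not part of the original post:

import random

# Hypothetical proxy pool; replace with your own working proxies
PROXY_LIST = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Pick a different proxy for each request so traffic is spread
        # across the pool instead of always using a single exit IP
        request.meta['proxy'] = random.choice(PROXY_LIST)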
2. Add the following to the project settings file (./project_name/settings.py):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
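The order values matter: middlewares with lower numbers run earlier in process_request, so our ProxyMiddleware (100) sets request.meta['proxy'] before Scrapy's built-in HttpProxyMiddleware (110) applies it. Also note the scrapy.contrib path above is from Scrapy 0.x; on Scrapy 1.0 and later the built-in middleware moved, so the equivalent settings would be:

# Equivalent settings for Scrapy >= 1.0, where scrapy.contrib was renamed;
# 'project_name' is a placeholder for your actual project module
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}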
Just two steps, and requests now go through the proxy. Let's test it ^_^

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request


class TestSpider(CrawlSpider):
    name = "test"
    domain_name = "whatismyip.com"
    # The following url is subject to change; you can get the latest one from here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        # Save the fetched page so you can inspect which IP the server saw
        open('test.html', 'wb').write(response.body)

# This snippet comes from: http://www.sharejs.com/codes/python/8309
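To check which IP the target actually sees, you can also point a spider at a service that echoes the caller's address. A sketch using httpbin.org/ip, under the assumption that the service is reachable from your network and your proxy allows plain HTTP:

import scrapy

class IPCheckSpider(scrapy.Spider):
    name = "ipcheck"
    # httpbin.org/ip returns the requesting IP address as a small JSON document
    start_urls = ["http://httpbin.org/ip"]

    def parse(self, response):
        # If the proxy is working, this logs the proxy's IP, not your own
        self.logger.info("Exit IP response: %s", response.text)

Run it with "scrapy crawl ipcheck" and compare the logged address with your real IP.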