
How to use a proxy for web scraping with Python Scrapy

2016-05-01 00:22
http://www.sharejs.com/codes/python/8309

1. Create a file named "middlewares.py" in your Scrapy project:

# Importing the base64 library because we'll need it ONLY if the proxy
# we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up HTTP Basic authentication for the proxy
        # (base64.b64encode replaces the deprecated base64.encodestring,
        # which appended a trailing newline that corrupts the header)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

# This snippet comes from: http://www.sharejs.com/codes/python/8309
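If you have more than one proxy, the same hook can rotate them per request. Below is a minimal sketch, assuming you maintain your own pool; the PROXY_LIST name and the addresses in it are placeholders, not part of the original post:

import random

# Hypothetical proxy pool; replace with your own working proxies
PROXY_LIST = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Pick a different proxy for each request so traffic is spread
        # across the pool instead of always using a single exit IP
        request.meta['proxy'] = random.choice(PROXY_LIST)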
2. Add the following to the project settings file (./project_name/settings.py):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
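The order values matter: middlewares with lower numbers run earlier in process_request, so our ProxyMiddleware (100) sets request.meta['proxy'] before Scrapy's built-in HttpProxyMiddleware (110) applies it. Also note the scrapy.contrib path above is from Scrapy 0.x; on Scrapy 1.0 and later the built-in middleware moved, so the equivalent settings would be:

# Equivalent settings for Scrapy >= 1.0, where scrapy.contrib was renamed;
# 'project_name' is a placeholder for your actual project module
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}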
Just two steps, and requests now go through the proxy. Let's test it ^_^

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request


class TestSpider(CrawlSpider):
    name = "test"
    domain_name = "whatismyip.com"
    # The following url is subject to change; you can get the latest one from here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        # Save the fetched page so you can inspect which IP the server saw
        open('test.html', 'wb').write(response.body)

# This snippet comes from: http://www.sharejs.com/codes/python/8309
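To check which IP the target actually sees, you can also point a spider at a service that echoes the caller's address. A sketch using httpbin.org/ip, under the assumption that the service is reachable from your network and your proxy allows plain HTTP:

import scrapy

class IPCheckSpider(scrapy.Spider):
    name = "ipcheck"
    # httpbin.org/ip returns the requesting IP address as a small JSON document
    start_urls = ["http://httpbin.org/ip"]

    def parse(self, response):
        # If the proxy is working, this logs the proxy's IP, not your own
        self.logger.info("Exit IP response: %s", response.text)

Run it with "scrapy crawl ipcheck" and compare the logged address with your real IP.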