
Python Crawlers: Configuring a Proxy in Scrapy

2016-10-28 00:00
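When crawling a website, the most common problem is that the site limits how many requests a single IP may make as an anti-scraping measure. The usual workaround is to rotate IPs by sending requests through proxies.

The content below is in two parts: the first part comes from the web, and the second part is the code I wrote for the Damayi (大蚂蚁) proxy service.

###########################Part 1################################################

Here is how to configure a proxy in Scrapy for crawling.

1. Create "middlewares.py" in the Scrapy project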

# base64 is needed ONLY if the proxy we are going to use requires authentication
import base64


# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy.
        # b64encode replaces encodestring, which appended a newline and no longer exists in Python 3.9+
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')
        # The scheme name and the token must be separated by a space
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

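
The Proxy-Authorization value built in step 1 follows the HTTP Basic scheme: the word Basic, one space, then the base64 encoding of USERNAME:PASSWORD. The sketch below isolates that step so it can be checked outside Scrapy. The helper name `basic_proxy_auth` and the credentials `user`/`pass` are made up for illustration and are not part of the original code.

```python
import base64


def basic_proxy_auth(username, password):
    """Return a Proxy-Authorization value for HTTP Basic authentication."""
    credentials = "%s:%s" % (username, password)
    # b64encode works on bytes and, unlike the old encodestring, adds no trailing newline
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token


print(basic_proxy_auth("user", "pass"))  # Basic dXNlcjpwYXNz
```

A stray newline or a missing space in this value is an easy mistake to make and will likely cause the proxy to reject the credentials, so it is worth printing the header once before running a full crawl.
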
2. Add the following to the project settings file (./pythontab/settings.py)

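DOWNLOADER_MIDDLEWARES = {
    # Path for Scrapy 1.0+; older releases used scrapy.contrib.downloadermiddleware.httpproxy
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # The lower number runs first, so the proxy is set before HttpProxyMiddleware sees the request
    'pythontab.middlewares.ProxyMiddleware': 100,
}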
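
The middleware in Part 1 always sends requests through one fixed proxy. To rotate IPs as suggested at the start of this post, one option is to pick a proxy at random for each request. This is only a minimal sketch: the class name `RotatingProxyMiddleware` and the addresses in `PROXIES` are hypothetical placeholders, and a real setup would also need to drop proxies that stop responding.

```python
import random


class RotatingProxyMiddleware(object):
    """Choose a proxy at random for every outgoing request."""

    # Hypothetical placeholder addresses; replace them with proxies you are allowed to use
    PROXIES = [
        "http://10.0.0.1:8080",
        "http://10.0.0.2:8080",
        "http://10.0.0.3:8080",
    ]

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware reads the proxy URL from request.meta['proxy']
        request.meta['proxy'] = random.choice(self.PROXIES)
        # Returning None (implicitly) lets Scrapy continue processing the request
```

It would be registered in DOWNLOADER_MIDDLEWARES in the same way as ProxyMiddleware, with a number lower than the one given to HttpProxyMiddleware.
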
############################Part 2#######################################

import hashlib
import time


# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://PROXY_ADDRESS:PORT"
        appkey = "yourappkey"
        secret = "yoursecretnumstring"
        paramMap = {"app_key": appkey, "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")}
        # Sign the parameters: secret + key/value pairs in sorted key order + secret, then uppercase MD5
        keys = sorted(paramMap)
        codes = "%s%s%s" % (secret, ''.join('%s%s' % (key, paramMap[key]) for key in keys), secret)
        # md5 needs bytes in Python 3
        sign = hashlib.md5(codes.encode('utf-8')).hexdigest().upper()
        paramMap["sign"] = sign
        keys = paramMap.keys()
        authHeader = "MYH-AUTH-MD5 " + '&'.join('%s=%s' % (key, paramMap[key]) for key in keys)
        request.headers['Proxy-Authorization'] = authHeader
        # print(authHeader)

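# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Path for Scrapy 1.0+; older releases used scrapy.contrib.downloadermiddleware.httpproxy
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'yourproject.middlewares.ProxyMiddleware': 100,
}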
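
The signing steps in the Part 2 middleware can be pulled into a small function so the header can be inspected without running a crawl. The sketch below mirrors the code in this post and has not been checked against the provider's documentation. The function name `build_auth_header` and the optional `timestamp` argument are additions for illustration, and the space after MYH-AUTH-MD5 is assumed.

```python
import hashlib
import time


def build_auth_header(app_key, secret, timestamp=None):
    """Build the Proxy-Authorization value using the signing steps from Part 2."""
    if timestamp is None:
        timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
    params = {"app_key": app_key, "timestamp": timestamp}
    # Concatenate key/value pairs in sorted key order and wrap them in the secret
    joined = "".join("%s%s" % (key, params[key]) for key in sorted(params))
    codes = "%s%s%s" % (secret, joined, secret)
    # The signature is the uppercase hex MD5 digest of that string
    params["sign"] = hashlib.md5(codes.encode("utf-8")).hexdigest().upper()
    # Parameters are emitted in insertion order: app_key, timestamp, sign
    return "MYH-AUTH-MD5 " + "&".join("%s=%s" % (k, v) for k, v in params.items())


print(build_auth_header("yourappkey", "yoursecretnumstring"))
```

Passing a fixed timestamp makes the output repeatable, which helps when comparing the signature against one computed by another tool.
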