
Getting Started with Web Crawling in Python (A Collection of Problems)

2015-11-26 10:23, 681 views

Portal: A Systematic Guide to Python Crawlers

Q1

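About proxies

1. Starting simple: problems right from the start

A problem that came up in the "grab a web page in minutes" lesson:

Running the code exactly as in the tutorial (at the office) raises an error: IOError: [Errno socket error] [Errno 10060]. Errno 10060 is the Windows socket error for a connection that timed out.

Step 1: When I changed the URL to an address on the company intranet, the code worked fine.

Step 2: Following this comment: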

“

You can do it even without the HTTP_PROXY environment variable. Try this sample:

import urllib2

proxy_support = urllib2.ProxyHandler({"http":"http://61.233.25.166:80"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)

html = urllib2.urlopen("http://www.google.com").read()
print html

In your case it really seems that the proxy server is refusing the connection.

”
After changing the code accordingly, it worked (at least for now):

import urllib2

proxy_support = urllib2.ProxyHandler({'http':'someProxy'})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
# open the URL; urlopen returns a file-like response object
response = urllib2.urlopen('http://sc.house.sina.com.cn/')
# read the response body
print response.read()
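The comment quoted above mentions the HTTP_PROXY environment variable. As a minimal sketch of what that refers to: when no explicit ProxyHandler is given, urllib's default handler looks up proxies from the environment, and the lookup can be inspected directly. The proxy address below is a placeholder chosen for illustration, not a real server.

```python
import os

try:  # Python 2, as used in this post
    from urllib import getproxies_environment
except ImportError:  # Python 3
    from urllib.request import getproxies_environment


def env_proxies():
    """Return the proxies found in *_proxy environment variables."""
    return getproxies_environment()


# Hypothetical proxy address, set here only to show the lookup.
os.environ["http_proxy"] = "http://proxy.example.invalid:8080"
proxies = env_proxies()
print(proxies.get("http"))
```

If this dictionary is empty on the office machine, urlopen() has no proxy to use unless one is installed explicitly, which would be consistent with the timeout described above.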

2. What urllib2.urlopen() can open, a custom opener cannot necessarily open

import urllib2
import cookielib

proxy_support = urllib2.ProxyHandler({'http':'someProxy'})
opener0 = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener0)

# build request
url = 'http://passport.csdn.net/account/login' # fails with opener.open(), works with urlopen()
urlin = 'http://neiwangneiwangxxxxx.org/jira/secure/Dashboard.jspa' # intranet address
request = urllib2.Request(url)
requestin = urllib2.Request(urlin)

cookie = cookielib.CookieJar()
chandler = urllib2.HTTPCookieProcessor(cookie)
print 'debug: ', cookie, type(cookie), chandler, type(chandler)
# build an opener that handles cookies
opener = urllib2.build_opener(chandler)
urllib2.install_opener(opener)

try:
    # opening through urllib2.urlopen succeeds
    #response = urllib2.urlopen(request)
    #print response.read()

    #response = opener.open(url) # fails too: the argument passed to open() makes no difference
    response = opener.open(request) # fails
except urllib2.URLError, e:
    if hasattr(e, 'reason'):
        print e.reason

for item in cookie: # populated when using response = urllib2.urlopen(request)
    print 'name:', item.name
    print 'value:', item.value

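In this code, trying to open url with the custom opener (note: the custom opener above, not urllib2.urlopen()) produces a familiar error: [Errno 10060]. At that point I tried the intranet address urlin on a whim (testing shows the intranet is reachable and the external network is not), and sure enough it ran happily. So the problem most likely comes down to one of the following:

1. The proxy again, or more precisely the fact that urllib2.install_opener() is called twice: once when setting up the proxy and once for the custom cookie opener. How many openers can urllib2 install? Only one: install_opener() sets a single global default opener, and each call replaces the previous one. The second opener is built without the ProxyHandler, so unless proxy settings are picked up from the environment, it never goes through someProxy.

Note that urllib2 opens URLs with urlopen(), while the custom opener uses its own open() method.

2. Once an opener has been successfully installed into urllib2, open the URL through urllib2; in my tests, opening directly with the opener did not work.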
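One way to avoid juggling two openers is to pass both handlers to a single build_opener() call, so the same opener uses the proxy and keeps cookies. The following is a minimal sketch under that assumption; the helper name build_proxy_cookie_opener and the proxy address are hypothetical, and the imports fall back to the Python 3 module names when urllib2 is unavailable.

```python
try:  # Python 2, as used in this post
    import urllib2 as request_mod
    import cookielib as cookiejar_mod
except ImportError:  # Python 3 equivalents
    import urllib.request as request_mod
    import http.cookiejar as cookiejar_mod


def build_proxy_cookie_opener(proxy_url):
    """Build one opener that routes HTTP through proxy_url and stores cookies."""
    cookie_jar = cookiejar_mod.CookieJar()
    proxy_handler = request_mod.ProxyHandler({"http": proxy_url})
    cookie_handler = request_mod.HTTPCookieProcessor(cookie_jar)
    # Both handlers go into the same opener, so neither setting is lost.
    opener = request_mod.build_opener(proxy_handler, cookie_handler)
    return opener, cookie_jar


# "proxy.example.invalid" is a placeholder, not a real proxy.
opener, cookie_jar = build_proxy_cookie_opener("http://proxy.example.invalid:8080")
# Optional: make urlopen() use the same opener as opener.open().
request_mod.install_opener(opener)
```

With this arrangement, opener.open(request) and urlopen(request) go through the same handler chain, so they should succeed or fail together.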

Q2

The chosen URL may cause a connection timeout

import urllib2
import urllib

# install proxy
proxy_support = urllib2.ProxyHandler({'http':'someProxy'})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)

url = 'http://passport.csdn.net/account/login'
url1 = 'https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
url0 = 'http://i.house.sina.com.cn/index.php?ctrl=login&returnurl=http://sc.house.sina.com.cn/index.shtml'

# use 'request' to pack parameters:url/data/headers..., urllib open 'request' directly
request = urllib2.Request(url0)
response = urllib2.urlopen(request)
print response.read()

Changing the URL being opened to url1 raises an error:

Traceback (most recent call last):
  File "xxxxxxxxxxx", line 17, in <module>
    response = urllib2.urlopen(request)
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1240, in https_open
    context=self._context)
  File "C:\Python27\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>

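The same page opened through the variable url raises no error. Note that the proxy installed earlier is registered only for the http protocol, while url1 is an https address!

So how can a proxy be installed for both http and https at the same time?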
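One possible answer, as a minimal sketch: ProxyHandler takes a dictionary keyed by scheme, so an "https" entry can sit next to the "http" one. The helper name build_dual_proxy_opener and the proxy address are hypothetical, and whether a given proxy actually accepts HTTPS traffic is an assumption that has to be checked against the real proxy.

```python
try:  # Python 2, as used in this post
    import urllib2 as request_mod
except ImportError:  # Python 3 equivalent
    import urllib.request as request_mod


def build_dual_proxy_opener(http_proxy, https_proxy):
    """Build an opener whose ProxyHandler covers both http and https URLs."""
    proxy_handler = request_mod.ProxyHandler({
        "http": http_proxy,
        "https": https_proxy,
    })
    return request_mod.build_opener(proxy_handler), proxy_handler


# Placeholder address; the same proxy is assumed to serve both schemes.
PROXY = "http://proxy.example.invalid:8080"
dual_opener, dual_handler = build_dual_proxy_opener(PROXY, PROXY)
# request_mod.install_opener(dual_opener)  # then urlopen(url1) would use it
```

The handler only registers a proxy for the schemes listed in the dictionary, which matches the behaviour seen above: with only an "http" entry, the https request bypassed the proxy and timed out.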

Q3

Simulating a login

import urllib

values = { # do the keys come from the name attribute of the page's input tags?
    'os_username': 'xx@xx.com',
    'os_password': '123'
}
data = urllib.urlencode(values)

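So when building the data dictionary for the login account, how should the keys be named? Do they have to match the name attributes of the input tags on the page, or can they be chosen freely? My guess was that they have to follow the way the page is written.

Answer: capture the login request with Fiddler. The WebForms content shown there is exactly what belongs in values, and it is best to include every Name-Value pair. One earlier login attempt failed precisely because not all of the values had been included.

Code for the failed login: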

values = {# use source file 'input' tag's name???
'username' : 'hahaha@xxxsoft.com',
'password' : '233333',
}

Code for the successful login:

values = {# use source file 'input' tag's name???
'username' : 'hahaha@xxxsoft.com',
'password' : '233333',
'os_destination' : '',
'user_role' : '',
'atl_token' : '',
'login' : 'Log In'
}

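Every entry in the values dictionary must match the data captured by Fiddler, to reduce errors.

Tips: Copy the fetched content, save it as an .html file, and open it in a browser to check whether the post-login page was actually fetched.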
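Putting the pieces of Q3 together, a login attempt could be sketched roughly as follows. The field names come from the successful values dictionary above; the login URL, the helper names build_login_request and fetch_and_save, and the output file name are hypothetical, and the real form fields should always be taken from what Fiddler shows.

```python
try:  # Python 2, as used in this post
    import urllib2 as request_mod
    import cookielib as cookiejar_mod
    from urllib import urlencode
except ImportError:  # Python 3 equivalents
    import urllib.request as request_mod
    import http.cookiejar as cookiejar_mod
    from urllib.parse import urlencode


def build_login_request(login_url, form_fields):
    """Encode form_fields as the body of a POST request to login_url."""
    body = urlencode(form_fields).encode("ascii")
    return request_mod.Request(login_url, data=body)


def fetch_and_save(opener, request, out_path):
    """Send request and save the response body so it can be viewed in a browser."""
    body = opener.open(request, timeout=10).read()
    with open(out_path, "wb") as handle:
        handle.write(body)
    return len(body)


# Every Name-Value pair captured by Fiddler, including the empty ones.
values = {
    "username": "hahaha@xxxsoft.com",
    "password": "233333",
    "os_destination": "",
    "user_role": "",
    "atl_token": "",
    "login": "Log In",
}
# Hypothetical login URL, used only to build the request object.
login_request = build_login_request("http://jira.example.invalid/login.jsp", values)

# Cookies set by the login response are kept in cookie_jar for later requests.
cookie_jar = cookiejar_mod.CookieJar()
login_opener = request_mod.build_opener(request_mod.HTTPCookieProcessor(cookie_jar))
# fetch_and_save(login_opener, login_request, "after_login.html")  # needs network
```

Passing data to Request is what turns the request into a POST; the last line is left commented out because it needs a reachable server.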

Q4

How to set up a proxy
