
Python Crawlers

2014-01-03 16:36
Using urllib and urllib2 together

1. Sending form data with POST

import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'
values = {'username': 'why222',
          'password': 'test',
          'passconf': 'test',
          'email': 'test@test.com',
          }

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
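The urlencode call above is what turns the dict into an application/x-www-form-urlencoded request body; a minimal sketch of its output (shown with Python 3's urllib.parse, where the Python 2 urllib.urlencode function now lives):

```python
from urllib.parse import urlencode  # Python 2 equivalent: urllib.urlencode

# Encode a form dict into a key=value&key=value body string
body = urlencode({'username': 'why222', 'password': 'test'})
# body is now 'username=why222&password=test'
```

Passing this string as the second argument of Request is what makes urllib2 issue a POST instead of a GET.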

2. For a GET request, first append the urlencoded query string to the URL, then call urllib2's urlopen.
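For GET, the same encoded string rides in the URL itself rather than being sent as a body; a sketch using the form URL from the POST example (Python 3 names, no network access performed):

```python
from urllib.parse import urlencode

# Encode the parameters, then append them to the URL with '?'
params = urlencode({'page': 2, 'q': 'python'})
full_url = 'http://localhost/CI-github/index.php/form?' + params
# urllib.request.urlopen(full_url) would then issue the GET
```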

3. Spoofing headers; note that some headers have default values and must be overridden.

import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0 FirePHP/0.7.4'
values = {'username': 'WHYaaa',
          'password': 'SDU',
          'passconf': 'SDU',
          'email': 'Python@qq.com'}
headers = {'User-Agent': user_agent,
           'Host': 'localhost'}  # header name is 'Host', with no trailing colon

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
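The headers dict is attached when the Request object is built; a small sketch (Python 3's urllib.request, where urllib2's Request class moved) showing that the custom User-Agent replaces the default Python-urllib value, with no network access needed:

```python
import urllib.request

req = urllib.request.Request('http://localhost/CI-github/index.php/form',
                             headers={'User-Agent': 'Mozilla/5.0'})
# Request normalizes header names by capitalizing only the first letter,
# so the stored key is 'User-agent'
ua = req.get_header('User-agent')
```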

When the host cannot be found, urlopen raises a "getaddrinfo failed" error.
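That message originates in the socket layer's getaddrinfo; urllib2 surfaces it wrapped in a URLError. A sketch catching it directly, using a hostname under the reserved .invalid TLD, which is guaranteed never to resolve:

```python
import socket

try:
    socket.getaddrinfo('no-such-host.invalid', 80)
    resolved = True
except socket.gaierror:  # this is the "getaddrinfo failed" error
    resolved = False
```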

A simple sequential page fetch:

import string, urllib2

def baidu_tieba(url, begin_page, end_page):
    for i in range(begin_page, end_page + 1):
        sName = string.zfill(i, 5) + '.html'
        print 'class' + str(i) + ' fetch' + sName + '......'
        f = open(sName, 'w+')
        m = urllib2.urlopen(url + str(i)).read()
        f.write(m)
        f.close()

bdurl = 'http://tieba.baidu.com/p/2296017831?pn='
iPostBegin = 1
iPostEnd = 10
baidu_tieba(bdurl, iPostBegin, iPostEnd)
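The zero-padded filename scheme used in that loop can be reproduced with str.zfill (which replaced the Python 2 string.zfill function); a quick sketch with a hypothetical helper name:

```python
def page_filename(i):
    # Zero-pad the page number to five digits, as the fetch loop above does
    return str(i).zfill(5) + '.html'

names = [page_filename(i) for i in range(1, 4)]
# names is ['00001.html', '00002.html', '00003.html']
```

Fixed-width names keep the saved pages in order when the directory is sorted lexicographically.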

An opener example:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
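The addheaders list set on the opener is sent with every request that opener makes; a sketch (Python 3, where build_opener moved to urllib.request) that configures and inspects it without opening a connection:

```python
import urllib.request

# build_opener returns an OpenerDirector; addheaders is its default header list
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
```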

References:

http://docs.python.org/2/library/urllib2.html

http://blog.csdn.net/column/details/why-bug.html

Scrapy can serve as the crawler framework; pairing it with WebKit (or the Rhino JavaScript engine) makes it possible to crawl dynamic, script-rendered pages.