Python Web Scraping
2014-01-03 16:36
Using urllib and urllib2 together
1. Sending form data via POST
import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'
values = {'username': 'why222',
          'password': 'test',
          'passconf': 'test',
          'email': 'test@test.com',
          }
# urlencode turns the dict into 'username=why222&password=test&...'
data = urllib.urlencode(values)
# Passing a data argument makes urllib2 send a POST request
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
2. Connecting via GET: first build the encoded query string, then append it to the URL and pass the result to urllib2's urlopen (see the sketch below)
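A minimal sketch of a GET request, reusing the same local form URL and field values from the POST example above (both are only illustrative):

import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'
values = {'username': 'why222', 'password': 'test'}
# For GET, the encoded parameters are appended to the URL
# instead of being sent as a request body
data = urllib.urlencode(values)
full_url = url + '?' + data
response = urllib2.urlopen(full_url)
the_page = response.read()
print the_page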
3. Forging headers: note that some headers have default values and must be overridden explicitly
import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0 FirePHP/0.7.4'
values = {'username': 'WHYaaa',
          'password': 'SDU',
          'passconf': 'SDU',
          'email': 'Python@qq.com'}
# Custom headers override urllib2's defaults; note the key is 'Host', not 'Host:'
headers = {'User-Agent': user_agent,
           'Host': 'localhost'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
When the host cannot be found, the request fails with a "getaddrinfo failed" error; a sketch of catching it follows below.
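A small sketch of catching that failure; the unreachable hostname below is made up purely for illustration:

import urllib2

try:
    urllib2.urlopen('http://no-such-host.invalid/')
except urllib2.URLError as e:
    # For an unresolvable host, e.reason holds the underlying socket error,
    # which on Windows typically reads "getaddrinfo failed"
    print 'Request failed:', e.reason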
Fetching a simple sequence of pages:
import string, urllib2

def baidu_tieba(url, begin_page, end_page):
    for i in range(begin_page, end_page + 1):
        # Pad the page number to five digits for the local file name
        sName = string.zfill(i, 5) + '.html'
        print 'Fetching page ' + str(i) + ' -> ' + sName + '......'
        f = open(sName, 'w+')
        # Append the page number to the base URL and download the page
        m = urllib2.urlopen(url + str(i)).read()
        f.write(m)
        f.close()

bdurl = 'http://tieba.baidu.com/p/2296017831?pn='
iPostBegin = 1
iPostEnd = 10
baidu_tieba(bdurl, iPostBegin, iPostEnd)
Opener example:
import urllib2

opener = urllib2.build_opener()
# Headers set on the opener are sent with every request it makes
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
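If you want plain urllib2.urlopen calls to pick up the same headers, the opener can be installed globally; a brief sketch building on the opener above:

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# After install_opener, every urllib2.urlopen call goes through this opener
urllib2.install_opener(opener)
print urllib2.urlopen('http://www.example.com/').read()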
References:
http://docs.python.org/2/library/urllib2.html
http://blog.csdn.net/column/details/why-bug.html
Scrapy can be used as the crawler framework; combined with a WebKit rendering engine it can crawl dynamic (JavaScript-rendered) pages. The Rhino JavaScript engine is another option.
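For reference, a minimal Scrapy spider sketch; the spider name, the start URL (borrowed from the Tieba example above), and the link-extraction selector are all illustrative assumptions, not part of the original post:

import scrapy

class TiebaSpider(scrapy.Spider):
    name = 'tieba'
    start_urls = ['http://tieba.baidu.com/p/2296017831?pn=1']

    def parse(self, response):
        # Yield every link target found on the page (selector is an assumption)
        for href in response.css('a::attr(href)').extract():
            yield {'link': href}

This only covers static pages; crawling JavaScript-rendered content would additionally require the WebKit-style rendering step mentioned above, which is not shown here.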