
Python Web Crawler

2014-05-02 17:50
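# -*- coding: cp936 -*-
import urllib2

# Directory where the downloaded pages are saved
path = "G:/New Knowledge/practice/python/tmp/"

def baidu_tieba(url, begin_page, end_page):
    # Download pages begin_page..end_page (inclusive) of a Baidu Tieba thread
    for i in range(begin_page, end_page + 1):
        Name = path + str(i).zfill(5) + '.html'  # zero-pad the page number to 5 digits
        print 'Downloading page ' + str(i) + ' and saving it as ' + Name
        data = urllib2.urlopen(url + str(i)).read()
        f = open(Name, 'wb')  # binary mode, so the bytes are written unchanged
        f.write(data)
        f.close()

bdurl = raw_input(u'input url (without the trailing page number)\n')
begin_page = raw_input("begin page")
end_page = raw_input("endpage")
# Fall back to defaults when the user just presses Enter
if not bdurl:
    bdurl = 'http://tieba.baidu.com/p/2296017831?pn='
if not begin_page:
    begin_page = 1
if not end_page:
    end_page = 10
baidu_tieba(bdurl, int(begin_page), int(end_page))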
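The script above is written for Python 2 (`urllib2`, `raw_input`, the `print` statement). As a rough sketch only, the same download loop might look like this on Python 3, assuming `urllib.request.urlopen` as the replacement for `urllib2.urlopen`. The names `page_filename`, `fetch_bytes`, `download_pages` and the `fetch` parameter are made up for this illustration and are not part of the original script.

```python
import os
from urllib.request import urlopen


def page_filename(directory, page, width=5):
    """Build the output path for a page, zero-padding the number (hypothetical helper)."""
    return os.path.join(directory, str(page).zfill(width) + ".html")


def fetch_bytes(url):
    """Download url and return the raw response body as bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def download_pages(url, begin_page, end_page, directory, fetch=fetch_bytes):
    """Save pages begin_page..end_page (inclusive) and return the written paths.

    `fetch` is an assumption of this sketch: it lets a caller swap in another
    downloader (for example a stub) without touching the loop.
    """
    saved = []
    for i in range(begin_page, end_page + 1):
        name = page_filename(directory, i)
        print("Downloading page %d and saving it as %s" % (i, name))
        data = fetch(url + str(i))
        with open(name, "wb") as f:  # bytes in, bytes out: no newline translation
            f.write(data)
        saved.append(name)
    return saved
```

Calling `download_pages('http://tieba.baidu.com/p/2296017831?pn=', 1, 10, path)` would correspond to the default run of the original script. The target directory has to exist beforehand, as it does in the original.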


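input url (without the trailing page number)
http://tieba.baidu.com/p/301797825
begin page0
endpage9
Downloading page 1 and saving it as 00001.html
Downloading page 2 and saving it as 00002.html
Downloading page 3 and saving it as 00003.html
Downloading page 4 and saving it as 00004.html
Downloading page 5 and saving it as 00005.html
Downloading page 6 and saving it as 00006.html
Downloading page 7 and saving it as 00007.html
Downloading page 8 and saving it as 00008.html
Downloading page 9 and saving it as 00009.html
>>> ================================ RESTART ================================
>>>
input url (without the trailing page number)

begin page
endpage
Downloading page 1 and saving it as G:/New Knowledge/practice/python/tmp/00001.html
Downloading page 2 and saving it as G:/New Knowledge/practice/python/tmp/00002.html
Downloading page 3 and saving it as G:/New Knowledge/practice/python/tmp/00003.html
Downloading page 4 and saving it as G:/New Knowledge/practice/python/tmp/00004.html
Downloading page 5 and saving it as G:/New Knowledge/practice/python/tmp/00005.html
Downloading page 6 and saving it as G:/New Knowledge/practice/python/tmp/00006.html
Downloading page 7 and saving it as G:/New Knowledge/practice/python/tmp/00007.html
Downloading page 8 and saving it as G:/New Knowledge/practice/python/tmp/00008.html
Downloading page 9 and saving it as G:/New Knowledge/practice/python/tmp/00009.html
Downloading page 10 and saving it as G:/New Knowledge/practice/python/tmp/00010.html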
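
In the first run, the URL was typed without the `?pn=` suffix that the default URL carries. Since the script simply concatenates `url + str(i)`, the page number would then be glued onto the thread id instead of being sent as a query parameter. One possible way to avoid depending on how the URL is typed is to set `pn` explicitly. This is a Python 3 sketch: `page_url` is a hypothetical helper, and treating `pn` as the page parameter is an assumption taken from the default URL in the script.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(thread_url, page):
    """Return thread_url with its `pn` query parameter set to `page`.

    Hypothetical helper: any existing `pn` value (including an empty one)
    is dropped and replaced; other query parameters are kept in order.
    """
    parts = urlsplit(thread_url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "pn"]
    query.append(("pn", str(page)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))
```

With this helper, `page_url('http://tieba.baidu.com/p/2296017831', 3)` and `page_url('http://tieba.baidu.com/p/2296017831?pn=', 3)` both produce a URL ending in `?pn=3`.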