python: a hand-written Tieba crawler
2017-05-09 10:09
# -*- coding:utf-8 -*-
import urllib2
import re


def load_Page(url, begin_page, end_page):
    '''
    Load the Tieba list pages begin_page..end_page and save each to a file.
    '''
    for i in range(begin_page, end_page + 1):
        print "-------- Collecting data for page %d --------" % (i)
        # The pn offset advances by 50 per page, so page 1 starts at 0.
        pn = 50 * (i - 1)
        my_url = url + str(pn)
        html = Get_Html(my_url)
        title = GetMainInfo(html)
        # Concatenate every matched block into one string.
        sumTxt = "".join(title)
        # filename = "page " + str(i) + " data.html"
        SaveToTxt(str(i) + ".html", sumTxt)
        print "-------- Finished collecting page %d --------" % (i)


def Get_Html(url):
    """
    Fetch a web page and return its contents.
    """
    User_Agent = "Mozilla/5.0 (X11; U; Linux i686)Gecko/20071127 Firefox/2.0.0.11"
    Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
    headers = {"User-Agent": User_Agent, "Accept": Accept}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    html = response.read()
    return html


def SaveToTxt(filename, txt):
    # Mode 'a' appends, so rerunning the script adds to an existing file.
    f = open(filename, 'a')
    f.write(txt)
    f.close()


def GetMainInfo(html):
    # re.S lets '.' match newlines. Note that (.*) is greedy, so a match runs
    # from the first matching <div> to the last </div> in the page.
    regex = re.compile("<div class=\"col2_right j_threadlist_li_right \">(.*)</div>", re.S)
    return regex.findall(html)


# main
if __name__ == "__main__":
    # The offset is appended directly to the URL, so it should end with "pn=".
    print "Enter the Tieba URL"
    url = raw_input()
    print "Enter the first page number"
    begin_page = int(raw_input())
    print "Enter the last page number"
    end_page = int(raw_input())
    load_Page(url, begin_page, end_page)
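The two core steps of the script, computing the pn offset for each page and pulling blocks out of the HTML with a regex, can be tried offline without touching the network. Below is a minimal sketch in Python 3 rather than the Python 2 of the script. The helper names `page_offsets` and `extract_threads` and the sample markup are hypothetical, invented for illustration. It also swaps in a non-greedy `(.*?)` so that each block is captured separately, which differs from the script's greedy pattern.

```python
import re

# Pattern for one thread block; the class name is copied from the script above.
# Assumption: a non-greedy (.*?) is used here, so each match ends at the first
# </div> that follows its opening tag.
THREAD_RE = re.compile(
    r'<div class="col2_right j_threadlist_li_right ">(.*?)</div>', re.S
)


def page_offsets(begin_page, end_page, page_size=50):
    """Return the pn offsets for pages begin_page..end_page (1-based)."""
    return [page_size * (i - 1) for i in range(begin_page, end_page + 1)]


def extract_threads(html):
    """Return the inner HTML of every matching block."""
    return THREAD_RE.findall(html)


# Hypothetical sample markup, not taken from a real Tieba page.
sample_html = (
    '<div class="col2_right j_threadlist_li_right ">first thread</div>\n'
    '<div class="col2_right j_threadlist_li_right ">second\nthread</div>'
)

print(page_offsets(1, 3))            # [0, 50, 100]
print(extract_threads(sample_html))  # ['first thread', 'second\nthread']
```

On real pages the matched blocks are likely to contain nested `<div>` elements, in which case a non-greedy pattern stops at the first inner `</div>`. An HTML parser is usually a more robust choice than a regex for that kind of markup.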