Getting Started with Python Web Crawlers
2017-12-14 15:33
This is a small crawler demo I wrote recently, using just a few functions. It implements basic web-page crawling, with Tuniu (tuniu.com) as the example site.
```python
import urllib2
import re
import urlparse
import robotparser
import datetime
import time


class Throttle:
    """Add a delay between two downloads to the same domain."""

    def __init__(self, delay):
        # minimum delay (in seconds) between downloads from one domain
        self.delay = delay
        # timestamp of when each domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            # total_seconds() rather than .seconds, so the delta is not truncated
            sleep_sec = self.delay - (datetime.datetime.now() - last_accessed).total_seconds()
            if sleep_sec > 0:
                time.sleep(sleep_sec)
                print 'sleep:', sleep_sec, 's'
        self.domains[domain] = datetime.datetime.now()


def download(url, proxy=None, user_agent='wawp', num_retries=2):
    """Download a page, optionally through a proxy, retrying on 5xx errors."""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_param = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_param))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason, '\n'
        html = ''
        # retry only on server-side (5xx) errors
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, proxy, user_agent, num_retries - 1)
    return html


def get_links(html, regstr=r'http://[^w].*\.tuniu\.com'):
    """Extract links matching the given regex from an HTML page."""
    return re.compile(regstr).findall(html)


def deduplicate_list(input_list):
    """Return the list with duplicates removed, preserving order."""
    new_list = []
    for x in input_list:
        if x not in new_list:
            new_list.append(x)
    return new_list


def crawl_sitemap(url):
    """Download every (deduplicated) link found on a sitemap page."""
    sitemap = download(url)
    links = get_links(sitemap)
    print 'before links are:', links
    new_links = deduplicate_list(links)
    print 'after links are:', new_links
    for link in new_links:
        print link
        download(link)


def get_robot(url):
    """Fetch and parse robots.txt for the site hosting the given URL."""
    rp = robotparser.RobotFileParser()
    # use an absolute path, so the seed URL's own path does not shift robots.txt
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp


def link_crawler(seed_url, max_depth=3, link_regex=r'http://[^w][^"]*\.tuniu\.com',
                 delay=1, proxy=None):
    """Crawl outward from seed_url up to max_depth, honoring robots.txt
    and keeping a per-domain delay between requests."""
    rp = get_robot(seed_url)
    throttle = Throttle(delay)
    crawl_queue = [seed_url]
    # map each seen URL to the depth at which it was discovered
    seen = {seed_url: 0}
    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        if depth != max_depth:
            if rp.can_fetch('heimaojingzhang', url):  # user-agent name is just a joke
                throttle.wait(url)
                html = download(url, proxy)
                for link in get_links(html, link_regex):
                    link = urlparse.urljoin(seed_url, link)
                    if link not in seen:
                        seen[link] = depth + 1
                        crawl_queue.append(link)
            else:
                print 'Blocked by robots.txt', url

# TODO:
# fix bugs (in regex)    done on: 2017/09/23 23:16
# delay                  done on: 2017/09/24 21:36
# proxy
# depth                  done on: 2017/09/23 23:10

if __name__ == '__main__':
    link_crawler('http://www.tuniu.com/corp/sitemap.shtml',
                 link_regex=r'http://www\.tuniu\.com/guide/[^"]*')
    # html = download('http://www.tuniu.com/corp/sitemap.shtml')
    # print html
```
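The TODO list above still marks proxy support as unfinished, but `download()` already installs a `ProxyHandler` when a proxy is passed, so `link_crawler` can be routed through one today. A quick usage sketch; the proxy address here is a made-up placeholder, not something from the original post:

```python
# crawl through a local HTTP proxy; 127.0.0.1:8087 is a hypothetical address
link_crawler('http://www.tuniu.com/corp/sitemap.shtml',
             link_regex=r'http://www\.tuniu\.com/guide/[^"]*',
             delay=2,
             proxy='http://127.0.0.1:8087')
```

With `delay=2`, the `Throttle` instance enforces at least two seconds between consecutive requests to the same domain, which keeps the demo polite toward the example site.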
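The demo targets Python 2 (`urllib2`, `urlparse`, `robotparser`). If you want to follow along on Python 3, those modules became `urllib.request`, `urllib.parse`, and `urllib.robotparser`. Here is a minimal sketch of just the `download()` helper under Python 3, not a full port of the crawler:

```python
import urllib.error
import urllib.parse
import urllib.request


def download(url, proxy=None, user_agent='wawp', num_retries=2):
    """Python 3 equivalent of the demo's download(): fetch a page,
    optionally via a proxy, retrying on 5xx errors."""
    print('Downloading:', url)
    request = urllib.request.Request(url, headers={'User-agent': user_agent})
    opener = urllib.request.build_opener()
    if proxy:
        scheme = urllib.parse.urlparse(url).scheme
        opener.add_handler(urllib.request.ProxyHandler({scheme: proxy}))
    try:
        # .read() returns bytes in Python 3, so decode to text
        html = opener.open(request).read().decode('utf-8', errors='replace')
    except urllib.error.URLError as e:
        print('Downloading error:', getattr(e, 'reason', e))
        html = ''
        # HTTPError (a URLError subclass) carries .code; retry only on 5xx
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, proxy, user_agent, num_retries - 1)
    return html
```

The rest of the crawler carries over the same way: swap `urlparse.urlparse`/`urljoin` for `urllib.parse.urlparse`/`urljoin`, `robotparser.RobotFileParser` for `urllib.robotparser.RobotFileParser`, and the `print` statements for `print()` calls.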