A Beginner's Introduction to Python Web Crawlers
2017-03-16 17:46
1. How do you get every URL on a website?
#coding:utf-8
import urllib2
import re
import urlparse
import robotparser  # used below to honor robots.txt

# download fetches a page with a custom User-agent header and retries
# the request up to num_retries times on 5xx server errors
def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry with a decremented retry budget
                return download(url, user_agent, num_retries - 1)
    return html

# crawl_sitemap downloads a sitemap page, pulls the target out of every
# <a href="(site)"> tag, and downloads each linked page
def crawl_sitemap(url):
    sitemap = download(url)
    links = re.findall('<a href="(\S*)">', sitemap)
    for link in links:
        html = download(link)

# crawl_sitemap('http://sitemap.163.com/')

# link_crawler crawls every link reachable from the seed URL whose
# address matches link_regex, tracking visited pages in a seen set
def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                # resolve relative links against the seed URL
                link = urlparse.urljoin(seed_url, link)
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)

# get_links returns the href value of every <a> tag in the page
def get_links(html):
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

Finally, run link_crawler('http://map.163.com', '.*') (the second argument is the pattern a link must match to be followed) and the crawler will visit every URL it can reach.
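The snippet imports robotparser for robots.txt handling but never calls it. A minimal sketch of how that check could gate the crawl (my assumption of the intent, not code from the article; the 'wswp' user-agent matches the download default):

import robotparser

# fetch and parse the site's robots.txt before crawling
rp = robotparser.RobotFileParser()
rp.set_url('http://map.163.com/robots.txt')
rp.read()

# only start crawling if robots.txt allows our user-agent
if rp.can_fetch('wswp', 'http://map.163.com/'):
    link_crawler('http://map.163.com', '.*')
else:
    print 'Blocked by robots.txt'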
2. How do you extract URLs from a page?
1. You can use regular expressions:

import re
import urllib2

def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1)
    return html

url = 'http://sitemap.163.com/'
html = download(url)
# print the target of every <a href="..."> tag on the page
print re.findall('<a href="(\S*)">', html)

2. You can use BeautifulSoup:
from bs4 import BeautifulSoup

# read the local page to be parsed
file_object = open('index.html')
try:
    html = file_object.read()
finally:
    file_object.close()

soup = BeautifulSoup(html, 'html.parser')
# print the href of every <a> tag in the document
for link in soup.find_all('a'):
    print(link.get('href'))

3. You can use lxml for the extraction, as sketched below.
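The article names lxml but gives no example. A minimal sketch, assuming lxml is installed and reusing the local index.html from the BeautifulSoup snippet:

import lxml.html

# parse the page and select the href attribute of every <a> tag via XPath
tree = lxml.html.fromstring(open('index.html').read())
for link in tree.xpath('//a/@href'):
    print(link)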