Python Crawler Notes (Getting All External Links Across an Entire Site)
2016-09-28 14:59
#!/usr/bin/env python
# coding=utf-8
# Python 2 script: urllib2 and urlparse are Python 2 modules.
import urllib2
from urlparse import urljoin
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())


# Retrieves a list of all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    # Finds all links that begin with "/" or contain the site's own domain
    for link in bsObj.findAll("a", href=re.compile("^(/|.*" + includeUrl + ")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks


# Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    # Finds all links that start with "http" or "www" and
    # do not contain the current URL
    for link in bsObj.findAll("a", href=re.compile("^(http|www)((?!" + excludeUrl + ").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks


# Splits a URL on "/" after dropping "http://"; element 0 is the domain
def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts


def getRandomExternalLink(startingPage):
    html = urllib2.urlopen(startingPage)
    bsObj = BeautifulSoup(html, "html.parser")
    domain = splitAddress(startingPage)[0]
    externalLinks = getExternalLinks(bsObj, domain)
    if len(externalLinks) == 0:
        # No external links on this page: fall back to a random internal link
        internalLinks = getInternalLinks(bsObj, domain)
        link = internalLinks[random.randint(0, len(internalLinks) - 1)]
        # Resolve relative links such as "/about" so they can be opened later
        return urljoin(startingPage, link)
    else:
        return externalLinks[random.randint(0, len(externalLinks) - 1)]


# Hops from link to link indefinitely, starting at startingSite
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print("Random external link is: " + externalLink)
    followExternalOnly(externalLink)


# Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()


def getAllExternalLinks(siteUrl):
    html = urllib2.urlopen(siteUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    domain = splitAddress(siteUrl)[0]
    internalLinks = getInternalLinks(bsObj, domain)
    externalLinks = getExternalLinks(bsObj, domain)
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        # Resolve relative links against the current page before recursing
        link = urljoin(siteUrl, link)
        if link not in allIntLinks:
            print("About to get link:" + link)
            allIntLinks.add(link)
            getAllExternalLinks(link)


getAllExternalLinks("http://iamnotgay.cn")
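The two regular expressions in getInternalLinks and getExternalLinks do all of the link classification, so it can help to try them on plain strings without fetching any page. The snippet below is a minimal sketch of that idea, not part of the original script: the function name classify_hrefs, the sample hrefs, and the domain example.com are made up for illustration, and re.escape is an added assumption so that the dots in the domain match literally.

```python
import re


def classify_hrefs(hrefs, domain):
    """Split hrefs into (internal, external) with the script's two patterns."""
    # Same shape as the patterns in the script; re.escape is an addition here.
    internal_pattern = re.compile("^(/|.*" + re.escape(domain) + ")")
    external_pattern = re.compile("^(http|www)((?!" + re.escape(domain) + ").)*$")
    internal, external = [], []
    for href in hrefs:
        if internal_pattern.search(href):
            internal.append(href)
        elif external_pattern.search(href):
            external.append(href)
        # Anything else (anchors, mailto: links, ...) matches neither pattern.
    return internal, external


# Hypothetical hrefs, as they might appear in <a> tags on example.com
sample_hrefs = [
    "/about",
    "http://example.com/contact",
    "http://other.example.org/page",
    "www.example.net",
    "mailto:someone@mail.test",
    "#top",
]

internal, external = classify_hrefs(sample_hrefs, "example.com")
print(internal)  # ['/about', 'http://example.com/contact']
print(external)  # ['http://other.example.org/page', 'www.example.net']
```

Note that the internal pattern accepts any href that merely contains the domain, so a link such as a mailto: address on the same domain would also be counted as internal.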
Figure: flowchart of the site crawler that collects all external links
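The flowchart image is not reproduced here. As a rough sketch of the flow that getAllExternalLinks follows (load a page, record its external links, then visit every internal link not seen before), the example below walks a small in-memory site instead of making HTTP requests. It is an illustration under stated assumptions rather than the author's code: it is written for Python 3 (in Python 2, urljoin and urlparse live in the urlparse module), it swaps the recursion for a queue and the regular expressions for a hostname comparison, and the names collect_external_links, fetch_links, and FAKE_SITE are hypothetical.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def collect_external_links(start_url, fetch_links):
    """Walk one site breadth-first; return (external_links, visited_pages).

    fetch_links(url) is a caller-supplied function that returns the raw
    href values found on that page.
    """
    site = urlparse(start_url).netloc
    external = set()
    visited = set()
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        for href in fetch_links(page):
            target = urljoin(page, href)  # resolve relative hrefs like "/about"
            parts = urlparse(target)
            if parts.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar links
            if parts.netloc == site:
                if target not in visited:
                    queue.append(target)  # internal: crawl it later
            else:
                external.add(target)  # external: record it, do not follow
    return external, visited


# Hypothetical in-memory site standing in for real HTTP requests
FAKE_SITE = {
    "http://example.com/": ["/about", "http://partner.example.org/"],
    "http://example.com/about": ["/", "/contact", "http://news.example.net/story"],
    "http://example.com/contact": ["mailto:hi@example.com", "http://partner.example.org/"],
}

external, visited = collect_external_links(
    "http://example.com/", lambda url: FAKE_SITE.get(url, [])
)
print(sorted(external))
# ['http://news.example.net/story', 'http://partner.example.org/']
```

Using a queue avoids hitting Python's recursion limit on sites with many pages, and passing fetch_links in as a parameter keeps the traversal logic testable without network access.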