
Python Web Crawler (2): BFS That Keeps Fetching URLs and Saves Them to a File

2013-09-10 11:51

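The simple demo in the previous post, "Python Web Crawler (1)", cannot really be called a crawler yet. It is only the groundwork, because it does not follow links automatically.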

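This post adds the following features:

[1] Keep fetching URLs with a breadth-first search (BFS) until the queue is empty

[2] Write every URL to a file

[3] Handle unreachable or failing URLs with try/except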
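The three features above can be tried out without touching the network. What follows is a minimal Python 3 sketch, not the author's code: `crawl`, `fetch_links`, `fake_fetch`, and the `example` URLs are hypothetical names introduced for illustration. The `seen` set, which stops a URL from being queued twice, is an addition that the spider.py listing does not have.

```python
from collections import deque
import io


def crawl(start_url, fetch_links, out):
    """Breadth-first crawl that writes each newly discovered URL to `out`.

    `fetch_links(url)` is a hypothetical callable returning the links found
    on `url`; it may raise an exception for pages that cannot be fetched.
    """
    queue = deque([start_url])
    seen = {start_url}  # assumption: deduplicate so the crawl can terminate
    while queue:  # [1] BFS until the queue is empty
        url = queue.popleft()
        try:
            links = fetch_links(url)
        except Exception:
            continue  # [3] skip unreachable or failing URLs
        for link in links:
            if link not in seen:
                seen.add(link)
                out.write(link + "\n")  # [2] record the URL
                queue.append(link)


# A tiny in-memory "web" standing in for real pages.
graph = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example", "http://d.example"],
    "http://c.example": ["http://a.example"],
}


def fake_fetch(url):
    # http://d.example is missing on purpose: the KeyError simulates a dead link.
    return graph[url]


buf = io.StringIO()
crawl("http://a.example", fake_fetch, buf)
print(buf.getvalue(), end="")
```

Passing the fetch step in as a function keeps the traversal separate from the HTTP and parsing details, so the loop can be checked against a small dictionary before it is pointed at a real site.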

spider.py

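# -*- coding: cp936 -*-
# Python 2 only: sgmllib, Queue and urllib.urlopen are not available under these names in Python 3
import urllib, Queue, sgmllib, re

# An href is accepted only if it contains "http://" followed by at least one character
URL_PATTERN = re.compile(r'http://(.+?)')

class URLList(sgmllib.SGMLParser):
    def reset(self):
        sgmllib.SGMLParser.reset(self)
        # maxsize < 1 means the queue is unbounded
        self.URLqueue = Queue.Queue(maxsize=-1)

    def start_a(self, attrs):
        # Called for every <a> tag; queue each href that looks like an absolute URL
        href = [v for k, v in attrs if k == 'href']
        for u in href:
            # An empty list is falsy (like False, 0, '', {}, ()); len(...) == 0 would also work
            if not re.findall(URL_PATTERN, u):
                continue
            self.URLqueue.put(u)

def getURLList(url, parser):
    # Fetch one page and feed it to the parser, which queues the links it finds
    try:
        URLdata = urllib.urlopen(url)
        parser.feed(URLdata.read())
        URLdata.close()
    except Exception:
        # Skip URLs that cannot be fetched or parsed
        return

startURL = "http://www.baidu.com"
parser = URLList()
getURLList(startURL, parser)

# Drop the leading "http://" to build the output file name
outfile = startURL[7:] + ".txt"
out = open(outfile, 'w+')

try:
    # BFS: take the oldest URL, record it, then queue the links found on that page
    while not parser.URLqueue.empty():
        url = parser.URLqueue.get()
        print url
        out.write(url + '\n')
        getURLList(url, parser)
finally:
    parser.close()
    out.close()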
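
`sgmllib` was removed in Python 3, and `Queue` and `urllib.urlopen` now live at `queue` and `urllib.request.urlopen`, so the listing above runs only under Python 2. As a rough sketch of how the link-extraction step might look in Python 3, the parser below uses the standard `html.parser.HTMLParser`. The class name `LinkParser` and the sample HTML are made up for illustration. One deliberate difference: it keeps only hrefs that start with `http://`, whereas the original accepts `http://` anywhere in the value.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect absolute http:// links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value and value.startswith("http://"):
                self.links.append(value)


# Hypothetical sample page: two absolute links and one relative link.
sample = (
    '<a href="http://example.com/a">A</a>'
    '<a href="/relative">skipped</a>'
    '<a href="http://example.com/b">B</a>'
)
parser = LinkParser()
parser.feed(sample)
print(parser.links)
```

In a real crawl the HTML would come from something like `urllib.request.urlopen(url).read()`, decoded to text, and that call is where the try/except would go.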
