Python crawler (full of errors)
2014-02-13 00:00
I'm not very good at writing Python. Gave it a try today.
#!/usr/bin/python
# vim: set fileencoding=utf-8:
import urllib2
import re
import sqlite3
import hashlib
import random
from BeautifulSoup import BeautifulSoup


class SpriderUrl:
    # initialise with the start URL and the domain to stay on
    def __init__(self, url, domain_name):
        self.url = url
        self.domain_name = domain_name

    # crawl same-domain links breadth-first, recording each URL in SQLite
    def getUrl(self):
        # random database name so each run gets its own file
        md5_str = hashlib.md5(str(random.randint(1, 100000)) + "aa")
        print "data_name:" + md5_str.hexdigest()
        con = sqlite3.connect(md5_str.hexdigest() + ".db")
        # "interger auto_increment" is not valid SQLite; use
        # INTEGER PRIMARY KEY AUTOINCREMENT, and make url UNIQUE
        con.execute("""create table url_data(
                           id integer primary key autoincrement,
                           url text not null unique)""")
        # only follow absolute links on the target domain
        pattern = re.compile(r'(.*)://' + re.escape(self.domain_name))
        queue = [self.url]
        con.execute("insert into url_data(url) values(?)", (self.url,))
        con.commit()
        # pop from the queue so the loop terminates (the original version
        # iterated over a growing list inside "while len(urls)>0" and
        # never finished)
        while queue:
            url = queue.pop(0)
            try:
                body_text = urllib2.urlopen(url).read()
            except (urllib2.URLError, IOError):
                continue  # skip pages that fail to load
            soup = BeautifulSoup(body_text)
            for link in soup.findAll('a'):
                href = link.get('href')
                # <a> tags without href return None, which would crash re.match
                if href is None or not pattern.match(href):
                    continue
                # parameterized queries avoid quoting bugs and SQL injection
                cur = con.execute("select 1 from url_data where url=?", (href,))
                if cur.fetchone() is None:
                    queue.append(href)
                    con.execute("insert into url_data(url) values(?)", (href,))
            con.commit()
        print "Done"


t = SpriderUrl('http://www.baidu.com/', "www.baidu.com")
t.getUrl()
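The core bookkeeping of the crawler above is a FIFO queue of pages still to visit plus a SQLite table acting as the visited set, so cycles between pages don't loop forever. That logic can be sketched without touching the network. This is a minimal, standalone sketch (Python 3, in-memory SQLite); `fetch_links` is a hypothetical stand-in for the urllib2/BeautifulSoup fetching step:

```python
import sqlite3
from collections import deque

def crawl(start_url, fetch_links):
    """Breadth-first crawl: fetch_links(url) returns the same-domain
    links found on that page (a stand-in for the real HTTP fetch)."""
    con = sqlite3.connect(":memory:")
    con.execute("create table url_data("
                "id integer primary key autoincrement,"
                "url text not null unique)")
    queue = deque([start_url])
    con.execute("insert into url_data(url) values(?)", (start_url,))
    while queue:
        url = queue.popleft()
        for href in fetch_links(url):
            # parameterized query: no quoting problems, no SQL injection
            cur = con.execute("select 1 from url_data where url=?", (href,))
            if cur.fetchone() is None:  # not seen before
                queue.append(href)
                con.execute("insert into url_data(url) values(?)", (href,))
    con.commit()
    return [row[0] for row in
            con.execute("select url from url_data order by id")]

# tiny fake site with a cycle: a -> b, c; b -> c; c -> a
site = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(crawl("a", lambda u: site.get(u, [])))  # → ['a', 'b', 'c']
```

Because every URL is inserted into the table the moment it is queued, the cycle `c -> a` is detected and each page is visited exactly once.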