
A Simple Crawler in Python, Part 02: Using BeautifulSoup

2014-03-03 09:52

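This section shows how to use BeautifulSoup, an HTML/XML parser written in Python, to extract the required content from the result files generated in the previous section.

For an introduction to BeautifulSoup and instructions for downloading and installing it, see Huang Cong's blog.
The Chinese-language BeautifulSoup documentation is also a useful reference.

Here is the code first: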

from bs4 import BeautifulSoup

txt='D:\\result1.html'
f = open(txt, "r")
html=f.read()
f.close()

bs=BeautifulSoup(html, 'html.parser')  # name the parser explicitly
gvtitle=bs.find_all('div',attrs={'class':'gvtitle'})
for title in gvtitle:
    print(title.a.text)  # text of the first <a> inside each div

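Line 1 imports BeautifulSoup.
Lines 3 to 6 handle the file: they read the contents of result1.html into the variable html.
Lines 8 and 9 parse html with BeautifulSoup and use the find_all method to fetch the required data objects.
Lines 10 and 11 print all of the "product name" values.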

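A closer look at the code above:
First, the initial requirement is to get the product name shown in Figure 2-1, namely "New Fashion Men Slim Fit Cotton V-Neck Short Sleeve Casual T-Shirt Tops".

Figure 2-1

Next, inspecting the source of result1.html shows that the names of the 48 products all sit inside 48 div elements whose class is gvtitle, as shown in Figure 2-2. So find_all is used to look up every div whose class attribute is gvtitle, and the results are stored as a list in the variable gvtitle.

Figure 2-2

Then, a closer look at the HTML structure in Figure 2-2 shows that the text "New Fashion Men Slim Fit Cotton V-Neck Short Sleeve Casual T-Shirt Tops" sits in the a tag inside the div tag, so lines 10 and 11 of the code loop over the list and use .a.text to get and print it. The output of the program is shown in Figure 2-3.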

Figure 2-3
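
To make the find_all and .a.text steps easier to try without the original file, here is a minimal sketch that runs on an inline HTML string. The markup in SAMPLE_HTML and the helper name extract_titles are assumptions made for illustration; they only imitate the structure described above and are not copied from result1.html.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above;
# the real result1.html is much larger and may differ in detail.
SAMPLE_HTML = """
<div class="gvtitle"><h3><a href="#1">New Fashion Men Slim Fit T-Shirt</a></h3></div>
<div class="gvprices"><span class="amt">$5.99</span></div>
<div class="gvtitle"><h3><a href="#2">Cotton V-Neck Casual Tops</a></h3></div>
"""


def extract_titles(html):
    """Return the link text of every div whose class is gvtitle."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", attrs={"class": "gvtitle"})
    # div.a is the first <a> descendant of the div, or None if there is none.
    return [div.a.text.strip() for div in divs if div.a is not None]


titles = extract_titles(SAMPLE_HTML)
for title in titles:
    print(title)
```

The check for None is a small safety measure: a div without a link is skipped instead of raising AttributeError.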

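Next, modify the code to also get each product's price. This part uses regular expressions; for background, see the tutorial "正则表达式入门教程" (an introductory guide to regular expressions).

The code is as follows: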

import re
from bs4 import BeautifulSoup

def getItems(html):
    pattern = re.compile(r'\d+\.\d+')  # escape the dot so it matches only a decimal point
    items = re.findall(pattern,html)
    return items

txt='D:\\result1.html'
f = open(txt, "r")
html=f.read()
f.close()

bs=BeautifulSoup(html, 'html.parser')
gvtitle=bs.find_all('div',attrs={'class':'gvtitle'})
for title in gvtitle:
    print(title.a.text)

prices=bs.find_all('span',attrs={'class':'amt'})
for pri in prices:
    result=getItems(str(pri))
    print(result)

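Line 1 imports the regular expression module.
Lines 4 to 7 define a function that uses a regular expression to further filter the data returned by BeautifulSoup.
Line 19 uses find_all to get the prices. However, the source of result1.html shows that the data in the span elements with class amt is less regular than the data fetched on line 15, so line 21 calls getItems to filter it further and extract the price.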
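
The filtering step can be tried on its own with a short sketch. The two span strings below are hypothetical stand-ins for what str(pri) might look like, and get_prices, PRICE_RE and LOOSE_RE are names chosen for this example. It also shows why the dot in the pattern should be escaped: an unescaped dot matches any character, not just a decimal point.

```python
import re

# Escaped dot: matches digits, a literal decimal point, then digits.
PRICE_RE = re.compile(r"\d+\.\d+")
# Unescaped dot: the middle character can be anything, including a space.
LOOSE_RE = re.compile(r"\d+.\d+")


def get_prices(fragment):
    """Return every decimal number found in an HTML fragment, as strings."""
    return PRICE_RE.findall(fragment)


# Hypothetical fragments; the real spans in result1.html may look different.
single = '<span class="amt">$5.99</span>'
ranged = '<span class="amt">$4.99 to $9.99</span>'

print(get_prices(single))                  # ['5.99']
print(get_prices(ranged))                  # ['4.99', '9.99']
print(LOOSE_RE.findall("item 12 34"))      # ['12 34'], a false match
print(PRICE_RE.findall("item 12 34"))      # []
```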

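The output is shown in Figure 2-4. Some products return two prices: these products have several sub-products, and the two values are the lowest and the highest price, as shown in Figure 2-5.
To keep the result set simple, the average of the two values is used as the final value.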
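
The averaging rule can be sketched as a small helper. The name to_single_price is an assumption for this example. Unlike the two-value case described above, it averages however many values it receives and returns None for an empty list, which is a design choice of this sketch rather than something the article specifies.

```python
def to_single_price(matches):
    """Collapse a list of price strings into one float.

    One value is returned unchanged (as a float); several values,
    such as a lowest and a highest price, are averaged.
    Returns None when no price was found.
    """
    values = [float(m) for m in matches]
    if not values:
        return None
    return sum(values) / len(values)


print(to_single_price(["5.99"]))         # 5.99
print(to_single_price(["4.5", "9.5"]))   # 7.0
print(to_single_price([]))               # None
```

Returning a float in every case also keeps the price list uniform, so later code does not have to handle a mix of strings and numbers.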

Figure 2-4

Figure 2-5

Revise the code once more; the final version is:

import re
import sys
from bs4 import BeautifulSoup

def getItems(html):
    pattern = re.compile(r'\d+\.\d+')  # escape the dot so it matches only a decimal point
    items = re.findall(pattern,html)
    return items

p=0
while p<5:  # result1.html to result5.html
    print(' =='+str(p+1)+'==start==')
    txt='D:\\result'+str(p+1)+'.html'
    fr = open(txt, "r")
    html=fr.read()
    bs=BeautifulSoup(html, 'html.parser')
    gvtitle=bs.find_all('div',attrs={'class':'gvtitle'})

    pri_list=[]
    prices=bs.find_all('span',attrs={'class':'amt'})
    for pri in prices:
        res=getItems(str(pri))
        if len(res)==2:  # lowest and highest price: use the average
            val=(float(res[0])+float(res[1]))/2
            pri_list.append(val)
        else:
            pri_list.append(res[0])

    j=0
    while j<48:  # 48 products per page
        try:
            print('%s %s' % (gvtitle[j].a.text, pri_list[j]))
        except UnicodeEncodeError:
            for s in gvtitle[j].a.text:  # write the name one character at a time
                try:
                    sys.stdout.write(s)
                except UnicodeEncodeError:
                    continue  # skip characters the console cannot encode
            sys.stdout.flush()
            print(pri_list[j])
        j=j+1
    print(' =='+str(p+1)+'====end==')
    p=p+1
    fr.close()

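The odd-looking output code on lines 31 to 40 is there to drop the "™" symbol that sometimes appears in product names. I have not yet thought of a better way; suggestions are very welcome.

That's all for this section. Thanks.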
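
One possible alternative, offered only as a sketch, is to drop every character the target encoding cannot represent in a single step, using the errors="ignore" option of str.encode. The helper name strip_unencodable and the fallback to "ascii" when the console encoding is unknown are assumptions of this example, not part of the original program.

```python
import sys


def strip_unencodable(text, encoding=None):
    """Remove characters that cannot be encoded in the given encoding.

    When no encoding is passed, the console encoding is used, falling
    back to "ascii" if it is not available (an assumption of this sketch).
    """
    if encoding is None:
        encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    # errors="ignore" silently drops every character the codec rejects.
    return text.encode(encoding, errors="ignore").decode(encoding)


print(strip_unencodable(u"Slim Fit\u2122 T-Shirt", "ascii"))  # Slim Fit T-Shirt
```

With a helper like this, the character-by-character loop could be replaced by a single print of the cleaned name followed by the price. Note that it removes any unencodable character, not only "™".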
