
[Repost] A very slick scraping package for Python: Beautiful Soup examples

2014-12-06 15:13

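Original article: http://blog.csdn.net/watsy/article/details/14161201

First, the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

If you have time, it is worth reading the package documentation.

Beautiful Soup has one important advantage over other HTML parsing approaches: the HTML is broken down into objects. The whole document becomes a tree whose tags behave like dictionaries (for attributes) and lists (for child nodes).

Compared with scrapers built on regular expressions, you avoid the high cost of learning regular expressions.

Compared with XPath-based parsing, it also saves learning time, even though XPath is already somewhat simpler. (The Scrapy crawling framework uses XPath.)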

Installation

On Debian or Ubuntu Linux you can run the following (the Python 3 package is named python3-bs4)

apt-get install python-bs4

You can also install it with Python's package tools

easy_install beautifulsoup4

pip install beautifulsoup4
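A quick way to confirm the installation worked is to import the package and parse a trivial document. This is only a sketch: the one-line test document is made up for illustration, and "html.parser" is the parser bundled with Python, chosen here so that no extra parser has to be installed.

```python
# Confirm that the bs4 package is importable and report its version.
import bs4
from bs4 import BeautifulSoup

version = bs4.__version__
print("Beautiful Soup version:", version)

# Parse a made-up one-line document with Python's built-in parser.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
parsed_text = str(soup.p.string)
print(parsed_text)
```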

Usage overview

The following describes how to use BeautifulSoup.

Parsing HTML is about extracting data. It mainly comes down to a few points:

1: Get the content of a given tag.

hello, watsy

hello, beautiful soup.

2: Get an attribute of a given tag.

watsy's blog

3: Both of these depend on the search methods, which locate the tags in the first place.
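The three points above can be sketched in a few lines. The markup here is hypothetical: the p and a tags and the example.com URL are placeholders chosen for illustration, not the markup from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; tag names and the URL are placeholders.
snippet = """
<p>hello, watsy</p>
<p>hello, beautiful soup.</p>
<a href="http://example.com/blog">watsy's blog</a>
"""

soup = BeautifulSoup(snippet, "html.parser")

# 1: the content of a given tag
paragraphs = [str(p.string) for p in soup.find_all("p")]

# 2: an attribute of a given tag
link_target = soup.a["href"]

# 3: searching is what locates the tag before reading it
link_text = str(soup.find("a").string)

print(paragraphs, link_target, link_text)
```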

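The examples below use the sample document from the official documentation.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""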

Pretty-printed output.

from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.prettify())

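# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>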

Getting the content of a given tag

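soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u"The Dormouse's story"

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>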

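The example above shows four things

1: Get a tag

soup.title

2: Get the tag's name

soup.title.name

3: Get the content of the title tag

soup.title.string

4: Get the name of the title tag's parent

soup.title.parent.name

As you can see, the interface is very object-oriented.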
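Two behaviours of this attribute-style access are easy to trip over, so here is a small sketch of them. It assumes a shortened copy of the sample document; the lookup of a table tag is only there to show what happens when a tag is absent.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, used only for this sketch.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><p class="title"><b>The Dormouse\'s story</b></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

title = soup.title
tag_name = title.name              # name of the tag itself
text = str(title.string)           # text held directly by the tag
parent_name = title.parent.name    # name of the enclosing tag

# Attribute access returns None when no such tag exists.
missing = soup.table

# .string is None when a tag has more than one child (here: head and body).
html_string = soup.html.string

print(tag_name, text, parent_name, missing, html_string)
```

Because a missing tag yields None, chained access such as soup.table.tr would raise AttributeError, so checking for None first is a reasonable habit.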

Extracting tag attributes

Next, how to extract attributes such as href.

soup.p['class']
# [u'title']  (class is a multi-valued attribute, so the value is a list)

To get an attribute, index the tag with the attribute name, where tag stands for the tag you want:

soup.tag['attribute name']

watsy's blog

The most common case is extracting the target of a link like the one above.

The code is

soup.a['href']

Pretty easy, right?
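Indexing with square brackets raises KeyError when the attribute is missing, which is common with real pages. The sketch below shows the alternatives; the markup is invented for illustration, and the second link deliberately has no href.

```python
from bs4 import BeautifulSoup

# Invented markup: the second link has no href on purpose.
html = (
    '<p class="story main">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a
href = first_link["href"]        # raises KeyError if the attribute is absent
all_attrs = first_link.attrs     # every attribute as a dict
classes = soup.p["class"]        # multi-valued attribute, returned as a list

second_link = soup.find_all("a")[1]
safe_href = second_link.get("href")   # None instead of KeyError

print(href, all_attrs, classes, safe_href)
```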

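Searching and filtering

Now for the important part: searching the whole document and extracting what matches.

soup provides find and find_all for searching. Internally, find is implemented by calling find_all with a limit of 1, so only find_all is covered here.

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):

Look at the parameters.

The first is the tag name and the second is a dictionary of attributes. The third, recursive, controls whether all descendants are searched or only direct children. text matches against string content (Beautiful Soup 4.4.0 and later also accept this argument under the name string). limit caps the number of results. **kwargs lets you pass attribute filters as keyword arguments.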
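To make the parameters concrete, here is a minimal sketch that exercises each of them on a small nested document. The markup, ids, and class names are made up for this example.

```python
from bs4 import BeautifulSoup

# Made-up nested markup for exercising the find_all parameters.
html = """
<div id="outer">
  <p class="story">first</p>
  <div id="inner"><p class="story">second</p></div>
  <p class="note">third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# name: every <p> below outer, at any depth
all_p = outer.find_all("p")

# recursive=False: only direct children of outer
direct_p = outer.find_all("p", recursive=False)

# attrs: filter by attribute values given as a dict
by_attrs = soup.find_all("p", attrs={"class": "story"})

# limit: stop after the first match
limited = soup.find_all("p", limit=1)

# **kwargs: attribute filter passed as a keyword argument
by_kwarg = soup.find_all(id="inner")

# find returns the first match directly instead of a list
first_only = soup.find("p")

print(len(all_p), [str(p.string) for p in direct_p])
```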

Examples.

# Tag name
soup.find_all('b')
# [<b>The Dormouse's story</b>]

# Regular expression
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

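# List of tag names
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Function used as a filter
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]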

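# Tag name plus CSS class
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

# Filter by tag
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Filter by attribute
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Filter text with a regular expression
soup.find(text=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'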
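The filters above can also be applied to attribute values, not only to tag names. The following sketch combines a class filter, a regular expression on href, and a custom function. The markup is a modified version of the sample (one link is made relative on purpose), and is_odd_link is a hypothetical helper written for this example.

```python
import re
from bs4 import BeautifulSoup

# Modified sample markup: the third link is relative on purpose.
html = """
<p class="title"><b>Story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="/local/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword filter is spelled class_
sisters = soup.find_all("a", class_="sister")

# A compiled regular expression is matched against the attribute value
absolute = soup.find_all("a", href=re.compile(r"^http://"))


def is_odd_link(tag):
    """Hypothetical filter: <a> tags whose id ends in 1 or 3."""
    return tag.name == "a" and tag.get("id", "").endswith(("1", "3"))


odd = soup.find_all(is_odd_link)

print(len(sisters), [a["id"] for a in absolute], [str(a.string) for a in odd])
```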

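Getting content and strings

Getting a tag's string

title_tag = soup.title
title_tag.string
# u"The Dormouse's story"

Note that in practice you should call unicode(title_tag.string) (str(title_tag.string) on Python 3) to convert the result into a plain string object. The value returned by .string is a NavigableString, which keeps a reference to the whole parse tree.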
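The difference between the two kinds of string can be checked directly. This is a small sketch written for Python 3, where the conversion is spelled str rather than unicode; the one-tag document is only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

raw = soup.title.string   # NavigableString, still attached to the tree
plain = str(raw)          # plain str with no link back to the tree

is_navigable = isinstance(raw, NavigableString)
keeps_tree_link = raw.parent is soup.title
is_plain_str = type(plain) is str

print(is_navigable, keeps_tree_link, is_plain_str)
```

Holding on to the plain copy instead of the NavigableString lets the parsed tree be garbage-collected once the soup itself is no longer referenced.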

The strings attribute returns an iterator over all the text content beneath a tag, here the whole soup.

for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
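As the output shows, strings yields many whitespace-only pieces. When only the visible text matters, stripped_strings and get_text are usually more convenient. The sketch below uses a simplified one-paragraph version of the sample, without attributes.

```python
from bs4 import BeautifulSoup

# Simplified version of the sample paragraph.
html = """
<p class="story">Once upon a time there were
<a>Elsie</a>,
<a>Lacie</a> and
<a>Tillie</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Every text node, including ones that are only whitespace
raw_pieces = list(soup.strings)

# Same nodes, stripped, with whitespace-only nodes dropped
clean_pieces = list(soup.stripped_strings)

# All text joined into one string, separated by single spaces
flat_text = soup.get_text(" ", strip=True)

print(clean_pieces)
print(flat_text)
```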

Getting contents

.contents returns a tag's direct children as a list.

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>

title_tag.contents
# [u"The Dormouse's story"]
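Besides .contents, a tag also exposes .children and .descendants. The sketch below contrasts the three on the head of the sample document: .contents and .children cover direct children only, while .descendants walks the whole subtree.

```python
from bs4 import BeautifulSoup

html = "<head><title>The Dormouse's story</title></head>"
soup = BeautifulSoup(html, "html.parser")
head_tag = soup.head

# Direct children as a list
direct = head_tag.contents

# Direct children as an iterator
child_names = [child.name for child in head_tag.children]

# Every node below head: the title tag, then the string inside it
descendants = list(head_tag.descendants)

print(len(direct), child_names, len(descendants))
```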

That should cover the essentials. Everything else can be learned from the documentation.

Summary

In practice, most usage comes down to

soup = BeautifulSoup(data, "html.parser")

soup.title

soup.p['title']

divs = soup.find_all('div', content='tpc_content')

divs[0].contents[0].string
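Put together, the summary might look like the sketch below when run end to end. The page content is entirely hypothetical. The original filters on an attribute named content; this sketch assumes tpc_content is a CSS class and uses class_ instead, so adjust the filter to whatever your page actually uses.

```python
from bs4 import BeautifulSoup

# Hypothetical page; the structure and the tpc_content class are assumptions.
data = """
<html><head><title>Thread</title></head>
<body>
<p title="intro">Welcome</p>
<div class="tpc_content"><span>First post</span> and more</div>
<div class="tpc_content"><span>Second post</span></div>
</body></html>
"""
soup = BeautifulSoup(data, "html.parser")

page_title = str(soup.title.string)     # text of the <title> tag
p_title = soup.p["title"]               # title attribute of the first <p>

# Assumption: tpc_content is a class name, so filter with class_
divs = soup.find_all("div", class_="tpc_content")

# First child of the first matching div, converted to a plain string
first_text = str(divs[0].contents[0].string)

print(page_title, p_title, len(divs), first_text)
```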
