
Web Crawling: Getting Started with BeautifulSoup (Part 4)

2016-12-18 20:52, 232 views

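5. The find method with more parameters

The official documentation gives the signature find( name , attrs , recursive , string , **kwargs ). Its parameters are the same as those of find_all except that find takes no limit argument, so common usage is again shown by example:

The two methods are used in much the same way. The two calls below locate the same element, but find is clearly the more convenient choice when only the first match is needed.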

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>
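The remaining parameters of find can be illustrated with a small sketch. The HTML snippet below is an assumption made up for this example (a shortened version of the "Dormouse" document commonly used with BeautifulSoup), and the variable names are illustrative only:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword argument: filter on the CSS class (class_ avoids the reserved word).
by_class = soup.find("p", class_="story")

# attrs: filter on arbitrary attributes passed as a dictionary.
by_attrs = soup.find("a", attrs={"id": "link2"})

# string: match on the text contained in the tag.
by_string = soup.find("a", string="Elsie")

# recursive=False: search only the direct children of the tag.
direct_title = soup.html.find("title", recursive=False)  # <title> sits under <head>
direct_head = soup.html.find("head", recursive=False)

print(by_class["class"], by_attrs["href"], by_string["id"])
print(direct_title, direct_head.name)
```

Since title is nested inside head, the non-recursive search from the html tag does not find it, while the same search for head succeeds.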

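Pay close attention to the difference in return values: find_all returns a list, while find returns the matching element directly. When nothing matches, find_all returns an empty list, whereas find returns None.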
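Because find may return None, calling a method on its result without a check raises AttributeError. A minimal sketch of a defensive pattern follows; the helper name first_text and its default argument are hypothetical, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

missing_all = soup.find_all("table")  # no match: empty list
missing_one = soup.find("table")      # no match: None


def first_text(soup, name, default=""):
    """Return the text of the first <name> tag, or default if none exists."""
    tag = soup.find(name)
    return tag.get_text() if tag is not None else default


title_text = first_text(soup, "title")
table_text = first_text(soup, "table", default="(no table)")

print(missing_all, missing_one)
print(title_text, table_text)
```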

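6. Output formatting and encoding

- The prettify method formats a BeautifulSoup object as indented output, with each tag and string on its own line, which is very useful in large projects. It can also be called on a single tag node, as follows: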

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')  # name the parser explicitly to avoid a warning

print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>

If you only need the resulting string and do not care about formatting, pass the object to the built-in str() function, as follows:

str(soup.a)
# '<a href="http://example.com/">I linked to <i>example.com</i></a>'
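As for encoding, str() returns a text string, while encode() and prettify() with an encoding argument return bytes. The sketch below reuses the markup from above; the choice of UTF-8 and the variable names are assumptions for illustration:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

as_text = str(soup.a)                              # unformatted text (str)
as_bytes = soup.a.encode("utf-8")                  # unformatted, encoded (bytes)
pretty_bytes = soup.a.prettify(encoding="utf-8")   # indented, encoded (bytes)

print(type(as_text).__name__, type(as_bytes).__name__, type(pretty_bytes).__name__)
```

The bytes form is the one to use when writing the result to a file opened in binary mode.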

7. get_text()

To obtain only the text contained in a tag, including the text of all its descendants, use the get_text() method, as follows:

soup.get_text()
# 'I linked to example.com'
soup.i.get_text()
# 'example.com'
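get_text() also accepts a separator and a strip flag, which help when the markup contains line breaks. The sketch below assumes a variant of the markup with newlines inside the tag, chosen only to make the whitespace visible:

```python
from bs4 import BeautifulSoup

# Same link as before, but with newlines inside the <a> tag (illustrative).
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                  # whitespace is kept as-is
joined = soup.get_text("|", strip=True)  # strip each piece, join with "|"
pieces = list(soup.stripped_strings)     # the stripped pieces as a list

print(repr(plain))   # '\nI linked to example.com\n'
print(joined)        # I linked to|example.com
print(pieces)        # ['I linked to', 'example.com']
```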

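8. Practice

Source code for a practice project: Web Page Table Scraping

Overview: the crawler part of the project mainly uses get_text, find, find_all, and prettify to extract, store, and display the tables of a web page at a given URL.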
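The project's source is not reproduced here, so the following is only a minimal sketch of how find, find_all, and get_text might be combined to pull a table into rows. The sample HTML, the function name extract_table, and the row format are all assumptions for illustration, not the project's actual code; a real crawler would first download the page from the given URL.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded web page.
html = """
<table id="scores">
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>90</td></tr>
  <tr><td>Bob</td><td>85</td></tr>
</table>
"""


def extract_table(html_text):
    """Return the first table in html_text as a list of rows of cell text."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table")
    if table is None:  # find returns None when there is no table
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header and data cells are collected in document order.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = extract_table(html)
for row in rows:
    print(row)
```

Returning plain lists keeps the sketch independent of any storage or display layer; the rows could then be written to CSV or rendered as needed.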
