
Notes on commonly used functions for navigating and searching the document tree

2019-07-11 12:05

Navigating the document tree

In BeautifulSoup, a Tag may contain several strings and other tags. These are called the tag's child nodes.

1. Child nodes

The contents attribute returns a tag's direct children as a list:

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('G:/MyPoem.html'),"html.parser")
print soup.head.contents

[u'\n', <title>BeautifulSoup\u6280\u672f\u5b66\u4e60</title>, u'\n']

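Inside <head>, the <title> element is preceded and followed by a newline, so the list has three items. To extract the second item (the <title> tag itself), index into the list: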

print soup.head.contents[1]

<title>BeautifulSoup技术学习</title>

To get the text inside the tag:

>>> print soup.title.contents[0]
BeautifulSoup技术学习
>>>

Another way to get child nodes is the children attribute. It returns an iterator rather than a list, so you loop over it to reach each child:

>>> print soup.head.children
<listiterator object at 0x00000000035E3828>
>>>

>>> for child in soup.head.children:
...     print child

Output:

<title>BeautifulSoup技术学习</title>
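As a self-contained check of contents versus children, here is a minimal sketch in Python 3 (the sessions in this post are Python 2). The inline HTML string is a hypothetical stand-in for MyPoem.html, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for the post's MyPoem.html
html = "<html><head>\n<title>BeautifulSoup demo</title>\n</head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# contents is a plain list: newline string, <title> tag, newline string
head_contents = soup.head.contents
print(head_contents)

# children iterates over the same nodes; keep only the Tag objects
child_tags = [child for child in soup.head.children if isinstance(child, Tag)]
print(child_tags)

# The first child of <title> is its text
title_text = soup.title.contents[0]
print(title_text)
```

Filtering with isinstance(child, Tag) is one simple way to skip the whitespace strings that sit between tags.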

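The contents and children attributes cover only a tag's direct children. To walk every node below a Tag (children, grandchildren, and so on), use the descendants attribute. The loop below prints every node in the document; under Python 2 each node is converted with unicode() before printing: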

for c in soup.descendants:
    print unicode(c)
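The difference between direct children and descendants can be seen on a small fragment. This is a Python 3 sketch with a made-up HTML snippet, not the post's own file.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Made-up fragment: <div> has one child (<p>), which has nested content
html = "<div><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# children stops at the first level
direct_children = [child.name for child in div.children]
print(direct_children)

# descendants walks the whole subtree in document order
descendant_tags = [n.name for n in div.descendants if isinstance(n, Tag)]
descendant_strings = [str(n) for n in div.descendants if isinstance(n, NavigableString)]
print(descendant_tags)
print(descendant_strings)
```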

2. Node content

If a tag has exactly one child and you want that child's content, use the string attribute. It usually returns the innermost text of the tag:

>>> print (soup.head.string)
None

>>> print (soup.title.string)
BeautifulSoup技术学习
>>>

When a tag has several children, it cannot tell which child string should refer to, so the result is None. To get the text of all the children, use the strings attribute:

for content in soup.strings:
    print content

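The strings produced this way may include extra spaces and newlines. Use the stripped_strings attribute instead to drop whitespace-only strings and trim the rest: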

for content in soup.stripped_strings:
    print content

[Figure: output of the stripped_strings loop]
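The three attributes string, strings, and stripped_strings can be compared side by side. The sketch below is Python 3 and uses an invented HTML string in place of the post's file.

```python
from bs4 import BeautifulSoup

# Invented document with whitespace between and inside tags
html = (
    "<html><head>\n<title>BeautifulSoup demo</title>\n</head>"
    "<body><p> first </p>\n<p>second</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# <title> has a single string child, so string returns it
title_string = soup.title.string
# <head> has three children (newline, <title>, newline), so string is None
head_string = soup.head.string

# strings keeps whitespace; stripped_strings trims and skips blank strings
all_strings = list(soup.strings)
clean_strings = list(soup.stripped_strings)
print(all_strings)
print(clean_strings)
```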

3. Parent node

Use the parent attribute to reach a node's parent. To get the parent's tag name, use parent.name. Code and output:

>>> content = soup.head.title.string
>>> print content.parent
<title>BeautifulSoup技术学习</title>
>>> print content.parent.name
title
>>>

To get every ancestor of a node, loop over the parents attribute:

content = soup.head.title.string
for parent in content.parents:
    print parent.name

Output:

title
head
html
[document]
>>>
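The same walk up the tree, written as a Python 3 sketch against a small inline document (a stand-in for the post's file):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>BeautifulSoup demo</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

text = soup.head.title.string

# parent is the enclosing <title> tag
direct_parent = text.parent.name

# parents climbs to the top; the BeautifulSoup object itself is named "[document]"
ancestor_names = [parent.name for parent in text.parents]
print(direct_parent)
print(ancestor_names)
```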

4. Sibling nodes

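Sibling nodes are nodes at the same level as the current node. The next_sibling attribute returns the following sibling and previous_sibling returns the preceding one. Each returns None when no such node exists.
[Figure: code and output for next_sibling and previous_sibling]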

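Note: in real documents, a Tag's next_sibling and previous_sibling are often whitespace-only strings, because the spaces and newlines between tags are nodes too. Likewise, the next_siblings and previous_siblings attributes iterate over all following or all preceding siblings of the current node, which you can then loop over.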
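The whitespace behaviour described in the note is easy to reproduce. This Python 3 sketch uses an invented fragment with a newline between two links; find_next_sibling is shown as one way to skip straight to the next tag.

```python
from bs4 import BeautifulSoup, Tag

# Invented fragment: two <a> tags separated by a newline
html = "<p><a id='first'>one</a>\n<a id='second'>two</a></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a  # the first <a> in the document

# The immediate sibling is the newline string, not the second <a>
immediate_next = first.next_sibling
# Nothing comes before the first <a> inside <p>
immediate_previous = first.previous_sibling

# Iterate over all following siblings and keep only tags
sibling_tags = [s for s in first.next_siblings if isinstance(s, Tag)]

# find_next_sibling skips strings and returns the next matching tag
next_link = first.find_next_sibling("a")
print(repr(immediate_next), immediate_previous, sibling_tags, next_link)
```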

5. Next and previous elements

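The next_element attribute returns the node that was parsed immediately after the current one, and previous_element returns the node parsed immediately before it.
[Figure: code and output for next_element and previous_element]
Likewise, the next_elements and previous_elements attributes iterate over all following or all preceding nodes, which you can then loop over.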
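next_element differs from next_sibling because it follows parse order and descends into the current tag first. A Python 3 sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Made-up fragment: <b> and <i> are siblings inside <p>
html = "<p><b>bold</b><i>italic</i></p>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

# The next sibling is the <i> tag
next_sibling = b.next_sibling
# The next element is the string inside <b>, because it was parsed right after <b> opened
next_element = b.next_element
# The element parsed just before <b> is its parent <p>
previous_element_name = b.previous_element.name

# next_elements continues in parse order until the end of the document
following = [str(e) for e in b.next_elements]
print(next_sibling, next_element, previous_element_name, following)
```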

Searching the document tree

For searching the document tree, these notes focus on find_all(), the most commonly used method. To get every <a> tag in a page, call find_all() like this:

>>> url = soup.find_all('a')
>>> for a in url:
...     print a
...

<a class="gufe" href="http://www.gufe.edu.cn/www/" id="school">贵州财经大学</a>
<a href="https://gzszfzx.30edu.com.cn/">贞丰中学</a>
>>>

To get both the <a> tags and the <b> tags in one call, pass a list of names:

soup.find_all(["a","b"])
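A Python 3 sketch of both calls. The two <a> tags are copied from the output above; the surrounding <p> and the <b> tag are added here only for illustration.

```python
from bs4 import BeautifulSoup

# The <a> tags come from the post's output; <p> and <b> are illustrative additions
html = (
    '<p><a class="gufe" href="http://www.gufe.edu.cn/www/" id="school">贵州财经大学</a> '
    '<b>bold</b> '
    '<a href="https://gzszfzx.30edu.com.cn/">贞丰中学</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# A single name matches every tag with that name
links = soup.find_all("a")
hrefs = [a["href"] for a in links]

# A list of names matches any of them, in document order
a_and_b = [tag.name for tag in soup.find_all(["a", "b"])]
print(hrefs)
print(a_and_b)
```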

find_all() also accepts keyword arguments that restrict the search to nodes with matching attribute values.

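Note: this keyword form works directly for id, but writing class=... is a syntax error, because class is a reserved word in Python.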

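Filtering by class is still possible through the class_ keyword, which returns the tags whose class attribute matches the given value. In this HTML document only one <a> tag has a class attribute, so only one result comes back.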

Pay special attention: the keyword is class followed by an underscore (class_).
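The attribute filters can be sketched in Python 3 as follows. The two <a> tags are taken from the find_all output earlier in this post; the wrapping <p> is an assumption, and the attrs dictionary is shown as an alternative spelling of the class filter.

```python
from bs4 import BeautifulSoup

# <a> tags from the post's output, wrapped in an assumed <p>
html = (
    '<p><a class="gufe" href="http://www.gufe.edu.cn/www/" id="school">贵州财经大学</a>'
    '<a href="https://gzszfzx.30edu.com.cn/">贞丰中学</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# id can be passed directly as a keyword argument
by_id = soup.find_all(id="school")

# class is reserved in Python, so the keyword is class_
by_class = soup.find_all("a", class_="gufe")

# The same filter expressed with an attrs dictionary
by_attrs = soup.find_all("a", attrs={"class": "gufe"})
print(by_id, by_class, by_attrs)
```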

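Textbook: 《Python网络数据爬取及分析从入门到精通（爬取篇）》 (Python Web Data Crawling and Analysis, from Beginner to Expert: Crawling Volume), by 杨秀璋 (Yang Xiuzhang) and 颜娜 (Yan Na).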
