
[Python Web Scraping Series] Parsing Methods <4>: Learning the BeautifulSoup Library

2018-02-04 14:42

Parsers supported by the BeautifulSoup library:

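Parser | How to call it | Characteristics
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents; requires a C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML; requires a C library
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; very slow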
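
As a rough illustration of picking a parser from the table above, the sketch below tries several parsers in order and falls back to the built-in html.parser when an optional library is not installed. The helper name make_soup and the preference order are assumptions made for this example; they are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is not installed.
            continue
    raise RuntimeError("no usable parser found")

# Deliberately broken markup: neither <p> nor <b> is closed.
soup, used = make_soup("<p>unclosed <b>bold")
print(used)            # name of the parser that was actually used
print(soup.b.string)   # the parser closed the <b> tag for us
```

Whichever parser ends up being used, the unclosed tags are repaired, which is the fault tolerance the table refers to.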
Below is an excerpt of web page markup (taken from Robinson Crusoe), processed with the lxml HTML parser:

html = """
<p><b>4 A new life on an island </b></p>
<p>When day came,the sea was quiet again. I looked for our ship and,to my surprise,it was still there and still in one piece. 'I think I can swim to it,'I said to myself. So I walked down to the sea and before long,I was at the ship and was swimming round it. But how could I get on to it?In the end,I got in through a hole in the side,but it wasn't easy.</p>
<p>There was a lot of water in the ship,but the sand under the sea was still holding the ship in one place. The back of the ship was high out of the water,and I was very tnankful for this be-cause all the ship's food was there. I was very hungry so I be-gan to eat something at once. Then I decided to take some of it back to the shore with me. But how could I get it there?</p>
<p>I looked around the ship,and after a few minutes,I found some long pieces of wood. I tied them together with rope. Then I got the things that I wanted from the ship. There was a big box of food—rice,and salted meat,and hard ship's bread. I al-so took many strong knives and other tools,the ship's sails and ropes,paper,pens,books,and seven guns. Now I needed a little sail from the ship,and then I was ready. Slowly and carefully,I went back to the shore. It was difficult to stop my things from falling into the sea,but in the end I got everything on to the shore.</p>
<p>Now I needed somewhere to keep my things.</p>
<p>There were some hills around me,so I decided to build my-self a little house on one of them. I walked to the top of the highest hill and looked down,I was very unhappy,because I saw then that I was on an island. There were two smaller is-lands a few miles away,and after that,only the sea. Just the sea,for mile after mile after mile.</p>
<p>After a time,I found a little cave in the side of a hill. In front of it,there was a good place to make a home. So,I used the ship's sails,rope,and pieces of wood,and after a lot of hard work I had a very fine tent. The cave at the back of my tent was a good place to keep my food,and so I called it my 'kitchen'. That night,I went to sleep in my new home.</p>
<p>The next day I thought about the possible dangers on the is-land. Were there wild animals,and perhaps wild people too,on my island?I didn't know,but I was very afraid. So I decided to build a very strong fence. I cut down young trees and put them in the ground,in a halfcircle around the front of my tent. I used many of the ship's ropes too,and in the end my fence was as strong as a stone wall. Nobody could get over it,through it,or round it.</p>
<p>Making tents and building fences is hard work. I needed many tools to help me. So I decided to go back to the ship again,and get some more things.</p>
<p>I went back twelve times,but soon after my twelfth visit there was another terrible storm. The next morning,when I looked out to sea,there was no ship.</p>
<p>When I saw that,I was very unhappy. 'Why am I alive,and why are all my friends dead?'I asked myself. 'What will hap-pen to me now,alone on this island without friends?How can I ever escape from it?'</p>
<p>Then I told myself that I was lucky—lucky to be alive,lucky to have food and tools,lucky to be young and strong. But I knew that my island was somewhere off the coast of South America. Ships did not often come down this coast,and I said to myself,'I'm going to be on this island for a long time. 'So,on a long piece of wood,I cut these words:</p>
<p>I CAME HERE ON 30TH SEPTEMBER 1659 After that,I decided to make a cut for each day.</p>
"""

from bs4 import BeautifulSoup as bs
soup = bs(html,'lxml')
print(soup.prettify())


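After running it, you can see that the prettify() method "beautifies" the markup: each tag is placed on its own line and indented according to its nesting depth.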
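
To see the difference on a small input, here is a minimal sketch that compares the plain string form of a document with its prettified form. It uses the built-in html.parser so that it runs without lxml installed; the shortened markup is taken from the excerpt above.

```python
from bs4 import BeautifulSoup

markup = (
    "<p><b>4 A new life on an island</b></p>"
    "<p>Now I needed somewhere to keep my things.</p>"
)
soup = BeautifulSoup(markup, "html.parser")

compact = str(soup)        # serialized without added line breaks
pretty = soup.prettify()   # line breaks and indentation added for readability

print(compact)
print(pretty)
```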

Using tag selectors:

Access a tag directly as an attribute named after the tag, for example:

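html = '...'
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
# print(type(soup.title))  # inspect the type of the object that soup.title returns
print(soup.head)
print(soup.p)  # note: when the document contains several <p> tags, this returns only the first one
print(soup.p.name)  # get the name of the tag

# To get the value of an attribute on a tag (for example "name"), there are two ways
# (both raise KeyError if the tag has no such attribute)
print(soup.p.attrs['name'])
print(soup.p['name'])

# To get the text inside a tag, use the .string attribute (it is not a method)
print(soup.p.string)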


Note: tags can contain child tags. For example, when a <p> tag sits under the <body> tag and we want to reach that <p>, we can chain the accesses level by level:

print(soup.body.p.string)  # get the text of the first <p> under <body>
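
Since the html string above is elided, here is a self-contained sketch of the same tag-style access. The sample page is made up for this example, and html.parser is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this example.
html = """
<html><head><title>Robinson Crusoe</title></head>
<body>
<p name="intro" class="lead">When day came, the sea was quiet again.</p>
<p>Now I needed <b>somewhere</b> to keep my things.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)        # text of the <title> tag
first_p = soup.body.p           # nested access; only the first <p> is returned
print(first_p.name)             # the tag name
print(first_p.attrs)            # all attributes as a dict
print(first_p["name"])          # raises KeyError if the attribute is missing
print(first_p.get("id"))        # .get() returns None instead of raising

second_p = first_p.find_next_sibling("p")
print(second_p.string)          # None, because this <p> has more than one child node
print(second_p.get_text())      # concatenated text of all descendants
```

When a tag mixes text and child tags, .string is None, so get_text() is usually the safer way to read its text.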

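To find out what a tag's child nodes are, use the .contents attribute (an attribute, not a method). It returns a list of the tag's direct children, where each element is either a child tag or a text node, so it is easy to see what a tag contains. A tag with no children returns an empty list rather than None; text nodes have no .contents at all, which makes them the leaves of the tree.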
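
A small sketch of .contents on tags with and without children. The is_leaf helper is a hypothetical name introduced only for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul><li>Jan</li><li>Feb</li></ul><br/>", "html.parser")

print(soup.ul.contents)         # list of the two <li> Tag objects
print(soup.ul.li.contents)      # list holding a single text node
print(soup.br.contents)         # empty list: <br> has no children

def is_leaf(node):
    """Hypothetical helper: True for text nodes and for tags without children."""
    return not isinstance(node, Tag) or len(node.contents) == 0

print(is_leaf(soup.ul), is_leaf(soup.br))
```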

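Child nodes can also be reached through the .children attribute, which returns an iterator; loop over it with for to visit every direct child. To get all descendants (children, grandchildren, and so on), use the .descendants attribute, which is also an iterator.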
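
The difference between the two iterators can be sketched as follows, using a small made-up fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")

# .children yields direct children only.
child_names = [child.name for child in soup.div.children]
print(child_names)

# .descendants walks the whole subtree in document order, text nodes included,
# so we keep only Tag objects when collecting names.
descendant_names = [node.name for node in soup.div.descendants if isinstance(node, Tag)]
print(descendant_names)

# The text nodes met along the way.
texts = [str(node) for node in soup.div.descendants if not isinstance(node, Tag)]
print(texts)
```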

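Likewise, .parent gives the parent node, .parents iterates over all ancestors, and .next_siblings and .previous_siblings iterate over sibling nodes. These are attributes too, so they are written without parentheses. They are not covered in detail here.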
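
For reference, a brief sketch of these navigation attributes on a made-up list. The markup is written without whitespace between tags so that the sibling iterators yield only tags.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><ul><li>Jan</li><li>Feb</li><li>Mar</li></ul></div>", "html.parser"
)
feb = soup.find_all("li")[1]

parent_name = feb.parent.name                         # direct parent
ancestor_names = [p.name for p in feb.parents]        # every ancestor, nearest first
after = [s.string for s in feb.next_siblings]         # siblings after this node
before = [s.string for s in feb.previous_siblings]    # siblings before this node

print(parent_name, ancestor_names, after, before)
```

In real pages the whitespace between tags shows up as text-node siblings, so filtering by type is often needed.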

Using standard selectors:

This section covers the two most commonly used methods, find() and find_all().

find_all( name , attrs , recursive , text , **kwargs )

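This method lets us search the document by tag name (name), tag attributes (attrs), text content (text), and so on. Since Beautiful Soup 4.4.0 the text argument is named string; text is the older name for the same argument.

Passing name and attrs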

html='''
<div class="dannial">
<div class="dannial-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Jan</li>
<li class="element">Feb</li>
<li class="element">Mar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Apr</li>
<li class="element">May</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(soup.find_all(attrs={"class":"element"}))


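As you can see, the first print call, find_all('ul'), returns a list in which each element is one whole <ul> tag; the list has as many elements as there are <ul> tags in the HTML.

When passing attrs, we pass a dictionary, and the result is a list of all tags that carry that key-value pair. If a matching tag has child tags, they are printed along with it.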
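
Besides the attrs dictionary, find_all() also accepts attribute filters as keyword arguments. The sketch below reuses a shortened version of the HTML above and uses html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
<li class="element">Jan</li>
<li class="element">Feb</li>
<li class="element">Mar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Apr</li>
<li class="element">May</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, so the keyword form is class_.
by_class = soup.find_all("li", class_="element")
# Other attributes can be passed as keyword arguments directly.
by_id = soup.find_all(id="list-2")
# limit stops the search after the given number of matches.
first_two = soup.find_all("li", limit=2)
# find_all() can also be called on a tag to search only inside it.
in_second = [li.string for li in by_id[0].find_all("li")]

print(len(by_class), [li.string for li in first_two], in_second)
```

Note that class_="list" matches both <ul> tags, because a tag matches when any one of its classes equals the given value.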

find( name , attrs , recursive , text , **kwargs )

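This method works like find_all(), except that it returns only the first matching object (or None when nothing matches) instead of a list.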
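
A short sketch of how find() differs from find_all() in its return value, using a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul id="list-1"><li class="element">Jan</li><li class="element">Feb</li></ul>',
    "html.parser",
)

first = soup.find("li", class_="element")   # a single Tag, not a list
missing = soup.find("table")                # no match: None, not an empty list

print(first.string)
# Guard against None before touching attributes of the result.
print(missing.get_text() if missing is not None else "no <table> found")
# find() gives the same element as the first item of find_all().
print(first is soup.find_all("li", class_="element")[0])
```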

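There are several similar methods:

find_parents() find_parent()

find_parents() returns all matching ancestor nodes; find_parent() returns the nearest matching one, starting with the direct parent.

find_next_siblings() find_next_sibling()

find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling.

find_previous_siblings() find_previous_sibling()

find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling.

find_all_next() find_next()

find_all_next() returns all matching nodes that come after the current node; find_next() returns the first matching one.

find_all_previous() find_previous()

find_all_previous() returns all matching nodes that come before the current node; find_previous() returns the first matching one.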
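
The following sketch exercises one method from each pair on a small made-up fragment, to show the direction in which each one searches:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="panel"><ul id="list-1">'
    "<li>Jan</li><li>Feb</li><li>Mar</li>"
    "</ul><p>footer</p></div>",
    "html.parser",
)
feb = soup.find_all("li")[1]

print(feb.find_parent("div")["class"])                       # nearest matching ancestor
print([t.name for t in feb.find_parents()][:2])              # two nearest ancestors
print([t.string for t in feb.find_next_siblings("li")])      # later siblings
print([t.string for t in feb.find_previous_siblings("li")])  # earlier siblings
print(feb.find_next("p").string)                             # first <p> after feb in the document
print([t.string for t in feb.find_all_previous("li")])       # <li> tags before feb
```

Unlike the sibling methods, find_next() and find_all_previous() follow document order, so they can cross into other branches of the tree.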

CSS selectors

CSS selector syntax overview

To use CSS selectors, call the select() method with a selector string. Like find_all(), it returns a list of matching tags, which can be traversed with a for loop.
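
A minimal sketch of select() with a few common selector forms, reusing part of the earlier HTML. It also shows select_one(), which the text above does not mention and which returns only the first match.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Jan</li>
<li class="element">Feb</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Apr</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("ul li"))                 # descendant selector
print(soup.select("#list-2 .element"))      # id, then class
print(soup.select("ul.list-small"))         # tag with a given class

# select() returns a list of Tag objects, so it can be looped over directly.
for li in soup.select(".panel-body li"):
    print(li.get_text(), li["class"])

# select_one() returns only the first match, or None when nothing matches.
print(soup.select_one("#list-1 li").get_text())
```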