[Python Web Scraping Series] Parsing Methods <4>: Learning the BeautifulSoup Library
2018-02-04 14:42
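BeautifulSoup supports the parsers listed in the table below. The prettify() example after the table parses a web page excerpt (taken from Robinson Crusoe) with the lxml HTML parser.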
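Parser | Usage | Characteristics |
---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents; depends on an external C library |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML; depends on an external C library |
html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant of malformed documents; parses the document the way a browser does; very slow |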
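To make the "tolerant of malformed documents" column concrete, here is a minimal sketch that feeds deliberately broken markup to a parser. It assumes only the built-in "html.parser" so that nothing beyond bs4 needs to be installed; the `markup` string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: neither <p> nor <b> is ever closed.
markup = "<p>Hello <b>world"

# "html.parser" ships with Python, so no extra parser install is needed.
soup = BeautifulSoup(markup, "html.parser")

# The parser closes the dangling tags while it builds the tree,
# so the usual tag access still works.
print(soup.b.string)
print(soup.p.get_text())
```

Swapping the second argument for "lxml" or "html5lib" selects another parser from the table, provided that library is installed. Different parsers may build slightly different trees from the same broken markup, so it is worth naming the parser explicitly.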
    html = """
    <p><b>4 A new life on an island</b></p>
    <p>When day came, the sea was quiet again. I looked for our ship and, to my surprise, it was still there and still in one piece. 'I think I can swim to it,' I said to myself. So I walked down to the sea and before long, I was at the ship and was swimming round it. But how could I get on to it? In the end, I got in through a hole in the side, but it wasn't easy.</p>
    <p>There was a lot of water in the ship, but the sand under the sea was still holding the ship in one place. The back of the ship was high out of the water, and I was very thankful for this because all the ship's food was there. I was very hungry so I began to eat something at once. Then I decided to take some of it back to the shore with me. But how could I get it there?</p>
    <p>I looked around the ship, and after a few minutes, I found some long pieces of wood. I tied them together with rope. Then I got the things that I wanted from the ship. There was a big box of food: rice, and salted meat, and hard ship's bread. I also took many strong knives and other tools, the ship's sails and ropes, paper, pens, books, and seven guns. Now I needed a little sail from the ship, and then I was ready. Slowly and carefully, I went back to the shore. It was difficult to stop my things from falling into the sea, but in the end I got everything on to the shore.</p>
    <p>Now I needed somewhere to keep my things.</p>
    <p>There were some hills around me, so I decided to build myself a little house on one of them. I walked to the top of the highest hill and looked down, I was very unhappy, because I saw then that I was on an island. There were two smaller islands a few miles away, and after that, only the sea. Just the sea, for mile after mile after mile.</p>
    <p>After a time, I found a little cave in the side of a hill. In front of it, there was a good place to make a home. So, I used the ship's sails, rope, and pieces of wood, and after a lot of hard work I had a very fine tent. The cave at the back of my tent was a good place to keep my food, and so I called it my 'kitchen'. That night, I went to sleep in my new home.</p>
    <p>The next day I thought about the possible dangers on the island. Were there wild animals, and perhaps wild people too, on my island? I didn't know, but I was very afraid. So I decided to build a very strong fence. I cut down young trees and put them in the ground, in a half-circle around the front of my tent. I used many of the ship's ropes too, and in the end my fence was as strong as a stone wall. Nobody could get over it, through it, or round it.</p>
    <p>Making tents and building fences is hard work. I needed many tools to help me. So I decided to go back to the ship again, and get some more things.</p>
    <p>I went back twelve times, but soon after my twelfth visit there was another terrible storm. The next morning, when I looked out to sea, there was no ship.</p>
    <p>When I saw that, I was very unhappy. 'Why am I alive, and why are all my friends dead?' I asked myself. 'What will happen to me now, alone on this island without friends? How can I ever escape from it?'</p>
    <p>Then I told myself that I was lucky: lucky to be alive, lucky to have food and tools, lucky to be young and strong. But I knew that my island was somewhere off the coast of South America. Ships did not often come down this coast, and I said to myself, 'I'm going to be on this island for a long time.' So, on a long piece of wood, I cut these words:</p>
    <p>I CAME HERE ON 30TH SEPTEMBER 1659 After that, I decided to make a cut for each day.</p>
    """

    from bs4 import BeautifulSoup as bs

    soup = bs(html, 'lxml')
    print(soup.prettify())
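Running this shows that prettify() tidies up the markup: it returns the document as an indented string with each tag on its own line, which is much easier to read.
Using tag selectors: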
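Access a tag directly as an attribute named after the tag, for example:

    from bs4 import BeautifulSoup as bs

    html = '...'
    soup = bs(html, 'lxml')
    print(soup.title)
    # print(type(soup.title))  # inspect the type of the object that soup.title returns
    print(soup.head)
    print(soup.p)  # if the document contains several <p> tags, this returns only the first one
    print(soup.p.name)  # the name of the tag
    # To read the value of an attribute on a tag (for example "name"), there are two equivalent ways
    print(soup.p.attrs['name'])
    print(soup.p['name'])
    # To get the text inside a tag, use the .string attribute (an attribute, not a method)
    print(soup.p.string)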
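Note: tags can contain child tags. For example, when a <p> tag sits inside the <body> tag and we want to reach that <p>, we can chain the accesses level by level:

    print(soup.body.p.string)  # the text of the first <p> inside <body>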
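Since the snippet above leaves the HTML elided, here is a self-contained sketch of the same tag-selector calls. The small page in `html` is invented for illustration, and "html.parser" is used instead of lxml only so that the example runs without extra installs.

```python
from bs4 import BeautifulSoup

# A made-up page, just large enough to exercise each access pattern.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p name="intro" class="lead">First paragraph</p>
    <p name="second">Second paragraph</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title)            # the whole <title> tag
print(type(soup.title))      # a bs4 Tag object
print(soup.title.string)     # the text inside <title>
print(soup.p.name)           # tag name of the first <p>
print(soup.p["name"])        # attribute lookup, dictionary style
print(soup.p.attrs["name"])  # same value through the .attrs dict
print(soup.p["class"])       # class is multi-valued, so this is a list
print(soup.body.p.string)    # chained access: first <p> inside <body>
```

Note that soup.p only ever reaches the first <p>; the second paragraph needs find_all() or a sibling lookup.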
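To see what a tag's child nodes are, read its .contents attribute (an attribute, not a method). It returns a list in which every element is one direct child, either a tag or a piece of text, so we can easily inspect what sits under a tag. If .contents is an empty list, the tag has no child nodes.
Child nodes are also available through the .children attribute, which returns an iterator instead of a list; looping over it with for yields every direct child. To walk all descendants rather than only the direct children, use .descendants, which is also an iterator.
In the same way we can reach the parent node (.parent), the ancestor nodes (.parents), and the sibling nodes (.next_siblings and .previous_siblings). These are all attributes as well, and we will not go into more detail here.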
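The navigation attributes described above can be tried out with a short sketch like the following. The one-line list in `html` is made up for illustration and is written without whitespace between tags, so that no whitespace-only text nodes show up among the children.

```python
from bs4 import BeautifulSoup

html = '<ul id="months"><li>Jan</li><li>Feb</li><li>Mar</li></ul>'
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# .contents is a plain list of the direct children.
print(ul.contents)

# .children iterates over the same direct children lazily.
child_texts = [li.string for li in ul.children]
print(child_texts)

# .descendants also walks into the children:
# three <li> tags plus the three strings inside them.
descendants = list(ul.descendants)
print(len(descendants))

# Parent, ancestors, and siblings of the middle <li>.
feb = ul.contents[1]
print(feb.parent["id"])
print([tag.name for tag in feb.parents])
print([sib.string for sib in feb.next_siblings])
print([sib.string for sib in feb.previous_siblings])

# A tag with nothing inside it has an empty .contents list.
empty = BeautifulSoup("<p></p>", "html.parser").p
print(empty.contents)
```

With pretty-printed HTML the newlines between tags become text nodes of their own, so .contents and the sibling iterators will return more entries than the number of tags.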
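Using standard selectors:
This section covers the two most commonly used methods, find() and find_all().
find_all( name , attrs , recursive , text , **kwargs )
This method searches the document by tag name (name), tag attributes (attrs), text content (text, which newer versions of BeautifulSoup call string), and so on.
Passing name and attrs: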
    html = '''
    <div class="dannial">
        <div class="dannial-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Jan</li>
                <li class="element">Feb</li>
                <li class="element">Mar</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Apr</li>
                <li class="element">May</li>
            </ul>
        </div>
    </div>
    '''

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('ul'))
    print(soup.find_all(attrs={"class": "element"}))
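As we can see, the first call, find_all('ul'), returns a list in which each element is one entire <ul> tag; the list has as many elements as there are <ul> tags in the HTML.
When passing attrs, we pass a dictionary, and the result is a list of all tags that carry that key-value pair. Each tag is returned whole, so if it contains child tags they are printed along with it.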
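Beyond name and attrs, find_all() accepts a few other filters. The sketch below runs them against a trimmed-down copy of the list markup; it uses "html.parser" to avoid extra installs, and the variable names are chosen here for illustration. It also assumes a reasonably recent bs4 release, where the text filter is spelled string.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
  <li class="element">Jan</li>
  <li class="element">Feb</li>
</ul>
<ul class="list list-small" id="list-2">
  <li class="element">Apr</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by tag name.
by_name = soup.find_all("ul")

# Filter by attribute dictionary.
by_attrs = soup.find_all(attrs={"class": "element"})

# class is a Python keyword, so the keyword-argument form is class_.
by_keyword = soup.find_all("li", class_="element")

# Any other attribute can be passed as a keyword argument directly.
by_id = soup.find_all(id="list-2")

# limit stops the search after the given number of matches.
limited = soup.find_all("li", limit=2)

# string matches on text content and returns the strings themselves.
by_string = soup.find_all(string="Feb")

print(len(by_name), [li.string for li in by_attrs])
print(by_id[0]["class"], [li.string for li in limited], by_string)
```

The attrs dictionary and the keyword form find the same tags here; the keyword form is simply shorter to type for common attributes such as id and class.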
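find( name , attrs , recursive , text , **kwargs )
This method works like find_all(), except that it returns only the first matching object (or None when nothing matches).
There are several similar methods: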
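find_parents() and find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
find_next_siblings() and find_next_sibling()
find_next_siblings() returns all siblings after the node; find_next_sibling() returns the first sibling after it.
find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all siblings before the node; find_previous_sibling() returns the first sibling before it.
find_all_next() and find_next()
find_all_next() returns all matching nodes after the node; find_next() returns the first matching node after it.
find_all_previous() and find_previous()
find_all_previous() returns all matching nodes before the node; find_previous() returns the first matching node before it.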
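The following sketch tries find() and its relatives from one starting node. The markup is a compact, made-up variant of the list example (no whitespace between tags), and "html.parser" is assumed so that the code runs without extra installs.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel-body">'
    '<ul id="list-1"><li>Jan</li><li>Feb</li><li>Mar</li></ul>'
    '<ul id="list-2"><li>Apr</li><li>May</li></ul>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches.
feb = soup.find("li", string="Feb")
print(soup.find("table"))

# Upwards: the closest matching ancestor, and all matching ancestors.
print(feb.find_parent("ul")["id"])
print([tag["class"] for tag in feb.find_parents("div")])

# Sideways: siblings share the same parent <ul>.
print(feb.find_next_sibling("li").string)
print(feb.find_previous_sibling("li").string)
print([li.string for li in feb.find_next_siblings("li")])

# Document order: these searches may leave the parent <ul>.
print(feb.find_next("ul")["id"])
print([li.string for li in feb.find_all_next("li")])
print([li.string for li in feb.find_all_previous("li")])
```

The sibling methods stay inside the first list, while find_next() and find_all_next() follow document order and therefore also reach the items of the second list.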
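CSS selectors
To query with CSS selectors, call the select() method with a selector string; any CSS selector syntax reference table lists the available forms. Like find_all(), select() returns a list of the matching tags, which we can walk through with a for loop.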
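As a minimal sketch of select(), the code below runs a few common selector forms against a shortened copy of the list markup. It assumes a bs4 installation that includes its CSS selector support (installed by default with recent releases) and uses "html.parser" to avoid extra installs; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
  <ul class="list" id="list-1">
    <li class="element">Jan</li>
    <li class="element">Feb</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Apr</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every <li> anywhere inside a <ul>.
all_items = [li.get_text() for li in soup.select("ul li")]

# id and class selectors combined.
second_list = [li.get_text() for li in soup.select("#list-2 .element")]

# Child selector on a tag with a given class.
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]

# select_one() returns only the first match instead of a list.
first_item = soup.select_one("#list-1 li").get_text()

# The result of select() is a list, so a for loop works as usual,
# and each returned tag can be queried again.
for ul in soup.select("ul"):
    print(ul["id"], [li.get_text() for li in ul.select("li")])

print(all_items, second_list, small_items, first_item)
```

Because every result is an ordinary tag object, select() can be mixed freely with find_all() and the attribute access shown earlier.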