
[Python Web Scraping] Reading Notes on Web Scraping with Python (Chapter 2)

2016-12-26 16:11

Web Scraping with Python

Chapter 2: Advanced HTML Parsing

demo1

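This demo uses BS4 to select tags by their CSS class and extract the characters of a novel in the order they appear. Open the URL in a browser and the author's intent will probably become clear.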

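from urllib.request import urlopen
from bs4 import BeautifulSoup

# The URL below is a sample page set up by the book's author
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
# Name the parser explicitly to avoid bs4's "no parser specified" warning
bsobj = BeautifulSoup(html, 'html.parser')

# Collect every <span class="green"> tag on the page
namelist = bsobj.findAll('span', {'class': 'green'})
for name in namelist:
    # .get_text() is a bs4 method that strips all tags and returns only the text
    print(name.get_text())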

demo2

An explanation of the find() and findAll() functions

findAll(tag, attributes, recursive, text, limit, keywords)

find(tag, attributes, recursive, text, keywords)

# tag: pass a single tag name, or a Python list of several tag names, e.g.
bsobj.findAll(['h1', 'h2', 'h3'])

# attributes: a Python dictionary holding a tag's attributes and the values to match, e.g.
bsobj.findAll('span', {'class': ['green', 'red']})

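# recursive: a Boolean, True by default. If True, the search covers all descendant tags (children, children's children, and so on); if False, only the top-level tags are examined, that is, the direct children of the object you call it on.

# text: match on the text content of tags rather than on their attributes.

# limit: caps the number of results returned. It only applies to findAll; find() behaves like findAll() with a limit of 1.

# keywords: select tags that have a specified attribute, passed as a keyword argument, e.g. id='text'.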
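
The parameters above are easy to try without a network connection. Here is a minimal sketch on a made-up HTML snippet. The markup, the variable names, and the choice of 'html.parser' are my own assumptions, not taken from the book. It uses find_all, the current bs4 spelling of findAll, and string, the name newer bs4 releases use for the text argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only
doc = """
<div id="text">
  <h1>Title</h1>
  <p><span class="green">Anna</span> met <span class="red">Pierre</span>.</p>
  <h2>Part one</h2>
  <p><span class="green">Anna</span> left.</p>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

# tag: a list of tag names matches any of them, in document order
headings = [t.name for t in soup.find_all(["h1", "h2", "h3"])]

# attributes: a dict mapping an attribute name to the accepted value(s)
colored = [s.get_text() for s in soup.find_all("span", {"class": ["green", "red"]})]

# limit: stop after the first N matches
first_two = soup.find_all("span", limit=2)

# recursive=False: only direct children of the object being searched
top_level_spans = soup.div.find_all("span", recursive=False)
top_level_paras = soup.div.find_all("p", recursive=False)

# keyword argument: select tags by an attribute, here id="text"
by_id = soup.find_all(id="text")

# string (called text in older bs4): match on text content, returns the strings
annas = soup.find_all(string="Anna")

print(headings, colored, len(first_two), len(top_level_spans), len(top_level_paras))
```

The spans sit inside the p tags, so the recursive=False search on the div finds no spans but does find both paragraphs.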

demo3

A quick introduction to BeautifulSoup's object types

- BeautifulSoup object

- Tag object

- NavigableString object

- Comment object
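
To see the four object types side by side, here is a small sketch on a one-line document of my own invention (the markup and variable names are assumptions for illustration). The BeautifulSoup object is the parsed document as a whole, a Tag is a single element, a NavigableString is text inside a tag, and a Comment is an HTML comment.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup, invented for illustration only
soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")

tag = soup.p                 # Tag: one element of the document
text, note = tag.contents    # the tag's two children: a string and a comment

print(type(soup).__name__)   # BeautifulSoup
print(type(tag).__name__)    # Tag
print(type(text).__name__)   # NavigableString
print(type(note).__name__)   # Comment

# Comment is a subclass of NavigableString
print(isinstance(note, NavigableString))  # True
```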

demo4

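Working with child tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
# Name the parser explicitly to avoid bs4's "no parser specified" warning
bsobj = BeautifulSoup(html, 'html.parser')

# .children yields only the direct children of the giftList table
for child in bsobj.find('table', {'id': 'giftList'}).children:
    print(child)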
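
It helps to compare .children with .descendants here. The sketch below uses a tiny table I made up, loosely modeled on the giftList table (the rows and values are hypothetical, not the real page), to show that .children stops at one level while .descendants keeps going.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table, invented for illustration; not the real page content
doc = ('<table id="giftList">'
       '<tr><th>Item</th><th>Cost</th></tr>'
       '<tr><td>Vegetable Basket</td><td>$15.00</td></tr>'
       '</table>')
soup = BeautifulSoup(doc, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children: direct children only (here, the two <tr> rows)
children = [c.name for c in table.children]

# .descendants: children, grandchildren, and so on; text nodes are filtered out
descendants = [d.name for d in table.descendants if isinstance(d, Tag)]

print(children)     # ['tr', 'tr']
print(descendants)  # ['tr', 'th', 'th', 'tr', 'td', 'td']
```

On a real page the markup usually contains line breaks between tags, so .children will also yield whitespace strings alongside the rows.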

demo5

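Working with sibling tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
# Name the parser explicitly to avoid bs4's "no parser specified" warning
bsobj = BeautifulSoup(html, 'html.parser')

# Prints every product row in the product table, except the first row (the table header)
for sibling in bsobj.find('table', {'id': 'giftList'}).tr.next_siblings:
    print(sibling)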
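
The header row is skipped because a tag is never its own sibling. The following sketch, on a made-up three-row table (the rows are hypothetical), shows this along with previous_siblings, which walks in the opposite direction.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for illustration only
doc = ('<table id="giftList">'
       '<tr><th>Item</th></tr>'
       '<tr><td>Basket</td></tr>'
       '<tr><td>Mug</td></tr>'
       '</table>')
soup = BeautifulSoup(doc, "html.parser")
header_row = soup.find("table", {"id": "giftList"}).tr

# next_siblings yields everything after the tag at the same level;
# the starting tag itself is not included
rows = [row.get_text() for row in header_row.next_siblings]

# previous_siblings walks backwards, nearest sibling first
last_row = soup.find_all("tr")[-1]
before_last = [row.get_text() for row in last_row.previous_siblings]

print(rows)         # ['Basket', 'Mug']
print(before_last)  # ['Basket', 'Item']
```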

demo6

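Working with parent tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
# Name the parser explicitly to avoid bs4's "no parser specified" warning
bsobj = BeautifulSoup(html, 'html.parser')
# Find the image, go up to its parent tag, step to the previous sibling, and print its text
print(bsobj.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())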
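
The chained call is easier to follow when it is split into steps. Here is a sketch on a single table row that I invented for illustration (the cell contents are hypothetical, and the real page may be laid out differently).

```python
from bs4 import BeautifulSoup

# Hypothetical row, invented for illustration; the real page may differ
doc = ('<table><tr>'
       '<td>Basket</td>'
       '<td>$15.00</td>'
       '<td><img src="../img/gifts/img1.jpg"></td>'
       '</tr></table>')
soup = BeautifulSoup(doc, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
cell = img.parent                    # the <td> that wraps the image
price_cell = cell.previous_sibling   # the <td> just before that one
price = price_cell.get_text()

print(cell.name, price)  # td $15.00
```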

demo7

Regular expressions
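
These notes leave this demo without code. As a small stand-in, here is a sketch that uses only Python's built-in re module with the same image-path pattern that demo8 applies. The sample paths are made up for illustration.

```python
import re

# Same pattern as the image filter in demo8, written as a raw string
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Hypothetical paths, invented for illustration
paths = [
    "../img/gifts/img1.jpg",
    "../img/gifts/img12.jpg",
    "../img/gifts/logo.jpg",
    "../img/gifts/img3.png",
]

# Keep the paths in which the pattern is found
matches = [p for p in paths if pattern.search(p)]
print(matches)  # ['../img/gifts/img1.jpg', '../img/gifts/img12.jpg']
```

The dots are escaped so they match literal periods, while the unescaped .* matches any run of characters between "img" and ".jpg".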

demo8

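Combining regular expressions with BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re  # standard library module for regular expressions

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
# Name the parser explicitly to avoid bs4's "no parser specified" warning
bsobj = BeautifulSoup(html, 'html.parser')
# A raw string keeps the backslashes intact; "/" needs no escaping in Python
images = bsobj.findAll('img', {'src': re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])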

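As for the lambda expressions at the end of Chapter 2, the author covers them only in general terms and calls them a perfect alternative to regular expressions. Since I am already using regular expressions, I did not feel like digging into them. That wraps up Chapter 2.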
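
For anyone curious about the lambda approach anyway, here is a rough sketch of how a function can stand in for the regular expression from demo8. The markup is made up, and the filter logic is my own approximation of that pattern, not code from the book.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only
doc = ('<div>'
       '<img src="../img/gifts/img1.jpg">'
       '<img src="../img/gifts/logo.jpg">'
       '<img alt="no source">'
       '</div>')
soup = BeautifulSoup(doc, "html.parser")

# A function passed to find_all (the current name for findAll) receives
# each Tag and returns True for the tags to keep
gift_images = soup.find_all(
    lambda tag: tag.name == "img"
    and tag.get("src", "").startswith("../img/gifts/img")
    and tag.get("src", "").endswith(".jpg")
)
sources = [img["src"] for img in gift_images]
print(sources)  # ['../img/gifts/img1.jpg']
```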
