
Notes on Web Scraping with Python, Chapter 2

2018-12-02 21:54

Starting with the find and find_all functions

The book's explanation of these two functions was posted here.
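As a rough illustration of how the two functions differ, here is a minimal sketch. The markup in SAMPLE_HTML is made up for this example and is not taken from the book's site; the tag names and text are assumptions chosen only to keep the snippet runnable offline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
SAMPLE_HTML = """
<html><body>
<h1>Title</h1>
<span class="green">Anna</span>
<span class="red">Well, Prince</span>
<span class="green">Pierre</span>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like result containing every matching tag.
all_spans = soup.find_all("span")

# The limit argument caps how many matches are returned.
first_two = soup.find_all("span", limit=2)

# find returns only the first match, or None when nothing matches.
first_span = soup.find("span")
missing = soup.find("table")

print(len(all_spans), len(first_two), first_span.get_text(), missing)
```

Because find can return None, it is usually worth checking the result before accessing attributes on it.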

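My own understanding of HTML and CSS

Since I have never written web pages, I worked out a few patterns purely from the page content that find_all and find print out. (Example: open the book's fictional online shopping site yourself and print it with Python.)

1. < > marks a tag

The tag name is what the tag argument of find_all refers to. It identifies an element, such as a heading, purely by its role in the HTML structure.

2. The name=value pairs inside < >

These are the attributes of the HTML element, and they are what the attributes argument of find_all refers to. They allow a precise search for one particular feature, such as the gold-colored text in the example. Whatever the attributes argument expresses can also be written with keyword arguments instead (watch out for Python reserved words: class has to be written as class_).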
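The two points above can be sketched as follows. The HTML is a small made-up fragment (an assumption, loosely modeled on the kind of colored-text page the book uses), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, invented for this illustration.
SAMPLE_HTML = """
<div id="text">
<h1>War and Peace</h1>
<span class="green">Anna Pavlovna</span>
<span class="red">Well, Prince</span>
<span class="green">Prince Vasili</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 1. Search by tag name only.
headings = soup.find_all("h1")

# 2. Search by tag name plus an attributes dictionary.
by_attrs = soup.find_all("span", {"class": "green"})

# The same search written with a keyword argument.
# class is a reserved word in Python, so BeautifulSoup accepts class_.
by_keyword = soup.find_all("span", class_="green")

# Other attributes can be used as keywords directly.
by_id = soup.find_all(id="text")

names = [tag.get_text() for tag in by_attrs]
print(names)
```

Both spellings select the same tags here; the dictionary form is convenient when the attribute name is not a valid Python identifier.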

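children, descendants, next_siblings, and parents

These are attributes of a tag rather than functions, so they are written without parentheses (.children, .descendants, .next_siblings, .parents). For usage, see pp. 17-19 of the book.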
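For readers without the book at hand, here is a minimal sketch of the four navigation attributes. The table markup is made up for this example (an assumption, written without whitespace between tags so that no stray text nodes appear among the children).

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this illustration.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    '<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>'
    '<tr id="gift2"><td>Russian Dolls</td><td>$10,000.52</td></tr>'
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields direct children only (the three rows here).
child_names = [child.name for child in table.children]

# .descendants yields every nested node, text strings included.
descendant_count = len(list(table.descendants))

# .next_siblings yields the nodes after a tag at the same level;
# the tag itself (the header row) is not included.
header_row = table.tr
sibling_ids = [row.get("id") for row in header_row.next_siblings]

# .parents walks upward from a tag toward the document root.
cell = soup.find("td")
parent_names = [parent.name for parent in cell.parents]

print(child_names, descendant_count, sibling_ids, parent_names)
```

On real pages the whitespace between tags also shows up as text nodes in .children and .next_siblings, so results usually need filtering.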

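Regular expressions

Combining regular expressions with BeautifulSoup makes searches much more customizable. The compiled pattern is passed as a value in the attributes argument of find_all. Example code: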

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# Match only the product images, whose src has the form ../img/gifts/img<...>.jpg
images = bsObj.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])

Getting attributes

To get an attribute of a tag, use bsObj.tag.attrs["attribute name"], where bsObj is the BeautifulSoup object. Example code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
images = bsObj.find_all("img")  # collect every <img> tag
for image in images:
    print(image.attrs["src"])  # print the src attribute of each image
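A small offline sketch of the same idea, showing a few equivalent ways to read attributes. The two img tags are made up for this example (an assumption); the second one deliberately has no src.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
SAMPLE_HTML = '<img src="../img/gifts/img1.jpg" alt="gift"><img alt="no source">'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first, second = soup.find_all("img")

# .attrs is a plain dictionary of all attributes on the tag.
all_attrs = first.attrs

# Indexing the tag directly is shorthand for indexing .attrs.
src_direct = first["src"]

# .get returns None instead of raising KeyError when the attribute is absent.
src_safe = second.get("src")

print(all_attrs, src_direct, src_safe)
```

Using .get is the safer choice when looping over many tags, since a single tag without the attribute would otherwise stop the loop with a KeyError.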

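Lambda expressions

Besides regular expressions, lambda expressions are another way to customize what gets scraped. The caveats are listed in the book.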
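A minimal sketch of a lambda used as a filter. The function passed to find_all receives one tag at a time and should return True or False. The markup below is made up for this example (an assumption), and the "exactly two attributes" rule is only one possible filter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
SAMPLE_HTML = """
<div id="wrapper" class="main">
<span class="green">one attribute</span>
<img src="a.jpg" alt="two attributes">
<p>no attributes</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only the tags that carry exactly two attributes.
two_attr_tags = soup.find_all(lambda tag: len(tag.attrs) == 2)

names = [tag.name for tag in two_attr_tags]
print(names)
```

Any condition that can be computed from a tag works here, which makes this a flexible fallback when neither a tag name nor a regular expression describes the target well.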

Additional libraries

If BeautifulSoup does not meet your needs, consider the lxml library or the HTML parser library.

To be continued.
