Reading notes on "Automate the Boring Stuff with Python": Chapter 11, Web Scraping

2018-03-10 22:05

1. Project: mapIt.py with the webbrowser module

The open() function of the webbrowser module launches a new browser and opens the specified URL. Example:

import webbrowser

webbrowser.open('https://www.baidu.com')

1) Command-line arguments

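The sys.argv variable holds a list containing the program's file name followed by the command-line arguments. If the list contains more than just the file name, len(sys.argv) is greater than 1, which means command-line arguments were actually provided.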
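As a minimal sketch of that check, the two helpers below take a sys.argv-style list and report whether any arguments follow the file name. The names split_argv and has_arguments are hypothetical, chosen only for this illustration.

```python
import sys


def split_argv(argv):
    """Return (script_name, arguments) from a sys.argv-style list."""
    return argv[0], argv[1:]


def has_arguments(argv):
    """True when at least one argument follows the script name."""
    return len(argv) > 1


if __name__ == '__main__':
    script, args = split_argv(sys.argv)
    if has_arguments(sys.argv):
        print('%s was given %d argument(s): %s' % (script, len(args), args))
    else:
        print('%s was given no command-line arguments' % script)
```

Passing the list in as a parameter, rather than reading sys.argv inside the helpers, makes them easy to try out with a hand-written list.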

2) Handling the clipboard contents and launching the browser

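import sys, webbrowser, pyperclip

if len(sys.argv) > 1:
    # Get the address from the command line. The first element of the
    # argument list (argv) is the file name, so take everything after it.
    address = ' '.join(sys.argv[1:])
else:
    # Get the address from the clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)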
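The script above concatenates the raw address onto the URL. One possible refinement, which is my own assumption and not something the book does, is to percent-encode the address first so that spaces and slashes cannot break the URL. The helper name build_maps_url is hypothetical.

```python
from urllib.parse import quote


def build_maps_url(address):
    """Build a Google Maps URL for a free-form street address."""
    base = 'https://www.google.com/maps/place/'
    # quote() percent-encodes spaces and other unsafe characters;
    # safe='' also encodes '/' so it cannot be read as a path separator.
    return base + quote(address.strip(), safe='')


print(build_maps_url('870 Valencia St'))
```

The result could then be passed to webbrowser.open() in place of the plain concatenation.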

2. Downloading files from the Web with the requests module

requests is a third-party module and must be installed before use.

1) Downloading a web page with requests.get()

The get() function of the requests module takes a URL and returns a Response object. Example:

>>> import requests
>>> res = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
>>> type(res)
<class 'requests.models.Response'>
>>> res.status_code == requests.codes.ok
True
>>> len(res.text)
178981
>>> print(res.text[:250])
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Proje

2) Checking for errors (raise_for_status())

To check whether a download succeeded, call the raise_for_status() method of the Response object. It raises an exception if the download failed. Example:

import requests

res = requests.get('http://inventwithpython.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
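The example above catches every Exception. A narrower variant, sketched here as my own variation rather than the book's code, catches only requests.exceptions.HTTPError, which is what raise_for_status() raises for 4xx and 5xx status codes. The Response objects are built by hand purely so the sketch runs without network access, and check_response is a hypothetical helper name.

```python
import requests


def check_response(res):
    """Return None if the response is OK, else a short error message."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        # Raised only for 4xx and 5xx status codes.
        return 'There was a problem: %s' % exc
    return None


# Hand-built responses, for illustration only; real ones come from requests.get().
ok = requests.Response()
ok.status_code = 200
missing = requests.Response()
missing.status_code = 404

print(check_response(ok))
print(check_response(missing))
```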

3. Saving downloaded files to the hard drive

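The complete process for downloading a file and saving it is:
a) Call requests.get() to download the file.
b) Call open() with 'wb' to create a new file in write-binary mode.
c) Loop over the Response object's iter_content() method.
d) Call write() on each iteration to write the content to the file.
e) Call close() to close the file.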
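Example:

import requests

res = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
try:
    res.raise_for_status()
    # The file must be opened in binary mode to preserve the text's Unicode encoding.
    with open('RomeoAndJuliet.txt', 'wb') as file:
        # 100,000 bytes is usually a good chunk size.
        for chunk in res.iter_content(100000):
            file.write(chunk)
    # The with statement closes the file, so no explicit close() is needed.
    print('The file downloaded from the Web has been saved to disk')
except Exception as exc:
    print('There was a problem downloading the file: %s' % exc)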
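The write loop can be tried without a network connection, because it only needs an iterable of bytes chunks. The sketch below uses a hand-written list as a stand-in for res.iter_content(100000). The helper name save_chunks and the sample chunks are hypothetical.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return bytes written."""
    total = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total


# Stand-in for res.iter_content(100000): any iterable of bytes works.
fake_chunks = [b'The Project Gutenberg EBook ', b'of Romeo and Juliet']
target = os.path.join(tempfile.mkdtemp(), 'RomeoAndJuliet.txt')
written = save_chunks(fake_chunks, target)
print(written, 'bytes written to', target)
```

With a real download, the call would be save_chunks(res.iter_content(100000), 'RomeoAndJuliet.txt').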

4. HTML

The book recommends these sites for learning HTML:
• http://htmldog.com/guides/html/beginner/
• http://www.codecademy.com/tracks/web/
• https://developer.mozilla.org/en-US/learn/html/

Note: readers in China can also refer to http://www.w3school.com.cn/

5. Parsing HTML with the BeautifulSoup module

The package is installed with pip install beautifulsoup4, but the module you actually import is bs4 (import bs4).

1) Creating a BeautifulSoup object from HTML

Pass the HTML to be parsed to bs4.BeautifulSoup() to get a BeautifulSoup object. Example:

import requests, bs4

res = requests.get('http://nostarch.com')
res.raise_for_status()
# Use html5lib as the parser. html5lib must be installed beforehand;
# omitting the parser argument triggers a warning.
noStarchSoup = bs4.BeautifulSoup(res.text, 'html5lib')
print(type(noStarchSoup))

2) Finding elements with the select() method

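select() takes a CSS selector string. The book lists the patterns of the most commonly used CSS selectors in a table.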
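To illustrate the kinds of selector patterns select() accepts, here is a small sketch run against a made-up HTML string. The HTML and the variable names are hypothetical; the selectors themselves are standard CSS (tag, id, class, descendant, child, and attribute selectors).

```python
import bs4

# Hypothetical HTML, made up for this sketch.
html = '''
<html><body>
<div class="notice"><span id="author">Al Sweigart</span></div>
<p class="slogan">Learn Python</p>
<form><input name="q" type="text"><input type="button" value="Go"></form>
</body></html>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')

by_tag = soup.select('span')                 # all <span> elements
by_id = soup.select('#author')               # the element with id="author"
by_class = soup.select('.notice')            # elements with class "notice"
descendant = soup.select('div span')         # <span> anywhere inside a <div>
child = soup.select('div > span')            # <span> directly inside a <div>
has_attr = soup.select('input[name]')        # <input> with a name attribute
attr_value = soup.select('input[type="button"]')  # <input type="button">

print(by_id[0].getText())
```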

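The select() method returns a list of Tag objects. Example:

>>> import bs4
>>> exampleFile = open('example.html')
>>> exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
>>> elems = exampleSoup.select('#author')
>>> type(elems)
<class 'list'>
>>> len(elems)
1

The list contains one Tag object for every match in the BeautifulSoup object's HTML. Continuing from the code above:

>>> type(elems[0])
<class 'bs4.element.Tag'>

Calling getText() on a Tag returns the element's text, and passing a Tag to str() shows the HTML tag it represents (that is, it converts the Tag to a string). Continuing from the code above:

>>> elems[0].getText()
'Al Sweigart'
>>> str(elems[0])
'<span id="author">Al Sweigart</span>'

A Tag also has an attrs attribute, which holds all of the tag's HTML attributes as a dictionary. Continuing from the code above:

>>> elems[0].attrs
{'id': 'author'}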

3) Getting data from an element's attributes

Use the get() method of a Tag object to read the value of an attribute. Example:

>>> import bs4
>>> soup = bs4.BeautifulSoup(open('example.html'), 'html.parser')
>>> spanElem = soup.select('span')[0]
>>> str(spanElem)
'<span id="author">Al Sweigart</span>'
>>> spanElem.get('id')
'author'
>>> spanElem.get('some_nonexistent_addr') == None
True
>>> spanElem.attrs
{'id': 'author'}
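A common use of get() is collecting link targets. The sketch below, which uses a made-up HTML string, relies on get() returning None for a missing attribute, so links without an href can be filtered out instead of raising an error.

```python
import bs4

# Hypothetical HTML: one link with an href and one without.
html = '<p><a href="https://nostarch.com">No Starch</a> <a>no href</a></p>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# get() returns None for a missing attribute; a['href'] would raise KeyError.
hrefs = [a.get('href') for a in soup.select('a')]
present = [h for h in hrefs if h is not None]
print(present)
```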
