Python Web Scraping (1): Fetching a Page's Source

2018-01-08 16:10

from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("http://www.baidu.com/")
text = BeautifulSoup(html.read(), "html.parser")
print(text)

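In the original book (《Python网络数据采集》, Web Scraping with Python), the fourth line of this code is written as BeautifulSoup(html.read()). Written that way, it may trigger a warning (not an error; the script still runs):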

UserWarning: No parser was explicitly specified
To get rid of this warning, change this: BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "html.parser")
markup_type=markup_type))

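This happens because no parser was specified. When the parser is left out, BeautifulSoup picks the best one installed on the current system, so the same code may be parsed by different parsers on different machines, and that can lead to warnings or inconsistent results. Following the suggestion in the warning, we pass "html.parser" as the second argument.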

I. The libraries used

1. urllib

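urllib provides a set of functions for working with URLs. Its request module makes it very easy to fetch the content of a URL: it sends a GET or POST request to the given page and returns the HTTP response. By setting request headers, a request can also be made to look as if it came from a browser.

urllib is part of Python's standard library. It contains functions for requesting data over the network, handling cookies, and changing metadata such as request headers and the user agent.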
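As a minimal sketch of the "look like a browser" idea, a request can be built with a custom User-Agent header. The User-Agent string below is a made-up value used only for illustration, and the request is only constructed here, not sent.

```python
from urllib.request import Request

# Hypothetical User-Agent string, chosen only for illustration.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"

# Build the request object; nothing is sent until it is passed to urlopen().
req = Request("http://www.baidu.com/", headers={"User-Agent": BROWSER_UA})

# Request normalizes header names with str.capitalize(), hence "User-agent".
sent_ua = req.get_header("User-agent")
method = req.get_method()  # "GET", because no data was attached
print(method, sent_ua)
```

Passing req to urlopen(req) instead of a plain URL string would send the request with this header.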

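Note: Python 2 had two libraries, urllib and urllib2. In Python 3, urllib2 was merged into urllib, which is split into several submodules: urllib.request, urllib.parse, and urllib.error. In Python 3 the function is usually imported with from urllib.request import urlopen. Writing import urllib.request also works, but the call must then be spelled urllib.request.urlopen(...). What does fail is the Python 2 style, such as import urllib2 or urllib.urlopen(...).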
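A short sketch of the Python 3 layout may help here. It touches each of the three submodules without making a network request. The /s?wd=... path and the query value are assumptions used only to have something to parse.

```python
import urllib.request
from urllib.request import urlopen
from urllib.parse import urlparse, urlencode
from urllib.error import HTTPError, URLError

# Both import styles reach the very same function object.
same_function = urllib.request.urlopen is urlopen

# urllib.parse splits a URL into its components and builds query strings.
parts = urlparse("http://www.baidu.com/s?wd=python")
query = urlencode({"wd": "python scraping"})

# urllib.error: HTTPError is a subclass of URLError.
http_is_url_error = issubclass(HTTPError, URLError)

print(same_function, parts.netloc, parts.path, query, http_is_url_error)
```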

urlopen opens a remote object over the network and reads it. It can read HTML files, image files, or any other file stream.
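One detail worth seeing in code: read() returns bytes, which usually have to be decoded before being treated as text. The sketch below uses a data: URL (an assumption made so the example runs without a network connection); with an http:// URL the calls are the same.

```python
from urllib.parse import quote
from urllib.request import urlopen

# A data: URL carries the document inline, so no server is contacted.
url = "data:text/html;charset=utf-8," + quote("<h1>Hello</h1>")

with urlopen(url) as response:
    raw = response.read()  # bytes, not str
    # Use the charset announced in the headers, falling back to UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"

page = raw.decode(charset)
print(type(raw).__name__, page)
```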

2. BeautifulSoup

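The BeautifulSoup library extracts data from HTML or XML files. It parses the document into a tree structure, from which a given tag and its attributes can be retrieved conveniently.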
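To make the tree idea concrete, here is a small sketch that parses an inline HTML string instead of a downloaded page. The markup is invented for the example, and bs4 is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Invented markup, used only to have something to parse.
html = (
    "<html><body>"
    "<h1 id='top'>Hello</h1>"
    "<a href='/a'>A</a><a href='/b'>B</a>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Walk down the tree by tag name, then read text and attributes.
heading = soup.body.h1
heading_text = heading.get_text()
heading_id = heading["id"]

# find_all returns every matching tag in document order.
links = [a["href"] for a in soup.find_all("a")]

print(heading_text, heading_id, links)
```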

II. Making the code more robust

To make the code more readable and more robust, the snippet at the top can be rewritten with error handling:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Fetch the page; give up if the server is unreachable or returns an error status.
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None
    # Parse the page; bsObj.body is None when there is no <body>, so .h1 raises AttributeError.
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        return None
    return title

title = getTitle("http://music.163.com/")
if title is None:
    print("Title could not be found")
else:
    print(title)

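Several things can go wrong when fetching a page over the network: the URL is mistyped, the page does not exist on the server, or the server itself does not exist. The code should therefore handle these cases explicitly, so that it does not crash and so that the cause of a failure is easy to find.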
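The three failure cases can be told apart by exception type. The sketch below is one possible way to do it; fetch_html is a hypothetical helper name, and the two failing URLs are chosen so that the example runs without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_html(url):
    """Return the page body as bytes, or None if the fetch fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        print("server answered with an error status:", e.code)
    except URLError as e:
        # The server (or the resource) could not be reached at all.
        print("could not reach the server:", e.reason)
    except ValueError as e:
        # The string is not a usable URL, e.g. the scheme is missing.
        print("not a usable URL:", e)
    return None

bad_url = fetch_html("www.baidu.com")              # scheme missing
missing = fetch_html("file:///no/such/page.html")  # nothing to open
```

HTTPError has to be caught before URLError, because it is a subclass of URLError and would otherwise never reach its own branch.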
