
Web Scraping with Python — A First Look at Web Crawlers

2019-04-21 12:19

1. A simple crawler:

from urllib import request

# Fetch the page and print the raw response body (bytes)
html = request.urlopen("http://baidu.com")
print(html.read())
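One detail worth noting: `read()` returns raw bytes, not text. A minimal sketch of the decoding step — the byte string below is a stand-in for what `urlopen(...).read()` would return, and UTF-8 is an assumption about the page's encoding:

```python
# urlopen(...).read() gives bytes; decode them before treating them as text.
raw = b'<html><head><title>\xe7\x99\xbe\xe5\xba\xa6</title></head></html>'  # stand-in for html.read()
text = raw.decode("utf-8")  # assumes the page is UTF-8 encoded
print(text)
```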

2. BeautifulSoup

  • A simple example:
from urllib import request
from bs4 import BeautifulSoup

html = request.urlopen("https://blog.csdn.net/qq_34908167/article/details/78849590")
bs = BeautifulSoup(html.read(), "html.parser")
print(bs.a)  # the first <a> tag in the document
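Attribute access like `bs.a` only returns the first matching tag. A minimal sketch of a few other common lookups, using a small local HTML string so it runs without a network request:

```python
from bs4 import BeautifulSoup

# A local HTML snippet standing in for a downloaded page
doc = '<html><body><h1>Title</h1><a href="/x">first</a><a href="/y">second</a></body></html>'
bs = BeautifulSoup(doc, "html.parser")

print(bs.a)              # first <a> tag, equivalent to bs.find("a")
print(bs.find_all("a"))  # list of every <a> tag in the document
print(bs.h1.get_text())  # just the text inside the <h1> tag
```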

3. Handling exceptions:

The web is messy: every site is formatted differently, and all sorts of things can go wrong while collecting data. To avoid wasted work, we should add exception handling to the program so that problems are noticed promptly.

Three kinds of errors commonly occur:
1. The page does not exist, or an error occurs while fetching it (HTTPError)
2. The server does not exist (URLError)
3. The server is reached successfully, but the tag the program looks for is missing (AttributeError)
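The third case is easy to trigger: BeautifulSoup returns `None` for a tag that is not in the page, and calling a method on `None` raises AttributeError. A small self-contained sketch:

```python
from bs4 import BeautifulSoup

# This page has no <h1> tag
bs = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")
print(bs.find("h1"))  # a missing tag returns None rather than raising

try:
    bs.find("h1").get_text()  # calling a method on None raises AttributeError
except AttributeError:
    print("Tag was not found")
```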

  • When an exception occurs, it can be handled as follows:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("https://pythonscrapingthisurldoesnotexist.com")
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print("The server returned an HTTP error")
except URLError as e:
    print("The server could not be found!")
else:
    print(html.read())

4. Final code:

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The page does not exist, or an error occurred while fetching it
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        # The page has no <body> or <h1> tag
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)