Python Web Scraping Primer, Chapter 1: Getting Started with the Requests Library

2017-03-30 10:31

Chapter 1: Getting Started with the Requests Library

1. Install from the command line:

pip install requests

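2. The 7 main methods of the Requests library

Method	Description
requests.request()	Constructs a request; the base method underlying all the methods below
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches only the header information of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()	Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()	Submits a partial-update request to an HTML page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to an HTML page; corresponds to HTTP DELETE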

2.1 requests.get()

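requests.get(url, params=None, **kwargs)

url: URL of the page to fetch

params: extra parameters appended to the URL as a query string; a dictionary or byte sequence; optional

**kwargs: optional keyword arguments that control the request (such as headers and timeout), passed through to requests.request()

This method returns a Response object.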
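To see what params does without sending anything over the network, one option is to build the request and inspect the prepared URL. This is only a minimal sketch: the httpbin.org URL and the key/value pairs are illustrative assumptions, not part of the original example.

```python
import requests

# Build (but do not send) a GET request to see how `params` ends up in the URL.
req = requests.Request(
    "GET",
    "http://httpbin.org/get",                      # illustrative URL (assumption)
    params={"key1": "value1", "key2": "value2"},   # illustrative parameters (assumption)
)
prepared = req.prepare()

# The dictionary is encoded as a query string and appended to the URL.
print(prepared.url)
```

Calling requests.get() with the same url and params should send a request to this prepared URL.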

2.2 requests.head()

requests.head(url,**kwargs)

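url: URL of the page whose headers are to be fetched

**kwargs: optional keyword arguments that control the request, passed through to requests.request()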

>>> r = requests.head('http://www.baidu.com')
>>> r.headers
{'Server': 'bfe/1.0.8.18', 'Date': 'Wed, 29 Mar 2017 15:51:08 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 13 Jun 2016 02:50:06 GMT', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Content-Encoding': 'gzip'}
>>> r.text
''

2.3 requests.post()

requests.post(url, data=None, json=None, **kwargs)

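url: URL of the resource to post to

data: a dictionary, byte sequence, or file object, sent as the body of the request

json: data in JSON format, sent as the body of the request

**kwargs: optional keyword arguments that control the request, passed through to requests.request()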

>>> # POST a dictionary to the URL; it is automatically encoded as form data
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{...
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
...
}
>>> # POST a string to the URL; it is automatically placed in the data field
>>> r = requests.post('http://httpbin.org/post', data='ABC')
>>> print(r.text)
{ ...
  "data": "ABC",
  "form": {},
...
}
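The difference between passing a dictionary, a string, or the json argument can also be checked offline by preparing the requests and looking at the body and Content-Type header. This is a sketch under the assumption that the prepared request reflects what would be sent; the payload values are illustrative.

```python
import json
import requests

url = "http://httpbin.org/post"
payload = {"key1": "value1", "key2": "value2"}  # illustrative data (assumption)

# A dictionary passed as `data` is form-encoded.
form = requests.Request("POST", url, data=payload).prepare()
# A string passed as `data` is used as the body unchanged.
raw = requests.Request("POST", url, data="ABC").prepare()
# A dictionary passed as `json` is serialized to a JSON document.
as_json = requests.Request("POST", url, json=payload).prepare()

print(form.headers.get("Content-Type"), form.body)
print(raw.headers.get("Content-Type"), raw.body)
print(as_json.headers.get("Content-Type"), json.loads(as_json.body))
```

This matches the httpbin output: form-encoded data shows up under "form", while a plain string shows up under "data".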

2.4 requests.put()

requests.put(url, data=None, **kwargs)

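url: URL of the resource to update

data: a dictionary, byte sequence, or file object, sent as the body of the request

**kwargs: optional keyword arguments that control the request, passed through to requests.request()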

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.put('http://httpbin.org/put', data = payload)
>>> print(r.text)
{ ...
"form": {
"key2": "value2",
"key1": "value1"
},
}

2.5 requests.patch()

requests.patch(url, data=None, **kwargs)

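url: URL of the resource to update

data: a dictionary, byte sequence, or file object, sent as the body of the request

**kwargs: optional keyword arguments that control the request, passed through to requests.request()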

2.6 requests.delete()

requests.delete(url, **kwargs)

url: URL of the resource to delete

**kwargs: optional keyword arguments that control the request, passed through to requests.request()

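2.7 Understanding the difference between PATCH and PUT

Suppose the URL holds a record called UserInfo with 20 fields, including UserID, UserName, and so on.

Requirement: the user changes UserName and nothing else.

With PATCH, only a partial update containing UserName is submitted to the URL.

With PUT, all 20 fields must be submitted to the URL together; any field that is not submitted is deleted.

The main benefit of PATCH: it saves network bandwidth.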
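The bandwidth argument can be made concrete by comparing the request bodies. The sketch below prepares both requests without sending them; the field names other than UserID and UserName, all the values, and the httpbin.org URLs are made up for illustration.

```python
import requests

# A hypothetical 20-field UserInfo record; only UserName has changed.
user_info = {"UserID": "1001", "UserName": "new_name"}
user_info.update({f"Field{i}": f"value{i}" for i in range(3, 21)})  # 18 placeholder fields

# PUT replaces the whole resource, so all 20 fields go into the body.
put_req = requests.Request("PUT", "http://httpbin.org/put", data=user_info).prepare()

# PATCH sends only the field that changed.
patch_req = requests.Request(
    "PATCH", "http://httpbin.org/patch", data={"UserName": "new_name"}
).prepare()

print("PUT body length:  ", len(put_req.body))
print("PATCH body length:", len(patch_req.body))
```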

3. The Response object

>>> import requests
>>> r = requests.get('http://www.baidu.com')
>>> print(r.status_code)
200
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Server': 'bfe/1.0.8.18', 'Date': 'Wed, 29 Mar 2017 15:23:34 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:32 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Content-Encoding': 'gzip'}

The Response object contains all the information returned by the server, as well as information about the request that was sent.

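3.1 Attributes of the Response object

Attribute	Description
r.status_code	Status code of the HTTP response; 200 means success, 404 means the page was not found
r.text	The response body as a string, i.e. the content of the page at the URL
r.encoding	The response encoding guessed from the HTTP headers
r.apparent_encoding	The response encoding inferred from the content itself (a fallback encoding)
r.content	The response body in binary (bytes) form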

>>> r = requests.get('http://www.baidu.com')
>>> r.status_code
200
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>?\x99??o|??\x80??\x8b??\x8c??\xa0?°±?\x9f￥é\x81\x93</title></head>...
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding = 'utf-8'
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head>...

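r.encoding: if the header contains no charset, the encoding is assumed to be ISO-8859-1. r.text decodes the page content according to r.encoding.

r.apparent_encoding: the encoding inferred by analyzing the page content. It can be treated as a fallback for r.encoding.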
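Why the title is garbled under ISO-8859-1 and readable under utf-8 can be reproduced with plain byte strings. This is a simplified sketch that assumes r.text behaves roughly like decoding r.content with r.encoding; the variable names are illustrative.

```python
# The page title from the session above, as the bytes a server would send in UTF-8.
page = "<title>百度一下，你就知道</title>"
content = page.encode("utf-8")           # plays the role of r.content

# Decoding with the wrong codec mirrors r.text while r.encoding is 'ISO-8859-1'.
garbled = content.decode("ISO-8859-1")

# Decoding with the right codec mirrors r.text after r.encoding = 'utf-8'.
fixed = content.decode("utf-8")

print(fixed == page)                     # the original text is recovered
print(len(garbled), len(fixed))          # ISO-8859-1 yields one character per byte
```

ISO-8859-1 maps every byte to a character, so the wrong decoding never raises an error; it just produces unreadable text.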

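4. Exceptions

4.1 Exceptions raised by the Requests library

Exception	Description
requests.ConnectionError	Network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError	HTTP error
requests.URLRequired	A URL is required but missing
requests.TooManyRedirects	The maximum number of redirects was exceeded
requests.ConnectTimeout	Timed out while connecting to the remote server
requests.Timeout	The request to the URL timed out (covers both connect and read timeouts)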
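Some of these exceptions are subclasses of others, so the order of checks matters when telling them apart. The sketch below uses a hypothetical helper, describe(), whose name and labels are illustrative; it tests the most specific class first.

```python
import requests

def describe(exc):
    """Return a short label for an exception, checking the most specific class first."""
    if isinstance(exc, requests.ConnectTimeout):
        return "connect timeout"       # subclass of both ConnectionError and Timeout
    if isinstance(exc, requests.Timeout):
        return "timeout"
    if isinstance(exc, requests.ConnectionError):
        return "connection error"
    if isinstance(exc, requests.HTTPError):
        return "http error"
    if isinstance(exc, requests.RequestException):
        return "other requests error"  # base class of all the exceptions above
    return "not a requests error"

print(describe(requests.ConnectTimeout()))
print(describe(requests.TooManyRedirects()))
```

The same ordering applies to except clauses: a handler for requests.Timeout placed before one for requests.ConnectTimeout would catch both.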

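4.2 Exceptions raised from a Response

Method	Description
r.raise_for_status()	Raises requests.HTTPError if the status code indicates an error (4xx or 5xx)

raise_for_status() checks r.status_code internally, so no extra if statement is needed, and it makes error handling with try-except straightforward.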
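Which status codes actually trigger the exception can be checked offline. The sketch below builds an empty Response by hand purely for demonstration (normally a Response comes from a request); is_error() is a hypothetical helper name.

```python
import requests

def is_error(status_code):
    """Return True if raise_for_status() raises HTTPError for this status code."""
    r = requests.Response()        # hand-built Response, for illustration only
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

for code in (200, 302, 404, 500):
    print(code, is_error(code))
```

In this sketch only the 4xx and 5xx codes raise; a redirect code such as 302 does not.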

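5. A general-purpose code framework for fetching web pages

import requests

def getHTMLText(url):
    """Return the decoded text of the page at url, or an error message."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # decode using the encoding detected from the content
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, HTTP errors and invalid URLs
        return 'An exception occurred'

if __name__ == '__main__':
    url = 'http://www.baidu.com'
    print(getHTMLText(url))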
