
Getting Started with Python Crawlers: Building an Opener with the Handler Classes in urllib.request

2019-07-22 10:36
Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting. Original article: https://blog.csdn.net/qq_36365528/article/details/96831245


    When making requests, you will often run into login authentication, cookie handling, proxy settings, and so on; this is where Handlers come in. BaseHandler in the urllib.request module is the parent class of all Handler classes and provides the most basic methods, such as default_open() and protocol_request().
    The basic Handler classes are as follows:

    Class                     Purpose
    HTTPDefaultErrorHandler   Handles HTTP response errors by raising an HTTPError exception
    HTTPRedirectHandler       Handles redirects
    HTTPCookieProcessor       Handles cookies
    ProxyHandler              Sets a proxy; the default proxy list is empty
    HTTPPasswordMgr           Manages passwords; maintains a table of usernames and passwords
    HTTPBasicAuthHandler      Manages HTTP Basic authentication
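Before the individual examples below, it may help to see how handlers plug together. The following sketch (no network access; class names are from the table above) builds one opener from several handlers at once and inspects the resulting handler chain:

```python
import http.cookiejar
from urllib.request import ProxyHandler, HTTPCookieProcessor, build_opener

# build_opener() accepts any number of handler instances and merges them
# with the default chain (redirect handling, error handling, etc.).
cookie_jar = http.cookiejar.CookieJar()
opener = build_opener(
    ProxyHandler({}),                 # empty dict: disable all proxies
    HTTPCookieProcessor(cookie_jar),  # attach cookie handling
)

# The opener's handler list contains both our handlers and the defaults.
names = [type(h).__name__ for h in opener.handlers]
print('HTTPCookieProcessor' in names, 'HTTPRedirectHandler' in names)
```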

    Authentication

    from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
    from urllib.error import URLError
    
    username = 'username'
    password = 'password'
    url = 'https://www.baidu.com'
    
    p = HTTPPasswordMgrWithDefaultRealm()
    p.add_password(None, url, username, password)
    auth_handler = HTTPBasicAuthHandler(p)
    opener = build_opener(auth_handler)
    
    try:
        result = opener.open(url)
        html = result.read().decode('utf-8')
        print(html)
    except URLError as e:
        print(e.reason)
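The password-manager lookup can be checked without any network traffic. Passing None as the realm registers the credentials as a fallback for every realm on that URL; the host and credentials below are placeholders:

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

p = HTTPPasswordMgrWithDefaultRealm()
# realm=None makes these credentials the default for any realm at this URL
# (example.com and the credentials are hypothetical)
p.add_password(None, 'https://example.com', 'user', 'secret')

# Lookup with an unrelated realm still falls back to the default entry.
print(p.find_user_password('some-realm', 'https://example.com'))
# ('user', 'secret')
```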

    Proxies

    from urllib.error import URLError
    from urllib.request import ProxyHandler, build_opener
    
    proxy_handler = ProxyHandler({
        'http': '113.128.8.9:9999',
        'https': '113.128.8.9:9999'
    })
    opener = build_opener(proxy_handler)
    try:
        response = opener.open('https://www.baidu.com')
        print(response.read().decode('utf-8'))
    except URLError as e:
        print(e.reason)
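If the proxy itself requires authentication, the credentials can be embedded in the proxy URL; ProxyHandler keeps the mapping in its proxies attribute. A sketch (the address and credentials are placeholders, not a working proxy):

```python
from urllib.request import ProxyHandler, build_opener

# user:password@host:port — hypothetical credentials and proxy address
proxy_handler = ProxyHandler({
    'http': 'http://user:password@113.128.8.9:9999',
})
opener = build_opener(proxy_handler)

# ProxyHandler stores the scheme-to-proxy mapping it was given.
print(proxy_handler.proxies['http'])
```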

    Cookies

    Getting cookies

    import http.cookiejar, urllib.request
    
    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    for item in cookie:
        print(item.name + " = " + item.value)

    Saving the cookies

    Saving in the plain (Mozilla/Netscape) format

    import http.cookiejar, urllib.request
    
    filename = 'cookies.txt'
    cookie = http.cookiejar.MozillaCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    cookie.save(ignore_discard=True, ignore_expires=True)

    Saving in LWP format

    import http.cookiejar, urllib.request
    
    filename = 'cookies.txt'
    cookie = http.cookiejar.LWPCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    cookie.save(ignore_discard=True, ignore_expires=True)

    Reading and using cookies

    import http.cookiejar, urllib.request
    
    cookie = http.cookiejar.LWPCookieJar()
    cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    print(response.read().decode('utf-8'))
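Once an opener is built, urllib.request.install_opener() can register it as the module-wide default, so that plain urlopen() calls also send and store cookies. A sketch (no request is actually made here):

```python
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

# After this call, urllib.request.urlopen() routes through this opener,
# so every subsequent request automatically carries the cookie jar.
urllib.request.install_opener(opener)
```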