
Introduction to Python Web Scraping

2016-07-14 20:28
Building the cookie handler

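import http.cookiejar
import urllib.request
# Cookie jar backed by cookie.txt; the opener stores and sends cookies through it
cookie = http.cookiejar.MozillaCookieJar('cookie.txt')
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)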


Saving and loading cookies

# Save cookies to cookie.txt (call save() on the jar the opener used, after the requests)
cookie = http.cookiejar.MozillaCookieJar('cookie.txt')
cookie.save(ignore_discard=True, ignore_expires=True)
# Load cookies from cookie.txt
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
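
The save/load round trip can be tried without any network access. The following is a minimal sketch, assuming a hand-built cookie and a temporary file instead of cookie.txt; the `make_cookie` helper, the cookie name `sessionid` and the domain `example.com` are made up for illustration and are not part of the original snippet.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper, for illustration only)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Use a temporary file so the sketch does not touch a real cookie.txt.
path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Save: the file name passed to the constructor is the one save() writes to.
jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie("sessionid", "abc123", "example.com"))
jar.save(ignore_discard=True, ignore_expires=True)

# Load: session cookies are only restored when ignore_discard=True.
loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
restored = {c.name: c.value for c in loaded}
```

Session cookies (those without an expiry time) are marked as "discard", which is why both calls pass ignore_discard=True; without it they would be skipped on save and on load.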


Building the request headers

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
}
# opener.addheaders expects a list of (name, value) tuples
header = []
for key, value in headers.items():
    elem = (key, value)
    header.append(elem)
opener.addheaders = header
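
The loop only turns the dictionary into a list of (name, value) tuples, so it can be written more compactly. A minimal sketch, assuming a fresh opener and shortened placeholder header values (the User-Agent string here is not a real browser string):

```python
import urllib.request

headers = {
    "Accept": "text/html,application/xhtml+xml",
    "User-Agent": "Mozilla/5.0 (example placeholder)",
}

opener = urllib.request.build_opener()
# addheaders is a list of (name, value) tuples, which is what dict.items() yields.
opener.addheaders = list(headers.items())
```

Assigning to addheaders replaces the opener's default header list, including the default Python-urllib User-agent entry.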


Building the POST data

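import urllib.parse
# xdvfb must already be defined; where its value comes from is not shown here
postRowdata = {
    'id': '*************',
    'pwd': '************',
    'xdvfb': xdvfb
}
# urlencode() returns a str; encode() turns it into the bytes that open() expects
postData = urllib.parse.urlencode(postRowdata).encode()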
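
To see what the encoding step produces, here is a small sketch with placeholder form fields (the field names and values are assumptions; a real login form defines its own):

```python
import urllib.parse

# Placeholder form fields; real field names depend on the target site's login form.
form = {"id": "alice", "pwd": "p@ss word"}

encoded = urllib.parse.urlencode(form)  # str of percent-encoded key=value pairs
post_data = encoded.encode()            # bytes (UTF-8), the form urllib sends as the body
```

Special characters are escaped: "@" becomes "%40" and the space becomes "+", so the value survives the trip in the request body.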


Requesting the site

# With data, the request is sent as a POST
result = opener.open(postUrl, postData)
# Without data, the request is sent as a GET
result = opener.open(postUrl)
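
The two calls differ only in whether a data argument is passed, and that is what decides the HTTP method. This can be checked without sending anything by building Request objects directly; a sketch, assuming a placeholder URL and form field:

```python
import urllib.parse
import urllib.request

url = "http://example.com/login"  # placeholder URL, not from the original post
data = urllib.parse.urlencode({"id": "alice"}).encode()

post_request = urllib.request.Request(url, data=data)  # body present
get_request = urllib.request.Request(url)              # no body

post_method = post_request.get_method()
get_method = get_request.get_method()
```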


Decompressing the response

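import gzip

def ungzip(data):
    try:
        # Try to decompress
        print('Decompressing...')
        data = gzip.decompress(data)
        print('Decompression finished!')
    except (OSError, EOFError):
        # gzip raises OSError when the data is not gzip-compressed
        print('Not compressed, nothing to decompress\n')
    return data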
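
An alternative to relying on the exception is to look at the first two bytes, since every gzip stream starts with the same magic number. This is a sketch of that variant, not the approach used above; the name `ungzip_if_needed` is made up for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def ungzip_if_needed(data: bytes) -> bytes:
    """Return decompressed bytes if data looks like gzip, else return it unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


original = "<html>hello</html>".encode()
from_gzip = ungzip_if_needed(gzip.compress(original))
from_plain = ungzip_if_needed(original)
```

Both calls give back the original bytes: the compressed copy is decompressed, and the plain copy passes through untouched.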


Saving the page

page = result.read()
page = ungzip(page)
# Write the raw bytes; the with block closes the file
with open('logined.html', 'wb') as f:
    f.write(page)


Downloading a file

# Fetch imgurl and save it locally as file.txt
urllib.request.urlretrieve(imgurl, 'file.txt')
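
Note that urlretrieve goes through the module-level opener, so the cookies and headers attached to a custom opener are not used unless urllib.request.install_opener(opener) has been called first. Another option is to stream the response from the opener yourself. A minimal sketch, assuming a hypothetical `download` helper and a file:// URL standing in for a real download link so that it runs offline:

```python
import pathlib
import shutil
import tempfile
import urllib.request


def download(opener, url, dest):
    """Stream url to dest through the given opener (hypothetical helper)."""
    with opener.open(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


# Offline demonstration: a file:// URL stands in for a real download link.
workdir = pathlib.Path(tempfile.mkdtemp())
source = workdir / "source.bin"
source.write_bytes(b"example payload")
target = workdir / "copy.bin"

download(urllib.request.build_opener(), source.as_uri(), target)
```

In a real run, the opener built earlier (with its cookie handler and headers) would be passed in place of the fresh one.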
Tags: python, cookie, web scraping