
Getting Started with Python 3 Web Scraping: Batch-Downloading Images from Baidu Tieba

2016-05-03 16:05

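Baidu Tieba pages have a fairly simple structure, and thread content can be read without logging in, so batch-downloading Tieba images makes a good practice project for getting started with web scraping in Python.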

This article uses Python 3. The main modules are:

requests, which is friendlier to use than urllib
bs4 (BeautifulSoup)
re, the regular expression module
os

The program downloads the images from a single-page or multi-page thread and saves them to the local file system.

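The first step is to get the total number of pages in the thread. Change tieba_url to the thread you want to download; the header values can be captured with a tool such as Fiddler.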

import re
import requests

# Request headers copied from a real browser session.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive'
}
tieba_url = 'http://tieba.baidu.com/p/2491624899'
response = requests.get(tieba_url, headers=header).content.decode('UTF-8')
# The pager text reads 共<span class="red">N</span>页, meaning "N pages in total".
pattern = re.compile(r'共<span class="red">([0-9]*)</span>页')
max_page = pattern.findall(response)[0]

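A simple regular expression extracts the page count here: the parenthesized group ([0-9]*) captures the digits between the tags, so findall returns just the number. Regular expression syntax is well documented elsewhere, so it is not covered further in this article.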
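
As a small offline illustration of that capture group, the sketch below runs the same kind of pattern against a made-up HTML fragment. The sample_html string and the parse_max_page helper are assumptions introduced for this example; the pattern is also tightened from [0-9]* to [0-9]+ so that an empty match is not accepted, and a default is returned when no pager is found.

```python
import re

# Hypothetical fragment imitating the pager markup that the pattern targets.
sample_html = '<li class="l_reply_num">共<span class="red">12</span>页</li>'

# [0-9]+ requires at least one digit; the parentheses capture only the number.
PAGE_COUNT = re.compile(r'共<span class="red">([0-9]+)</span>页')


def parse_max_page(html, default=1):
    """Return the page count found in html, or default when no pager is present."""
    match = PAGE_COUNT.search(html)
    return int(match.group(1)) if match else default


print(parse_max_page(sample_html))
print(parse_max_page('<html></html>'))
```

Returning an integer rather than a string makes the later single-page check and the page loop a little simpler.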

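With the page count known, we can tell whether a thread has one page or several. Each page of a multi-page thread is handled in essentially the same way as a single-page thread, only repeated once per page, so the per-page processing is written as a function.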

def get_image(url, page):
    # Download every post image on one page into D:/tmp/tieba/<page>/.
    html = requests.get(url, headers=header).content.decode('UTF-8')
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(html, 'html.parser')
    pics = soup.find_all("img", {"class": "BDE_Image", "pic_type": "0"})
    save_path = 'D:/tmp/tieba/%s/' % page
    # Create the folder, and any missing parent folders, if it does not exist yet.
    os.makedirs(save_path, exist_ok=True)
    for i, pic in enumerate(pics):
        img_url = pic['src']
        img = requests.get(img_url).content
        with open(save_path + '%s.jpg' % i, 'wb') as f:
            f.write(img)
        print(img_url + ' saved')

pics = soup.find_all("img", {"class": "BDE_Image", "pic_type": "0"})

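This statement finds all the images inside posts (signature images and other images are excluded). It relies on the structure of Tieba's HTML pages, where every image in a post has the following form: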

<img pic_type="0" class="BDE_Image" src="http://imgsrc.baidu.com/forum/*.jpg" pic_ext="bmp"  height="680" width="485">
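
To see the attribute filter at work without a network request or third-party packages, here is a minimal sketch that applies the same two conditions (class BDE_Image and pic_type 0) using the standard library's html.parser instead of BeautifulSoup. This is a swapped-in technique for illustration only, not what the article's program uses; the PostImageParser class and the sample markup are assumptions.

```python
from html.parser import HTMLParser


class PostImageParser(HTMLParser):
    """Collect src values of <img> tags with class BDE_Image and pic_type 0."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attributes = dict(attrs)
        # An attribute without a value is reported as None, hence the "or ''".
        classes = (attributes.get('class') or '').split()
        if 'BDE_Image' in classes and attributes.get('pic_type') == '0':
            src = attributes.get('src')
            if src:
                self.urls.append(src)


# Hypothetical markup: one post image and one unrelated image.
sample = (
    '<img pic_type="0" class="BDE_Image" src="http://imgsrc.baidu.com/forum/a.jpg">'
    '<img class="signature" src="http://example.com/sig.png">'
)
parser = PostImageParser()
parser.feed(sample)
print(parser.urls)
```

Only the first image passes both checks, which mirrors how the find_all call skips signatures and other page decoration.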

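The remaining statements save each image to disk, creating the save path if it does not already exist. The function's first argument is the URL of the current page of the thread, and its second argument is that page's number, which is used to give each page its own folder.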
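
The save step can be sketched in isolation as follows. The save_image helper and its parameters are hypothetical names introduced here, and a temporary directory stands in for D:/tmp/tieba so the example runs on any system. It assumes os.makedirs with exist_ok=True is an acceptable way to create the folder: that call also creates missing parent folders and does nothing when the folder is already there.

```python
import os
import tempfile


def save_image(data, root, page, index):
    """Write data to <root>/<page>/<index>.jpg, creating folders as needed."""
    folder = os.path.join(root, str(page))
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, '%d.jpg' % index)
    with open(path, 'wb') as f:
        f.write(data)
    return path


# Demonstration with placeholder bytes in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    saved = save_image(b'not a real image', root, page=1, index=0)
    print(os.path.exists(saved))
```

Building the path with os.path.join avoids hard-coding a separator, which matters if the script is later run outside Windows.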
The complete program is as follows.

import os
import re

import requests
from bs4 import BeautifulSoup


def get_image(url, page):
    # Download every post image on one page into D:/tmp/tieba/<page>/.
    html = requests.get(url, headers=header).content.decode('UTF-8')
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(html, 'html.parser')
    pics = soup.find_all("img", {"class": "BDE_Image", "pic_type": "0"})
    save_path = 'D:/tmp/tieba/%s/' % page
    # Create the folder, and any missing parent folders, if it does not exist yet.
    os.makedirs(save_path, exist_ok=True)
    for i, pic in enumerate(pics):
        img_url = pic['src']
        img = requests.get(img_url).content
        with open(save_path + '%s.jpg' % i, 'wb') as f:
            f.write(img)
        print(img_url + ' saved')


# Request headers copied from a real browser session.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive'
}
tieba_url = 'http://tieba.baidu.com/p/2491624899'
response = requests.get(tieba_url, headers=header).content.decode('UTF-8')
pattern = re.compile(r'共<span class="red">([0-9]*)</span>页')
max_page = int(pattern.findall(response)[0])
if max_page == 1:
    get_image(tieba_url, 1)
else:
    # range() excludes its end value, so add 1 to include the last page.
    for i in range(1, max_page + 1):
        page_url = tieba_url + '?pn=' + str(i)
        get_image(page_url, i)
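
The pagination step can also be pulled out into a small generator, sketched below. The page_urls name is hypothetical; the ?pn= query parameter is the one used in the program above. Because range excludes its upper bound, the end value has to be max_page + 1 for the last page to be visited.

```python
def page_urls(base_url, max_page):
    """Yield (page_number, url) pairs for pages 1 through max_page inclusive."""
    for page in range(1, max_page + 1):
        yield page, '%s?pn=%d' % (base_url, page)


for page, url in page_urls('http://tieba.baidu.com/p/2491624899', 3):
    print(page, url)
```

Each pair can be passed straight to get_image(url, page), which would also cover the single-page case without a separate branch, assuming the thread accepts ?pn=1 for its first page.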
