A Practical Python Tool for Batch-Downloading Images from Websites

2016-10-22 16:44

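This article is for readers who are comfortable with Python and have a soft spot for high-resolution pictures on the web. After reading it, you will know how to use Python libraries to fetch web pages and download image resources in batches and concurrently. If you know how to install Python libraries and run a Python program, you can use the program given here to batch-download the pictures you want.

While surfing the web, some "little waves" are always a delight. Yes, the little waves are beautiful pictures. Downloading them while browsing works, but good flowers do not bloom forever and good scenery does not last, so you want to save them conveniently, and clicking "Save As" one picture at a time is tedious. Can they be downloaded in bulk?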

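Goal

太平洋摄影网 (dp.pconline.com.cn) is a nice photography website. If you like natural scenery, go and feast your eyes there. After browsing for a while, you may want to pack it all up and take it home. That is not hard, so let us follow the trail and give it a try (I was too lazy to take screenshots; open the site and look for yourself).

First, open http://dp.pconline.com.cn/list/all_t145.html ; a large number of lovely thumbnails appears right away.

Click any of the links and you land on the page of the first picture in a series: http://dp.pconline.com.cn/photo/3687487.html . Click again to reach the page of the second picture: http://dp.pconline.com.cn/photo/3687487_2.html . Below the picture, clicking "查看原图" (View original) jumps to http://dp.pconline.com.cn/public/photo/source_photo.jsp?id=19706865&photoId=3687487 , which shows a gorgeous high-resolution picture. Right-click and choose Save As to store it locally.

Perhaps you are already itching: would it not be lovely if a single command line could collect all these pictures!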

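Approach

Where to start? To automate a task with a program, you have to find the pattern in it. Patterns, YES!

Anyone who has done web development knows that the browser's developer tools show a page's HTML, and that this HTML either contains the images or links to another HTML page that contains them. In the case above, http://dp.pconline.com.cn/list/all_t145.html is the entry page of a top-level theme (for example, nature is t145 and architecture is t292); call it EntryHtml. The entry page contains many links to child HTML pages, each of which is a series shot by a photographer with a particular style under that theme; call these SerialHtml. Each SerialHtml leads to the first page of every picture in the series, called picHtml. A picHtml contains a "查看原图" (View original) link that points to the page holding the high-resolution picture, http://dp.pconline.com.cn/public/photo/source_photo.jsp?id=19706865&photoId=3687487 ; call it picOriginLink. Finally, the img element inside picOriginLink gives the real address of the high-resolution picture, picOrigin. (⊙v⊙) Hmm, that is a little dizzying, so let us summarize:

EntryHtml (theme entry page) -> SerialHtml (series entry page) -> picHtml (picture browsing page within a series) -> picOriginLink (high-resolution picture page) -> picOrigin (real address of the high-resolution picture)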

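Now we need to work out how these five levels are linked.

Inspecting the HTML elements shows:

(1) The SerialHtml links are the a elements with class="picLink" in the EntryHtml page;

(2) The picHtml URLs are the SerialHtml URL with a sequence number appended. For example, if SerialHtml is http://dp.pconline.com.cn/photo/3687487.html and the series has 8 pictures, then picHtml = http://dp.pconline.com.cn/photo/3687487_[1-8].html . Note that http://dp.pconline.com.cn/photo/3687487.html and http://dp.pconline.com.cn/photo/3687487_1.html are equivalent, which makes the programming easier;

(3) "查看原图" (View original) is a link to the page xxx.jsp that holds the high-resolution picture: it is the a element with class="aView aViewHD" in the picHtml page;

(4) Finally, in the xxx.jsp page, find the img element whose src ends with an image file extension.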
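Rule (2) can be tried without touching the network. The following is a minimal sketch rather than the author's code: the helper name build_pic_page_urls is hypothetical, and it assumes the series URL ends in .html as in the example above.

```python
def build_pic_page_urls(serial_url, total):
    """Return the picHtml URLs for a series that has `total` pictures.

    Hypothetical helper; assumes serial_url looks like http://host/photo/<id>.html
    """
    if not serial_url.endswith('.html'):
        raise ValueError('expected a URL ending in .html: %r' % serial_url)
    base = serial_url[:-len('.html')]
    # <id>.html is equivalent to <id>_1.html, so numbering can simply start at 1.
    return ['%s_%d.html' % (base, i) for i in range(1, total + 1)]


urls = build_pic_page_urls('http://dp.pconline.com.cn/photo/3687487.html', 8)
print(urls[0])
print(urls[-1])
```

Because the first page has an equivalent _1 form, the loop needs no special case for the first picture.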

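So the overall plan is:

STEP1: Fetch the content of EntryHtml, entryContent;

STEP2: Parse entryContent and find the list of a elements with class="picLink", SerialHtmlList;

STEP3: For each page SerialHtml_i in SerialHtmlList:

(1) Fetch the page of its first picture and parse out the total number of pictures, total;

(2) From total, generate the total picture page links, picHtmlList;

a. For each page in picHtmlList, find the a element with class="aView aViewHD", hdLink;

b. Fetch the page that hdLink points to and find the img element to get the real picture address, picOrigin;

c. Download picOrigin.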
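The link-finding parts of STEP2 and STEP3 (2) a. come down to "collect the href of every a element with a given class". The sketch below shows that idea with Python 3's standard-library html.parser instead of BeautifulSoup, so that it runs without third-party packages. The class and function names are hypothetical, and the HTML fragments are made up to imitate the structure described above; they are not copied from the site.

```python
from html.parser import HTMLParser


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> elements whose class attribute equals `wanted`."""

    def __init__(self, wanted):
        HTMLParser.__init__(self)
        self.wanted = wanted
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        if attrs.get('class') == self.wanted and attrs.get('href'):
            self.hrefs.append(attrs['href'])


def find_links_by_class(html, wanted):
    collector = ClassLinkCollector(wanted)
    collector.feed(html)
    return collector.hrefs


# Made-up fragments that only imitate the page structure (assumption).
entry_html = '''
<ul>
  <li><a class="picLink" href="http://dp.pconline.com.cn/photo/3687487.html">series A</a></li>
  <li><a class="other" href="http://dp.pconline.com.cn/about.html">about</a></li>
</ul>
'''
pic_html = ('<a class="aView aViewHD" '
            'href="/public/photo/source_photo.jsp?id=19706865&amp;photoId=3687487">HD</a>')

serial_links = find_links_by_class(entry_html, 'picLink')
hd_links = find_links_by_class(pic_html, 'aView aViewHD')
print(serial_links)
print(hd_links)
```

The second result is a site-relative path, so a real crawler still has to prepend the server domain before fetching it.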

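Note that a theme spans several list pages. The first page is EntryHtml, http://dp.pconline.com.cn/list/all_t145.html , and the second page is http://dp.pconline.com.cn/list/all_t145_p2.html . The first page is equivalent to http://dp.pconline.com.cn/list/all_t145_p1.html , which makes the programming easier. To download the series on several pages of a theme, just add one more loop at the outermost level. That is the flow of the serial version.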
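The outermost loop over list pages only needs the page URLs. A small sketch, assuming the _p<n> pattern observed above holds for every page of a theme; the helper name entry_page_urls is hypothetical.

```python
def entry_page_urls(theme_id, pages):
    """List-page URLs of one theme, e.g. theme_id='t145' (hypothetical helper)."""
    template = 'http://dp.pconline.com.cn/list/all_%s_p%d.html'
    # Page 1 is equivalent to all_<theme>.html, so the loop can start at 1.
    return [template % (theme_id, page) for page in range(1, pages + 1)]


page_urls = entry_page_urls('t145', 3)
for url in page_urls:
    print(url)
```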

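Serial implementation

Main libraries:

(1) requests: fetches web page content;

(2) BeautifulSoup: walks the HTML document tree to get the nodes we need;

(3) multiprocessing.dummy: a thread-based counterpart of Python's multiprocessing library; it offers the multiprocessing API (for example Pool) but runs the work in threads.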
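To see what multiprocessing.dummy offers, here is a minimal sketch that needs no network: fake_fetch is a stand-in that merely sleeps, and the example.com URLs are placeholders, not part of the original tool.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_fetch(url):
    """Stand-in for a network request: sleep briefly, then return a fake body."""
    time.sleep(0.1)
    return 'content of ' + url


urls = ['http://example.com/page%d.html' % i for i in range(5)]  # placeholder URLs

pool = ThreadPool(5)
start = time.time()
contents = pool.map(fake_fetch, urls)  # blocks until done; results keep input order
elapsed = time.time() - start
pool.close()
pool.join()

print(contents[0])
print('elapsed: %.2fs' % elapsed)
```

With five worker threads the five simulated requests overlap, so the elapsed time should be close to one sleep rather than five. This is why a thread pool suits I/O-bound work such as fetching pages.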

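A few tricks:

(1) Use a decorator to catch exceptions in one place and print error details for troubleshooting;

(2) Split the logic into fine-grained pieces, which are easier to reuse, extend, and optimize;

(3) Use asynchronous calls to improve performance, and the map function to keep the code concise.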
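Tricks (1) and (3) combine well: a decorated function can be mapped over a list, and one bad item does not stop the rest. The sketch below illustrates the pattern. The names catch_exc and parse_total are hypothetical, and parse_total is only a toy step.

```python
import functools


def catch_exc(func):
    """Log any exception raised by func and return None instead of propagating it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            print('error in %s(args=%r, kwargs=%r): %s'
                  % (func.__name__, args, kwargs, exc))
            return None
    return wrapper


@catch_exc
def parse_total(text):
    """Toy step: turn a scraped picture count into an int."""
    return int(text)


totals = list(map(parse_total, ['8', 'n/a', '12']))
print(totals)
```

The failing item shows up as None in the result, so callers have to be ready to skip None values.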

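Runtime environment: Python 2.7. Install the third-party libraries requests and BeautifulSoup with easy_install or pip. The listing also uses the Image module from PIL and asks BeautifulSoup for the lxml parser, so those two need to be installed as well.

Implementation listing (named dwloadpics_killer.py in its own usage text; it already fetches pages and downloads pictures through thread pools):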

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import re
import sys
import json
from multiprocessing import (cpu_count, Pool)
from multiprocessing.dummy import Pool as ThreadPool

import argparse
import requests
from bs4 import BeautifulSoup
from PIL import Image  # Pillow does not support a bare "import Image"

ncpus = cpu_count()
saveDir = os.environ['HOME'] + '/joy/pic/test'
# only links whose server domain contains one of these keywords are followed
whitelist = ['pconline', 'zcool', 'huaban', 'taobao', 'voc']

DEFAULT_LOOPS = 1
DEFAULT_WIDTH = 800
DEFAULT_HEIGHT = 600

def isInWhiteList(url):
    for d in whitelist:
        if d in url:
            return True
    return False

def parseArgs():
    description = '''This program is used to batch download pictures from specified initial url.
    eg python dwloadpics_killer.py -u init_url
    '''
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument('-u', '--url', help='One initial url is required', required=True)
    # type=int and default= avoid calling int(None) when the option is omitted
    parser.add_argument('-l', '--loop', type=int, default=DEFAULT_LOOPS, help='download url depth')
    parser.add_argument('-s', '--size', nargs=2, type=int,
                        default=[DEFAULT_WIDTH, DEFAULT_HEIGHT],
                        help='minimum expected size of a picture: width height')
    args = parser.parse_args()
    return (args.url, args.loop, args.size)

def createDir(dirName):
    if not os.path.exists(dirName):
        os.makedirs(dirName)

def catchExc(func):
    '''
    decorator: print the error and return None instead of raising
    '''
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            # format args/kwargs directly; str(*args) and str(**kwargs) would raise themselves
            print("error catch exception for %s (%s, %s): %s" % (func.__name__, args, kwargs, e))
            return None
    return _deco

class IoTaskThreadPool(object):
    '''
    thread pool for io operations
    '''
    def __init__(self, poolsize):
        self.ioPool = ThreadPool(poolsize)

    def execTasks(self, ioFunc, ioParams):
        # blocking: returns the results in the order of ioParams
        if not ioParams or len(ioParams) == 0:
            return []
        return self.ioPool.map(ioFunc, ioParams)

    def execTasksAsync(self, ioFunc, ioParams):
        # non-blocking: call close() and join() later to wait for completion
        if not ioParams or len(ioParams) == 0:
            return []
        self.ioPool.map_async(ioFunc, ioParams)

    def close(self):
        self.ioPool.close()

    def join(self):
        self.ioPool.join()

class TaskProcessPool(object):
    '''
    process pool for cpu operations or task assignment
    '''
    def __init__(self):
        self.taskPool = Pool(processes=ncpus)

    def addDownloadTask(self, entryUrls):
        # NOTE: downloadAllForAPage is not defined in this listing, and this
        # class is not used by the main flow below.
        self.taskPool.map_async(downloadAllForAPage, entryUrls)

    def close(self):
        self.taskPool.close()

    def join(self):
        self.taskPool.join()

def getHTMLContentFromUrl(url):
    '''
    get html content from html url
    '''
    r = requests.get(url)
    status = r.status_code
    if status != 200:
        return ''
    return r.text

def batchGrapHtmlContents(urls):
    '''
    batch get the html contents of urls
    '''
    global grapHtmlPool
    return grapHtmlPool.execTasks(getHTMLContentFromUrl, urls)

def getAbsLink(link):
    '''
    turn the href of an a element into an absolute http link, or '' if unusable
    '''
    global serverDomain

    try:
        href = link.attrs['href']
        if href.startswith('//'):
            return 'http:' + href
        if href.startswith('/'):
            return serverDomain + href
        if href.startswith('http://'):
            return href
        return ''
    except Exception:
        return ''

def filterLink(link):
    '''
    only search for pictures in websites specified in the whitelist
    '''
    if link == '':
        return False
    if not link.startswith('http://'):
        return False
    serverDomain = parseServerDomain(link)
    if not isInWhiteList(serverDomain):
        return False
    return True

def filterImgLink(imgLink):
    '''
    The true image addresses always end with .jpg
    '''
    # always return a bool (the original returned None when the common filter failed)
    return filterLink(imgLink) and imgLink.endswith('.jpg')

def getTrueImgLink(imglink):
    '''
    get the true address of image link:
    (1) the image link is http://img.zcool.cn/community/01a07057d1c2a40000018c1b5b0ae6.jpg@900w_1l_2o_100sh.jpg
        but the better link is http://img.zcool.cn/community/01a07057d1c2a40000018c1b5b0ae6.jpg
        (removing what comes after @)
    (2) the image link is a relative path /path/to/xxx.jpg,
        then the true link is serverDomain/path/to/xxx.jpg where serverDomain is http://somedomain
    '''
    global serverDomain
    try:
        href = imglink.attrs['src']
        if href.startswith('//'):
            # protocol-relative link: add the scheme, not the server domain
            href = 'http:' + href
        elif href.startswith('/'):
            href = serverDomain + href
        pos = href.find('jpg@')
        if pos == -1:
            return href
        return href[0: pos+3]
    except Exception:
        return ''

def findAllLinks(htmlcontent, linktag):
    '''
    find html links or pic links from html by rule.
    '''
    soup = BeautifulSoup(htmlcontent, "lxml")
    if linktag == 'a':
        applylink = getAbsLink
    else:
        applylink = getTrueImgLink
    alinks = soup.find_all(linktag)
    # build real lists so that callers can take len() of the result
    allLinks = [applylink(link) for link in alinks]
    return [link for link in allLinks if link != '']

def findAllALinks(htmlcontent):
    return findAllLinks(htmlcontent, 'a')

def findAllImgLinks(htmlcontent):
    return findAllLinks(htmlcontent, 'img')

def flat(listOfList):
    return [val for sublist in listOfList for val in sublist]

@catchExc
def downloadPic(picsrc):
    '''
    download pic from pic href such as
    http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
    '''
    picname = picsrc.rsplit('/', 1)[1]
    saveFile = saveDir + '/' + picname

    picr = requests.get(picsrc, stream=True)
    with open(saveFile, 'wb') as f:
        # stream the body in 1 KB chunks; the with block closes the file
        for chunk in picr.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    return saveFile

@catchExc
def removeFileNotExpected(filename):
    global size

    # command-line values may arrive as strings, so compare as integers
    expectedWidth = int(size[0])
    expectedHeight = int(size[1])
    img = Image.open(filename)
    imgsize = img.size
    if imgsize[0] < expectedWidth or imgsize[1] < expectedHeight:
        os.remove(filename)

def downloadAndCheckPic(picsrc):
    saveFile = downloadPic(picsrc)
    # downloadPic returns None when the download failed
    if saveFile:
        removeFileNotExpected(saveFile)

def batchDownloadPics(imgAddresses):
    global dwPicPool
    dwPicPool.execTasksAsync(downloadAndCheckPic, imgAddresses)

def downloadFromUrls(urls, loops):
    htmlcontents = batchGrapHtmlContents(urls)
    allALinks = flat([findAllALinks(content) for content in htmlcontents])
    allALinks = [link for link in allALinks if filterLink(link)]
    if loops == 1:
        # last level: collect and download the pictures found on these pages
        allImgLinks = flat([findAllImgLinks(content) for content in htmlcontents])
        validImgAddresses = [link for link in allImgLinks if filterImgLink(link)]
        batchDownloadPics(validImgAddresses)
    return allALinks

def startDownload(init_url, loops=3):
    '''
    if init_url -> mid_1 url -> mid_2 url -> true image address
    then loops = 3 ; default loops = 3
    '''
    urls = [init_url]
    # "while loops > 0" also terminates if loops is 0 or negative
    while loops > 0:
        urls = downloadFromUrls(urls, loops)
        loops -= 1

def divideNParts(total, N):
    '''
    divide [0, total) into N parts:
    return [(0, total/N), (total/N, 2*total/N), ..., ((N-1)*total/N, total)]
    '''
    each = total // N   # integer division
    parts = []
    for index in range(N):
        begin = index*each
        if index == N-1:
            # the last part absorbs the remainder
            end = total
        else:
            end = begin + each
        parts.append((begin, end))
    return parts

def parseServerDomain(url):
    # 'http://host/path' -> ['http:', '', 'host', 'path'] -> 'http://host'
    parts = url.split('/', 3)
    return parts[0] + '//' + parts[2]

if __name__ == '__main__':

    (init_url, loops, size) = parseArgs()
    serverDomain = parseServerDomain(init_url)

    createDir(saveDir)

    grapHtmlPool = IoTaskThreadPool(10)
    dwPicPool = IoTaskThreadPool(10)

    startDownload(init_url, loops)
    # downloads were submitted asynchronously; wait for them to finish
    dwPicPool.close()
    dwPicPool.join()
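Two URL clean-up rules from the listing can be tried in isolation: resolving a relative image path against the server domain, and dropping the resize suffix after "jpg@". The sketch below is written for Python 3 and uses the standard urllib.parse module instead of the listing's manual string splitting. The function names and the page URL are assumptions for illustration; the image URL is the one quoted in the getTrueImgLink docstring.

```python
from urllib.parse import urljoin, urlsplit


def server_domain(url):
    """Scheme plus host of a URL, e.g. 'http://img.zcool.cn'."""
    parts = urlsplit(url)
    return parts.scheme + '://' + parts.netloc


def true_img_link(page_url, src):
    """Resolve src against page_url, then drop anything after 'jpg@'."""
    href = urljoin(page_url, src)
    pos = href.find('jpg@')
    return href if pos == -1 else href[:pos + 3]


page = 'http://www.zcool.com.cn/work/some-page.html'  # made-up page URL (assumption)
resized = ('http://img.zcool.cn/community/'
           '01a07057d1c2a40000018c1b5b0ae6.jpg@900w_1l_2o_100sh.jpg')

print(server_domain(page))
print(true_img_link(page, resized))
print(true_img_link(page, '/path/to/xxx.jpg'))
```

urljoin also handles protocol-relative links such as //host/x.jpg, which the manual approach has to treat as a separate case.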


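Summary

Implementing a batch image downloader for one specific site, and then turning the serial version into a concurrent, more general one, taught me the following:

(1) Generalize basic components such as the thread pool, the process pool, and task assignment; later programs then take less effort and do not repeat the same code again and again;

(2) A more general and extensible program needs finer-grained, more reusable, single-purpose micro-operations;

(3) You need to separate what varies from what stays fixed, and stay alert to what might vary and how to accommodate it;

(4) Find patterns, distill them into rules, and make the rules configurable through data structures; this makes the tool more general;

(5) Getting to the essence of the problem leads to a simpler and more effective design and implementation;

(6) In practice, the rules of image sites vary endlessly, and rules distilled from one site or a few sites may not work elsewhere. A more powerful, general downloader would require surveying how mainstream sites store and link their images, collecting the findings into rule sets, and building a rule-matching engine or an even smarter download tool on top of them. For everyday personal use, though, it is enough to download pictures from the sites you like and to gradually extend the set of rules for finding real image addresses ~~

This article is original; please credit the source when reposting. Thanks! :)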
