
Automatically downloading the photo albums of all your Renren friends with Python

2011-08-08 22:13
Author: Hua Liang (华亮)

Please credit the source when reposting: http://blog.csdn.net/cedricporter



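Yesterday afternoon I wrote some Python code that automatically grabs my own Renren albums; its only real use seemed to be backing up my own photos. So today I reworked it into a crawler aimed specifically at Renren and added automatic retrieval of all my friends: it visits each friend's page and downloads all of his or her albums (well suited to those of you with plenty of good-looking friends...).
Yesterday's post had so many tags that it came out too long, and, tragically, Tencent refused to accept my edits when I tried to fix it. XXXXX (ten thousand words omitted...)
Renren is a site that closely resembles Facebook. Why so similar? Chinese characteristics...
On to the topic. I am writing this down so that I do not forget it later.
First, some terminology.
What is a crawler?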
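According to Baidu Baike: "A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. ... A traditional crawler starts from the URLs of one or more seed pages, obtains the URLs found on those pages, and while crawling keeps extracting new URLs from the current page and putting them into a queue, until some stop condition of the system is met."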
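The queue-driven loop in that definition can be sketched as follows. This is only an illustration, written in Python 3 (the program in this post is Python 2); `crawl`, `get_links`, `max_pages` and the `FAKE_WEB` dictionary are hypothetical names, and the dictionary stands in for real HTTP fetches.

```python
from collections import deque

def crawl(start_urls, get_links, max_pages=100):
    """Breadth-first crawl: visit pages in queue order, enqueue unseen links."""
    queue = deque(start_urls)
    seen = set(start_urls)
    visited = []
    # Stop conditions: the queue runs dry or the page budget is used up.
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:        # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical in-memory "web": page -> links found on that page.
FAKE_WEB = {
    'a': ['b', 'c'],
    'b': ['c', 'd'],
    'c': ['a'],
    'd': [],
}

print(crawl(['a'], lambda url: FAKE_WEB.get(url, [])))   # ['a', 'b', 'c', 'd']
```

In a real crawler `get_links` would download the page and extract its URLs; keeping it as a parameter makes the loop easy to try without touching the network.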
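I specialized mine for Renren (in other words, it is useless on any other site). Renren requires an account, so you have to log in first; after that the server uses a session or cookie to recognize that you are logged in on other pages, and for Renren handling the cookie is enough. Of course, after logging in with one browser you will probably still have to log in again in another, because browsers normally do not share cookies unless one deliberately reads the other's cookie store. So for a crawler to crawl Renren, it first has to log in, and then keep the cookie.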
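A minimal sketch of "log in, then keep the cookie", written in Python 3 with only the standard library (the program in this post uses the Python 2 `urllib2` equivalents). The login URL and the `email`/`password` field names are taken from the library code in this post; `make_session` and `build_login_request` are hypothetical helper names, and the credentials are placeholders. Nothing is sent over the network here.

```python
import http.cookiejar
import urllib.parse
import urllib.request

def make_session():
    """Return (opener, jar); the opener stores and resends cookies automatically."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def build_login_request(login_url, email, password):
    """Build a POST request carrying the login form; it is not sent yet."""
    form = urllib.parse.urlencode({'email': email, 'password': password})
    return urllib.request.Request(login_url, data=form.encode('utf-8'))

opener, jar = make_session()
request = build_login_request('http://www.renren.com/PLogin.do',
                              'user@example.com', 'secret')
print(request.get_method())   # POST, because the request carries a body

# opener.open(request) would perform the login and fill `jar` with the
# session cookies; every later opener.open(...) call would send them back.
```

Reusing the same opener for every request is what keeps the crawler "logged in", in the same way a single browser keeps its own cookies.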
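Browsers and servers talk mainly over HTTP, and the main methods are GET and POST (according to Computer Systems: A Programmer's Perspective, GET accounts for 99% of HTTP requests). GET is mostly used to send short data to the server, with the parameters written into the URL, while POST can send longer data; publishing this post, for example, uses POST. We can run "telnet www.google.com 80", type "GET /" and press Enter, and receive the same content as when we type "http://www.google.com/" into a browser. A crawler works the same way: it just keeps issuing GETs and POSTs...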
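To make the difference concrete, here is a small Python 3 sketch that builds the raw bytes of a GET and a POST request, roughly what one would type into telnet. `build_get` and `build_post` are hypothetical helpers, `example.com` and `/post` are placeholders, and the sketch uses HTTP/1.1 with a `Host` header rather than the bare "GET /" shown above.

```python
from urllib.parse import urlencode

def build_get(host, path, params=None):
    """Raw HTTP/1.1 GET: the parameters travel in the URL's query string."""
    target = path + ('?' + urlencode(params) if params else '')
    text = 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' % (target, host)
    return text.encode('ascii')

def build_post(host, path, params):
    """Raw HTTP/1.1 POST: the parameters travel in the request body."""
    body = urlencode(params).encode('ascii')
    head = ('POST %s HTTP/1.1\r\nHost: %s\r\n'
            'Content-Type: application/x-www-form-urlencoded\r\n'
            'Content-Length: %d\r\nConnection: close\r\n\r\n' % (path, host, len(body)))
    return head.encode('ascii') + body

print(build_get('www.google.com', '/').decode('ascii'))
print(build_post('example.com', '/post', {'content': 'hi there'}).decode('ascii'))
```

The GET request has no body at all, while the POST request announces the size of its body with `Content-Length` and appends the encoded form after the blank line.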
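There are two ways to grab every visible album of every friend: download them by hand, friend by friend and album by album, or hand the job to the computer and let it crawl. I am lazy, so I chose the second.
Time again for the "to do X, first do Y" pattern~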
To get all friends, visit http://friend.renren.com/myfriendlistx.do while logged in. In a browser, JavaScript splits the friends across many pages for display. One of the scripts in the page has a variable named friends that holds the information for every friend; it is made up of records like {"id":254905709,"vip":false,"selected":true,"mo":true,"name":"\u5b89\u8feaAndy","head":"http:\/\/hdn.xnimg.cn\/photos\/hdn321\/20110612\/1600\/h_tiny_zFLc_715e000281932f76.jpg","groups":["\u534e\u5357\u7406\u5de5\u5927\u5b66"]}. From these records we can get the id of every friend.
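A sketch of pulling the ids and names out of that variable, in Python 3. Unlike the library code in this post, which uses a second regular expression on the records, this version hands the array to `json.loads`, which also decodes the `\uXXXX` and `\/` escapes. `parse_friends` is a hypothetical name, and the `<script>` wrapper around the sample record is an assumption about the page layout.

```python
import json
import re

def parse_friends(page_html):
    """Return [(id, name), ...] from the `var friends=[...];` script variable."""
    found = re.search(r'var friends=(\[.*?\]);', page_html, re.S)
    if found is None:
        return []
    return [(str(friend['id']), friend['name']) for friend in json.loads(found.group(1))]

# The record quoted in the text, wrapped in a made-up <script> element.
SAMPLE_PAGE = (
    r'<script>var friends=[{"id":254905709,"vip":false,"selected":true,"mo":true,'
    r'"name":"\u5b89\u8feaAndy",'
    r'"head":"http:\/\/hdn.xnimg.cn\/photos\/hdn321\/20110612\/1600\/h_tiny_zFLc_715e000281932f76.jpg",'
    r'"groups":["\u534e\u5357\u7406\u5de5\u5927\u5b66"]}];</script>'
)

friends = parse_friends(SAMPLE_PAGE)
print([friend_id for friend_id, _ in friends])   # ['254905709']
```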
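To get all of a person's albums, visit http://www.renren.com/profile.do?id=(the person's id)&v=photo_ajax&undefined. How did I find this? When you open someone's profile and click Albums, the page does not reload; AJAX just replaces part of it. It GETs that path, receives a fragment of HTML back, and swaps it in for the current content. So we can GET the same path ourselves and obtain a page containing all the album ids.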
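A sketch of this step in Python 3. The URL template and the regular expression are the ones used in the library code in this post; `albums_url`, `parse_albums` and the `SAMPLE_FRAGMENT` markup are hypothetical, the latter being only a guess at what the AJAX fragment looks like.

```python
import re

ALBUMS_URL = 'http://www.renren.com/profile.do?id=%s&v=photo_ajax&undefined'
ALBUM_LINK = re.compile(r'<a href="(.*?)" stats="album_album"><img.*?/>(.*?)</a>', re.S)

def albums_url(user_id):
    """URL of the AJAX fragment that lists one person's albums."""
    return ALBUMS_URL % user_id

def parse_albums(fragment):
    """Return [(album_name, album_url), ...] found in the fragment."""
    return [(name.strip(), url) for url, name in ALBUM_LINK.findall(fragment)]

# Made-up fragment with a single album link.
SAMPLE_FRAGMENT = ('<li><a href="http://photo.renren.com/photo/123/album-456" '
                   'stats="album_album"><img src="cover.jpg"/> Holiday </a></li>')

print(albums_url('123'))
print(parse_albums(SAMPLE_FRAGMENT))
```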
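To get all the photos in an album, we rely on a Renren bug that I found quite by accident: you can open the "reorder photos" page of someone else's album. That page lists every photo in the album, so a regular expression gives us the id of each photo. The reorder page is http://photo.renren.com/photo/(the person's id)/album-(album id)/reorder.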
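A sketch of extracting the photo ids, in Python 3. The URL shapes and the `<li pid="...">` pattern follow the library code in this post; `reorder_url`, `photo_page_urls` and the sample markup are hypothetical.

```python
import re

PID = re.compile(r'<li pid="(\d+)"')

def reorder_url(user_id, album_id):
    """URL of the page that lists every photo of one album."""
    return 'http://photo.renren.com/photo/%s/album-%s/reorder' % (user_id, album_id)

def photo_page_urls(user_id, reorder_html):
    """Return [(photo_id, photo_page_url), ...] for every photo on the reorder page."""
    return [(pid, 'http://photo.renren.com/photo/%s/photo-%s' % (user_id, pid))
            for pid in PID.findall(reorder_html)]

# Made-up reorder page containing two photos.
SAMPLE_REORDER = '<ul><li pid="111" class="x"><img/></li><li pid="222"><img/></li></ul>'

print(reorder_url('123', '456'))
print(photo_page_urls('123', SAMPLE_REORDER))
```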
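After three rounds of "to do X, first do Y", we have the ids of all friends, of all their albums, and of all the photos in those albums. Why is everything an id? My guess is that using an integer as the primary key of a database row performs better, and a 32-bit integer takes only 4 bytes yet can identify 4294967296 things. Ids are also convenient to pass between client and server.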
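The arithmetic can be checked in a few lines of Python 3. This only illustrates the size argument; whether Renren actually stores its ids as unsigned 32-bit integers is not known from this post.

```python
import struct

packed = struct.pack('<I', 254905709)    # the friend id from the text, as an unsigned 32-bit integer
print(len(packed))                        # 4 bytes
print(2 ** 32)                            # 4294967296 distinct values
print(struct.unpack('<I', packed)[0])     # 254905709 again after the round trip
```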
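What can we do with these ids? Nothing yet. If we visit http://photo.renren.com/photo/(the person's id)/photo-(photo id), the page source contains a piece of AJAX-returned code with the entry "largeurl":"http:\/\/fmn.rrimg.com\/fmn049\/20110621\/1520\/p_large_S5jA_37eb000165dc5c3f.jpg". That is the real address of the photo; delete the "\" characters and it can be downloaded.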
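A sketch of this last step in Python 3, using the same pattern as the library code in this post and the `largeurl` entry quoted above as sample input. `real_photo_url` is a hypothetical name.

```python
import re

LARGE_URL = re.compile(r'"largeurl":"(.*?)"')

def real_photo_url(photo_page_html):
    """Return the full-size image URL from a photo page, or None if it is missing."""
    found = LARGE_URL.search(photo_page_html)
    if found is None:
        return None
    # The URL is JSON-escaped ("\/"); dropping the backslashes restores it.
    return found.group(1).replace('\\', '')

SAMPLE = r'"largeurl":"http:\/\/fmn.rrimg.com\/fmn049\/20110621\/1520\/p_large_S5jA_37eb000165dc5c3f.jpg"'

url = real_photo_url(SAMPLE)
print(url)                   # http://fmn.rrimg.com/fmn049/20110621/1520/p_large_S5jA_37eb000165dc5c3f.jpg
print(url.split('/')[-1])    # p_large_S5jA_37eb000165dc5c3f.jpg, usable as the local file name
```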

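Good, so now we can write a crawler, incomplete as it is... For Renren's news feed, you could extract the URLs from a page, filter them, and put them into a priority queue; then pick the best one from the queue, visit it, and repeat until the queue is empty or some other condition is met... Ah, the legendary Chinese pseudocode...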
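That pseudocode could look roughly like this in Python 3. It is a sketch, not part of the program in this post: `best_first_crawl`, `score`, `keep` and the `FAKE_FEED` pages are all hypothetical, and what counts as the "best" URL is left to the caller.

```python
import heapq

def best_first_crawl(start_url, get_links, score, keep, max_pages=50):
    """Always visit the highest-scoring queued URL next."""
    heap = [(-score(start_url), start_url)]    # heapq is a min-heap, so store negated scores
    seen = {start_url}
    order = []
    while heap and len(order) < max_pages:     # stop: queue empty or page budget used up
        _, url = heapq.heappop(heap)
        order.append(url)
        for link in get_links(url):
            if link not in seen and keep(link):    # filter before queueing
                seen.add(link)
                heapq.heappush(heap, (-score(link), link))
    return order

# Made-up feed: page -> URLs found on it.
FAKE_FEED = {
    'home': ['photo/1', 'ad/9', 'status/2'],
    'photo/1': ['photo/3'],
    'status/2': [],
    'photo/3': [],
}

def prefer_photos(url):
    return 2 if url.startswith('photo') else 1

def not_an_ad(url):
    return not url.startswith('ad')

print(best_first_crawl('home', lambda u: FAKE_FEED.get(u, []), prefer_photos, not_an_ad))
# ['home', 'photo/1', 'photo/3', 'status/2']
```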

More code at: http://code.google.com/p/stupidet/

The program was tested successfully on Ubuntu 11.04 and Windows 7 x64. On Windows, open and run it with IDLE.

Main program:

# -*- coding: utf-8 -*-
# Filename: main.py
# Author: Hua Liang
#

from Renren import SuperRenren
import time

def main():
    renren = SuperRenren()
    # Replace the two placeholders with your own Renren account and password
    if renren.Create('your Renren account', 'your Renren password'):
        #renren.PostMsg(time.asctime())
        #renren.PostGroupMsg('387635422', '%s' % time.asctime())
        #renren.DownloadAlbum('333982368', 'sss')
        renren.DownloadAllFriendsAlbums(threadnumber = 1)

if __name__ == '__main__':
    main()


Renren library:

# -*- coding: utf-8 -*-
# Filename: Renren.py
# Author: Hua Liang
#

from HTMLParser import HTMLParser
from Queue import Empty
from Queue import Queue
from re import match
from sys import exit
from urllib import urlencode
import os
import re
import socket
import threading
import time
import urllib
import urllib2
import shelve

# Lock that serializes console output
GlobalPrintMutex = threading.Lock()
# Lock that serializes writes to config.cfg
GlobalWriteConfigMutex = threading.Lock()
# Lock that protects the shelve holding each user's update state
GlobalShelveMutex = threading.Lock()

# Path separator for the current platform
Delimiter = os.sep

ConfigFilename = 'config.cfg'           # ids of the photos already downloaded, one file per album
LastUpdatedFileName = 'lastupdated.cfg' # last update state for everyone
UpdateThreashold = 10 * 60              # update interval

# Print from several threads without interleaving the output
def MutexPrint(content):
    with GlobalPrintMutex:
        print content

def MutexWriteFile(file, content):
    with GlobalWriteConfigMutex:
        file.write(content)
        file.flush()

# Turn literal \uXXXX escapes into real characters and keep all other
# characters (the earlier reduce-based version dropped them, so
# "\u5b89\u8feaAndy" lost its "Andy")
def Str2Uni(text):
    if isinstance(text, str):
        text = text.decode('utf-8')
    pat = re.compile(r'\\u([0-9a-fA-F]{4})')
    return pat.sub(lambda m: unichr(int(m.group(1), 16)), text)

#------------------------------------------------------------------------------
# Worker thread that downloads files
class Downloader(threading.Thread):
    def __init__(self, urlQueue, failedQueue, file=None):
        threading.Thread.__init__(self)
        self.queue = urlQueue
        self.failedQueue = failedQueue
        self.file = file

    def run(self):
        while True:
            try:
                # get_nowait() raises Empty instead of blocking forever when
                # another thread has just taken the last item
                pid, url, filename = self.queue.get_nowait()
            except Empty:
                break
            try:
                target = filename.decode('utf-8')
                isfile = os.path.isfile(target)
                MutexPrint(("\tDownloading %s" if not isfile else "\tExists %s") % target)
                if not isfile: urllib.urlretrieve(url, target)
                MutexWriteFile(self.file, pid + '\r\n')
            except Exception, e:
                # Record the failure and carry on with the next photo
                self.failedQueue.put(pid)
                MutexPrint('\tError occurred when downloading photo which id = %s' % pid)
                MutexPrint(e)

#------------------------------------------------------------------------------
# Parser for the Renren album list
class RenrenAlbums(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # Per-instance state; a class-level list would be shared by every parser
        self.in_key_div = False
        self.in_ul = False
        self.in_li = False
        self.in_a = False
        self.albumsUrl = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and 'class' in attrs and attrs['class'] == 'big-album album-list clearfix':
            self.in_key_div = True
        elif self.in_key_div:
            if tag == 'ul':
                self.in_ul = True
            elif self.in_ul and tag == 'li':
                self.in_li = True
        if self.in_li and tag == 'a' and 'href' in attrs:
            self.in_a = True
            self.albumsUrl.append(attrs['href'])

    def handle_data(self, data):
        pass

    def handle_endtag(self, tag):
        if self.in_key_div and tag == 'div':
            self.in_key_div = False
        elif self.in_ul and tag == 'ul':
            self.in_ul = False
        elif self.in_li and tag == 'li':
            self.in_li = False
        elif self.in_a and tag == 'a':
            self.in_a = False

class RenrenRequester:
    '''
    Renren requester: logs in and sends requests with the login cookie
    '''
    LoginUrl = 'http://www.renren.com/PLogin.do'
    __isLogin = False   # lets Request() be called safely before a successful Create()

    # Log in with a username and password
    def Create(self, username, password):
        loginData = {'email':username,
                'password':password,
                'origURL':'',
                'formName':'',
                'method':'',
                'isplogin':'true',
                'submit':'登录'}   # label of the login button, sent to the server as-is
        postData = urlencode(loginData)
        cookieFile = urllib2.HTTPCookieProcessor()
        self.opener = urllib2.build_opener(cookieFile)
        req = urllib2.Request(self.LoginUrl, postData)
        result = self.opener.open(req)
        # A successful login ends up on one of these two pages. (The earlier
        # test `x == a or b` was always true and never detected a failed login.)
        if result.geturl() not in ('http://www.renren.com/home', 'http://guide.renren.com/guide'):
            return False

        rawHtml = result.read()
        # Extract the user id
        useridPattern = re.compile(r'user : {"id" : (\d+?)}')
        useridMatch = useridPattern.search(rawHtml)
        if useridMatch is None: return False
        self.userid = useridMatch.group(1)

        # Find the requestToken, which may be negative
        pos = rawHtml.find("get_check:'")
        if pos == -1: return False
        rawHtml = rawHtml[pos + 11:]
        token = match(r'-?\d+', rawHtml)
        if token is None: return False
        self.requestToken = token.group()
        self.__isLogin = True
        return self.__isLogin

    def GetRequestToken(self):
        return self.requestToken

    def GetUserId(self):
        return self.userid

    # GET the url, or POST to it when data is given; returns None if not logged in
    def Request(self, url, data = None):
        if self.__isLogin:
            if data:
                encodeData = urlencode(data)
                request = urllib2.Request(url, encodeData)
            else:
                request = urllib2.Request(url)
            result = self.opener.open(request)
            return result
        else:
            return None

class RenrenPostMsg:
    '''
    RenrenPostMsg
    Posts a Renren status update
    '''
    newStatusUrl = 'http://status.renren.com/doing/updateNew.do'

    def Handle(self, requester, param):
        requestToken, msg = param

        statusData = {'content':msg,
                'isAtHome':'1',
                'requestToken':requestToken}

        # Request() url-encodes the dict itself
        requester.Request(self.newStatusUrl, statusData)

        return True

class RenrenPostGroupMsg:
    '''
    RenrenPostGroupMsg
    Posts a status update to a Renren group
    '''
    newGroupStatusUrl = 'http://qun.renren.com/qun/ugc/create/status'

    def Handle(self, requester, param):
        requestToken, groupId, msg = param
        statusData = {'minigroupId':groupId,
                'content':msg,
                'requestToken':requestToken}
        requester.Request(self.newGroupStatusUrl, statusData)

class RenrenFriendList:
    '''
    RenrenFriendList
    Renren friend list
    '''
    def Handler(self, requester, param):
        friendUrl = 'http://friend.renren.com/myfriendlistx.do'
        rawHtml = requester.Request(friendUrl).read()

        # The page keeps every friend in a JavaScript variable named friends
        friendInfoPack = re.search(r'var friends=\[(.*?)\];', rawHtml).group(1)
        friendIdPattern = re.compile(r'"id":(\d+).*?"name":"(.*?)"')
        friendIdList = []
        for id, name in friendIdPattern.findall(friendInfoPack):
            friendIdList.append((id, Str2Uni(name)))

        return friendIdList

class RenrenAlbumDownloader:
    '''
    AlbumDownloader
    Album downloader; records the ids of downloaded photos in config.cfg so they are not downloaded again
    '''
    threadNumber = 10    # number of download threads

    def Handler(self, requester, param):
        self.requester = requester
        userid, path = param
        self.__DownloadOneAlbum(userid, path)

    # Parse the person's name out of the HTML
    def __GetPeopleNameFromHtml(self, rawHtml):
        peopleNamePattern = re.compile(r'<h2>(.*?)<span>')
        peopleName = peopleNamePattern.search(rawHtml).group(1).strip()
        return peopleName

    def __GetAlbumsNameFromHtml(self, rawHtml):
        albumUrlPattern = re.compile(r'<a href="(.*?)" stats="album_album"><img.*?/>(.*?)</a>')
        albums = []
        # Point each album URL at its reorder page, which lists the ids of all photos in the album
        for album_url, album_name in albumUrlPattern.findall(rawHtml):
            albums.append((album_name.strip(), album_url + '/reorder'))
        return albums

    def __GetAlbumPhotos(self, userid, albumUrl):
        # Regular expression for photo ids
        pidPattern = re.compile(r'<li pid="(\d+)".*?>.*?</li>', re.S)
        # Fetch the reorder page, which lists every photo in the album
        result = self.requester.Request(albumUrl)
        rawHtml = result.read()
        photohtmlurl = []   # (photo id, photo page URL) pairs
        for pid in pidPattern.findall(rawHtml):
            photohtmlurl.append((pid, 'http://photo.renren.com/photo/%s/photo-%s' % (userid, pid)))

        return photohtmlurl

    def __GetRealPhotoUrls(self, photohtmlurl):
        # Visit each photo page and extract the real image URL
        imgPattern = re.compile(r'"largeurl":"(.*?)"')
        imgUrl = [] # (photo id, real image URL) pairs
        for pid, url in photohtmlurl:
            result = self.requester.Request(url)
            rawHtml = result.read()
            for img in imgPattern.findall(rawHtml):
                # Strip the escaping backslashes; only the first match is needed
                imgUrl.append((pid, img.replace('\\', '')))
                break

        return imgUrl

    def __DownloadAlbum(self, savepath, album_name, imgUrl, file):
        # Queue every photo of the album for download
        queue = Queue()
        failedQueue = Queue()
        for pid, url in imgUrl:
            imgname = url.split('/')[-1]
            queue.put((pid, url, savepath + Delimiter + imgname))
        # Start the download threads
        threads = []
        for i in range(self.threadNumber):
            downloader = Downloader(queue, failedQueue, file)
            threads.append(downloader)
            downloader.start()
        # Wait for all threads to finish
        for t in threads:
            t.join()
        # Return the queue of photo ids that failed to download
        return failedQueue

    # Download all albums of one person
    def __DownloadOneAlbum(self, userid, path='albums'):
        #if not self.__isLogin: return
        if not os.path.exists(path.decode('utf-8')): os.mkdir(path.decode('utf-8'))

        albumsUrl = 'http://www.renren.com/profile.do?id=%s&v=photo_ajax&undefined' % userid

        try:
            # Fetch the album names and URLs
            result = self.requester.Request(albumsUrl)
            rawHtml = result.read()
            # Person's name
            peopleName = self.__GetPeopleNameFromHtml(rawHtml).strip()
            albums = self.__GetAlbumsNameFromHtml(rawHtml)

            # Create a folder named after the person
            path += Delimiter + peopleName
            if not os.path.exists(path.decode('utf-8')): os.mkdir(path.decode('utf-8'))

            # Start downloading the albums
            MutexPrint('Enter %s' % peopleName.decode('utf-8'))
            for album_name, albumUrl in albums:
                MutexPrint('Downloading Album: %s' % album_name.decode('utf-8'))
                # List of (photo id, photo page URL) for this album
                photohtmlurl = self.__GetAlbumPhotos(userid, albumUrl)

                # Create a folder named after the album
                album_name = album_name.replace('\\', '')  # drop path separators
                album_name = album_name.replace('/', '')
                savepath = path + Delimiter + album_name
                if not os.path.exists(savepath.decode('utf-8')): os.mkdir(savepath.decode('utf-8'))

                newDownloadIdSet = set()
                finishedIdSet = set()
                totalIdSet = set()
                for pid, url in photohtmlurl:
                    totalIdSet.add(pid)

                configFile = savepath + Delimiter + ConfigFilename
                if os.path.isfile(configFile.decode('utf-8')):
                    # Read the ids of finished photos so their pages are not fetched again
                    file = open(configFile.decode('utf-8'), 'r')
                    photoIdMap = []
                    for line in file.readlines():
                        pid = line.strip()
                        photoIdMap.append(pid)
                    file.close()
                    finishedIdSet = set(photoIdMap)

                newDownloadIdSet = totalIdSet - finishedIdSet

                newDownloadPhotoHtmlUrl = ((pid, url) for pid, url in photohtmlurl if pid in newDownloadIdSet)

                imgUrl = self.__GetRealPhotoUrls(newDownloadPhotoHtmlUrl)
                #imgUrl.sort()
                #imgUrl = list(set(imgUrl))

                #for id, url in imgUrl:
                #    print id, url

                # Download the photos
                failedQueue = Queue()   # defined up front so the finally block can always use it
                file = None
                try:
                    file = open(configFile.decode('utf-8'), 'w')
                    for id in finishedIdSet:
                        file.write(id + '\r\n')
                    file.flush()

                    failedQueue = self.__DownloadAlbum(savepath, album_name, imgUrl, file)

                except Exception, e:
                    print 'Error when downloading.', e
                finally:
                    # Drop the ids of the photos that failed to download
                    while not failedQueue.empty():
                        totalIdSet.discard(failedQueue.get())
                    if file is not None: file.close()
        except AttributeError, e:
            # Let it propagate; FriendDownloader treats it as a probable captcha page
            raise
        except Exception, e:
            print 'Error! Please contact QQ: 414112390'
            print e

class AutoRenrenDownloader:
    '''
    AutoRenrenDownloader
    Downloads the albums of all friends and can resume: if one run does not finish, the next run continues where it stopped
    '''
    def handler(self, requester, param):
        self.requester = requester
        path, threadnumber = param
        self.__DownloadFriendsAlbums(path, threadnumber)

    #------------------------------------------------------------------------------
    # Worker thread that downloads the albums of friends
    class FriendDownloader(threading.Thread):
        def __init__(self, requester, queue, file):
            threading.Thread.__init__(self)
            self.file = file
            self.requester = requester
            self.queue = queue

        def run(self):
            id = None
            try:
                while True:
                    # get_nowait() raises Empty instead of blocking forever when
                    # another thread has just taken the last friend
                    id, path = self.queue.get_nowait()
                    downloader = RenrenAlbumDownloader()
                    downloader.Handler(self.requester, (id, path))
                    with GlobalShelveMutex:
                        self.file['TaskList'].remove(id)
            except Empty:
                pass
            except AttributeError, e:
                print "Renren may have decided that you visited 100 friends; open any friend's profile on Renren and enter the captcha"
                #print e
            except ValueError, e:
                print id
                print e

    def __DownloadFriendsAlbums(self, path='albums', threadnumber=10):
        if not os.path.exists(path.decode('utf-8')): os.mkdir(path.decode('utf-8'))

        friendsList = RenrenFriendList().Handler(self.requester, None)
        friendNames = dict(friendsList)   # id -> name, built once

        db = shelve.open(LastUpdatedFileName, writeback = True)
        if not db.has_key('TaskList'): db['TaskList'] = []
        # An empty task list means nothing is pending: start over with all friends
        if len(db['TaskList']) == 0:
            db['TaskList'] = [id for id, realName in friendsList]

        updateList = db['TaskList']

        i = 1
        print "Friends to update in this run:"
        # Queue every friend that still needs downloading
        queue = Queue()
        for id in updateList:
            print "%s:\t%s\t" % (i, id),
            # .get() avoids a KeyError for a pending id that is no longer a friend
            print friendNames.get(id, '')
            i += 1
            queue.put((id, path))

        # Start the friend downloader threads
        DownloadersList = []
        try:
            for i in range(threadnumber):
                friendDownloader = self.FriendDownloader(self.requester, queue, db)
                friendDownloader.start()
                DownloadersList.append(friendDownloader)
            for downloader in DownloadersList:
                downloader.join()
        except Exception, e:
            print '-' * 100 + "\nPlease Goto Renren.com\n" + '-' * 100
            print e
        finally:
            db.close()

class SuperRenren:
    '''
    SuperRenren
    Renren controller
    '''
    # Create a session by logging in
    def Create(self, username, password):
        self.requester = RenrenRequester()
        if self.requester.Create(username, password):
            self.userid = self.requester.userid
            self.requestToken = self.requester.requestToken
            return True
        return False
    # Post a personal status update
    def PostMsg(self, msg):
        poster = RenrenPostMsg()
        poster.Handle(self.requester, (self.requestToken, msg))
    # Post a group status update
    def PostGroupMsg(self, groupId, msg):
        poster = RenrenPostGroupMsg()
        poster.Handle(self.requester, (self.requestToken, groupId, msg))
    # Download one person's albums
    def DownloadAlbum(self, userId, path = 'albums'):
        downloader = RenrenAlbumDownloader()
        downloader.Handler(self.requester, (userId, path))
    # Automatically download the albums of all friends
    def DownloadAllFriendsAlbums(self, path = 'albums', threadnumber = 10):
        downloader = AutoRenrenDownloader()
        downloader.handler(self.requester, (path, threadnumber))