
Writing a Simple Python Crawler to Grab Animated Images from Neihan Duanzi (内涵段子)

2015-04-23 21:39
A while back, while browsing Zhihu, I came across a thread about writing crawlers in Python. Here is the link:
www.zhihu.com/question/20899988

That made me want to try crawling something with Python myself. The original plan was to crawl Baidu Images by keyword and download the results, but I ran into obstacles and shelved it for now. I then looked at Neihan Duanzi, found its page structure a bit simpler, and implemented an image-downloading crawler for it instead.

While writing this program I referred to the related blog posts of an author mentioned in that Zhihu thread: blog.csdn.net/pleasecallmewhy/article/details/8929576

Writing the program breaks down into the following steps:

1. Analyze the page structure of the Neihan Duanzi community

2. Use regular expressions to find the URLs to download

3. Download those images

The first step is also the most critical one: if the page analysis is wrong, there is no way to start on the later steps.

1. Open the funny-pictures page of Neihan Duanzi, http://neihanshequ.com/pic/
You will see the page below.

This page has the funny images we want, but the first thing we need is the page's HTML. For that I use Python's urllib2 library; the code is as follows:

import urllib2

def get_html(url):
    print "---------------now get html from url :" + url + "----------"

    send_headers = {
        'Host': 'neihanshequ.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0',
        'Cookie': "pksrqup=1; csrftoken=237f4451075fe45cef3a4f5449f70658; tt_webid=3379513254; uuid=\"w:33266c46f0cc4fa6944c073b1b1bccea\"",
        'Connection': 'keep-alive'
    }

    req = urllib2.Request(url, headers=send_headers)
    try:
        response = urllib2.urlopen(req, timeout=100)
        html = response.read()
        return html
    except urllib2.HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except urllib2.URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        print 'No exception was raised.'


The headers to send can be captured with Firefox's Firebug extension: just copy the header information out and fill it into the headers above. The Cookie must be included, otherwise the HTML cannot be retrieved. For a detailed introduction to urllib2, see the blog linked above, which explains it very clearly.
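For example, fetching the landing page with this function looks like the quick usage sketch below:

# fetch the HTML of the funny-pictures page
html = get_html('http://neihanshequ.com/pic/')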

Now that the HTML has been fetched, let's take a look at the file. Its structure is fairly clear.

Each post is made up of one div, and the title, the image, and the comments each sit in a div of their own.

Inside the div with class="content-wrapper" we find an element carrying the attributes we need.

The data-text attribute is the caption of the picture and data-pic is the picture's address, so our job is clear: collect every data-pic and data-text on the page (the latter can later be used as the image's file name).

Parsing these two fields out of the HTML calls for Python regular expressions. The ones used here are very simple and were written by imitation; a proper tutorial on re can also be found on the blog linked above.

Below is my regex parsing code.
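As a minimal sketch of that extraction: the data-pic pattern below is the same one used in the full source further down, while the data-text pattern is an assumed companion of the same form.

import re

# html is the string returned by get_html above
pic_re  = re.compile(r'data-pic="([^"]*)"')   # same pattern as __getDataPic in the full source
text_re = re.compile(r'data-text="([^"]*)"')  # assumed pattern for the caption

pic_urls = pic_re.findall(html)   # list of image URLs
captions = text_re.findall(html)  # list of captions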

With that, all the image addresses can be parsed out of the HTML we just fetched, and then they can be downloaded. The download uses functions from urllib.
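A minimal download sketch using urllib.urlretrieve, the same call the full source relies on (naming the files by list index is just an illustrative choice):

import urllib

# save each image as 0.jpg, 1.jpg, ... in the current directory
for i, pic_url in enumerate(pic_urls):
    urllib.urlretrieve(pic_url, str(i) + ".jpg")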

----------------- Up to this point you can already download a few dozen images.

Why only a few dozen?

Because what we just fetched is only the HTML of the front page. So how do we get more of it?

Notice that at the bottom of the page there is a "load more" button; clicking it loads more images.

Again we use Firebug to capture the request.

Open this GET request and its result:

Request: http://neihanshequ.com/pic/?is_json=1&max_time=1429794628
Response: entering this request URL in the browser returns a JSON response.

Expanding the JSON step by step, we find what we need under large_image.

Look closely at the JSON response and you will notice a min_time field, which is a Unix timestamp. This min_time is exactly the max_time of the next request.
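A sketch of pulling those fields out of one JSON response, reusing get_html from above; the field layout matches what __parseJson in the full source reads:

import json

json_data = get_html('http://neihanshequ.com/pic/?is_json=1&max_time=1429794628')
dct = json.loads(json_data)

next_time = dct['data']['max_time']   # feeds the max_time of the next request
for item in dct['data']['data']:
    caption = item['group']['content']                            # the caption
    pic_url = item['group']['large_image']['url_list'][0]['url']  # the image URL
    print caption, pic_url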

Looping like this retrieves all the images!

The HTML fetched on the very first request also contains such a max_time value (the max_time regex in the code below picks it out), which gives the loop its starting point.

So our task basically comes down to repeatedly parsing JSON responses and downloading the images.

Below is the source code of my first version.

# -*- coding: utf-8 -*-

import urllib2
import urllib
import re
import thread
import time
import os
import random
import json

# Neihan Duanzi crawler class
class neiHanSpider:
    def __init__(self):
        self.primer_url = 'http://neihanshequ.com/pic/'
        # URL requested after clicking "load more"
        self.base_url = 'http://neihanshequ.com/pic/?is_json=1&max_time='

    def Start(self):
        # First fetch the HTML of the landing page and extract its data-pic and max_time
        primer_html = self.__getHtml(self.primer_url)
        data_pic = self.__getDataPic(primer_html)
        max_time = self.__getMaxTime(primer_html)
        # download pic
        self.__downloadPic(data_pic)
        count = 0
        # Now download the images behind the "load more" button
        while max_time:
            count = count + 1
            print "=--------------------THIS IS THE " + str(count) + " Json Data  Time : " + str(max_time) + "--------------------"
            url = self.base_url + str(max_time)
            json_data = self.__getHtml(url)
            json_ret = self.__parseJson(json_data)
            max_time = json_ret['max_time']
            print max_time
            image_url = json_ret['image_url']
            image_content = json_ret['image_content']
            self.__downloadPic(image_url, image_content)

    # In Python, names starting with two underscores are treated as private
    # retry up to 5 times (see __getHtml below)

    # Parse the JSON response and pull out the data we need
    def __parseJson(self, json_data):
        print "------This is parse_json --------"
        dct = json.loads(json_data)
        image_content = []
        image_url = []
        max_time = ""
        try:
            max_time = dct['data']['max_time']
            data = dct['data']['data']
            for item in data:
                content = item['group']['content']
                url = item['group']['large_image']['url_list'][0]['url']
                image_content.append(content)
                image_url.append(url)

            ret = {}
            ret['image_content'] = image_content
            ret['image_url'] = image_url
            ret['max_time'] = max_time
            return ret
        except:
            print "json_parse error"

    # Image download function
    def __downloadPic(self, imageAddressList, contentList=[]):
        print "---download------"
        contentExist = len(contentList)
        count = 0
        for image in imageAddressList:
            print image
            count = count + 1
            randTail = str(random.randint(0, 30000000))
            try:
                # use the caption as the file name if one exists, otherwise a random number
                if contentExist:
                    tail = contentList[count - 1]
                else:
                    tail = randTail
                fullPath = "C:\\Users\\Administrator\\Desktop\\python\\" + tail + ".jpg"
                urllib.urlretrieve(image, fullPath)
            except:
                failedMsg = "Download of image " + str(count) + " failed, URL: " + str(image)
                print failedMsg
                pass

    def __getDataPic(self, html):
        re_str = r'data-pic="([^"]*)"'
        data_pic = self.__getDataByRe(html, re_str)
        return data_pic

    def __getMaxTime(self, html):
        re_str = r'max_time: \'([\d]*)\''
        max_time = self.__getDataByRe(html, re_str)
        return max_time

    def __getDataByRe(self, text, re_str):
        pattern = re.compile(re_str)
        ret = pattern.findall(text)
        return ret

    def __getHtml(self, url):
        print "GET HTML********"
        count = 0
        while count < 5:
            count = count + 1
            print str(count) + " times ,try download html"
            html = self.__getDataByUrl(url)
            if not html:
                continue
            else:
                return html

    def __getDataByUrl(self, url):
        print "---------------now get html from url :" + url + "----------"
        send_headers = {
            'Host': 'neihanshequ.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0',
            'Cookie': "pksrqup=1; csrftoken=237f4451075fe45cef3a4f5449f70658; tt_webid=3379513254; uuid=\"w:33266c46f0cc4fa6944c073b1b1bccea\"",
            'Connection': 'keep-alive'
        }
        req = urllib2.Request(url, headers=send_headers)
        try:
            response = urllib2.urlopen(req, timeout=100)
            html = response.read()
            return html
        except urllib2.HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        except urllib2.URLError, e:
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        else:
            print 'No exception was raised.'

#------------------------------------------ entry point ------------------------------

mySpider = neiHanSpider()
mySpider.Start()


Later I also tried a multithreaded version.

# -*- coding: utf-8 -*-

import urllib2
import urllib
import re
import threading
import time
import os
import random
import json

# Neihan Duanzi crawler class
class neiHanSpider:
    def __init__(self):
        self.primer_url = 'http://neihanshequ.com/pic/'
        # URL requested after clicking "load more"
        self.base_url = 'http://neihanshequ.com/pic/?is_json=1&max_time='

    def Start(self):
        # First fetch the HTML of the landing page and extract its data-pic and max_time
        primer_html = self.__getHtml(self.primer_url)
        data_pic = self.__getDataPic(primer_html)
        max_time = self.__getMaxTime(primer_html)
        # download pic
        #self.__downloadPic(data_pic)
        global downloadUrlList
        global downloadTitleList
        #downloadList = downloadList + data_pic
        count = 0
        # Collect the image URLs behind the "load more" button (limited to a couple of JSON pages here)
        while max_time and count <= 1:
            count = count + 1
            print "=--------------------THIS IS THE " + str(count) + " Json Data  Time : " + str(max_time) + "--------------------"
            url = self.base_url + str(max_time)
            json_data = self.__getHtml(url)
            json_ret = self.__parseJson(json_data)
            max_time = json_ret['max_time']
            print max_time
            image_url = json_ret['image_url']
            image_content = json_ret['image_content']
            #self.__downLoadPic(image_url,image_content)
            downloadUrlList = downloadUrlList + image_url
            downloadTitleList = downloadTitleList + image_content
    # In Python, names starting with two underscores are treated as private
    # retry up to 5 times (see __getHtml below)

    # Parse the JSON response and pull out the data we need
    def __parseJson(self, json_data):
        print "------This is parse_json --------"
        dct = json.loads(json_data)
        image_content = []
        image_url = []
        max_time = ""
        try:
            max_time = dct['data']['max_time']
            data = dct['data']['data']
            for item in data:
                content = item['group']['content']
                url = item['group']['large_image']['url_list'][0]['url']
                image_content.append(content)
                image_url.append(url)

            ret = {}
            ret['image_content'] = image_content
            ret['image_url'] = image_url
            ret['max_time'] = max_time
            return ret
        except:
            print "json_parse error"

    def __getDataPic(self, html):
        re_str = r'data-pic="([^"]*)"'
        data_pic = self.__getDataByRe(html, re_str)
        return data_pic

    def __getMaxTime(self, html):
        re_str = r'max_time: \'([\d]*)\''
        max_time = self.__getDataByRe(html, re_str)
        return max_time

    def __getDataByRe(self, text, re_str):
        pattern = re.compile(re_str)
        ret = pattern.findall(text)
        return ret

    def __getHtml(self, url):
        print "GET HTML********"
        count = 0
        while count < 5:
            count = count + 1
            print str(count) + " times ,try download html"
            html = self.__getDataByUrl(url)
            if not html:
                continue
            else:
                return html

    def __getDataByUrl(self, url):
        print "---------------now get html from url :" + url + "----------"
        send_headers = {
            'Host': 'neihanshequ.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0',
            'Cookie': "pksrqup=1; csrftoken=237f4451075fe45cef3a4f5449f70658; tt_webid=3379513254; uuid=\"w:33266c46f0cc4fa6944c073b1b1bccea\"",
            'Connection': 'keep-alive'
        }
        req = urllib2.Request(url, headers=send_headers)
        try:
            response = urllib2.urlopen(req, timeout=100)
            html = response.read()
            return html
        except urllib2.HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        except urllib2.URLError, e:
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        else:
            print 'No exception was raised.'

class myDownLoad(threading.Thread):
    def __init__(self, threadID, name):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name

    def run(self):
        print "Starting " + self.name
        # acquire() returns True once the lock is obtained;
        # without the optional timeout it blocks until the lock is acquired,
        # otherwise it returns False on timeout
        global pos
        global size
        global downloadUrlList
        global downloadTitleList

        #while threadLock.acquire():   this falls into an infinite loop
        #if pos + 1 >= size :
        #threadLock.release()
        #return;
        while True:
            threadLock.acquire()
            if pos >= size:
                threadLock.release()
                break
            # claim the next index under the lock, then release it before downloading
            temp_pos = pos
            pos = pos + 1
            threadLock.release()
            try:
                tail = downloadTitleList[temp_pos]
                image_url = downloadUrlList[temp_pos]
                fullPath = "C:\\Users\\Administrator\\Desktop\\python\\" + tail + ".jpg"
                urllib.urlretrieve(image_url, fullPath)
                print "Pos :" + str(temp_pos) + "  DownLoad Ok----------"
            except:
                failedMsg = "Download of image " + str(temp_pos) + " failed, URL: " + str(image_url)
                print failedMsg
                pass
        # the thread exits when run() returns

#------------------------------------------ entry point ------------------------------
startTime = time.time()
downloadUrlList = []
downloadTitleList = []
pos = 0
size = 0
mySpider = neiHanSpider()
mySpider.Start()

print str(len(downloadUrlList)) + "----->" + str(len(downloadTitleList))

threadLock = threading.Lock()
threads = []
size = len(downloadUrlList)

# spawn the download threads
for i in range(1, 10):
    thread = myDownLoad(i, "Thread-" + str(i))
    thread.start()
    threads.append(thread)

# wait until only the main thread is left
aliveCount = 10
while aliveCount > 1:
    print "Now there are " + str(aliveCount) + " threads alive"
    aliveCount = threading.activeCount()
    time.sleep(10)

endTime = time.time()
print "Downloaded " + str(size) + " images in " + str((endTime - startTime) / 60) + " min"
print "Exiting Main Thread"


The code may not be very tidy; I will clean it up when I find time. I am learning Python as I go, so criticism and corrections are welcome.