
(Parsing JS with Python) Combining scrapy with Ghost to crawl JS-generated pages and parse JS variables

2016-04-10 16:26
More and more pages now use Ajax, and much of their content is produced by executing JavaScript. For example, at http://news.sohu.com/scroll/ the Sohu scrolling-news list is rendered by the backend into the front-end JS variables newsJason and arrNews in a single response, and the div and li elements are then generated by JS, so to get the data you must actually execute and parse the JS. During a scrapy crawl this means you need a middleware that executes the JS code.
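When the data is simply embedded in a JS variable assignment in the page source, a lightweight alternative to running a full JS engine is to pull it out of the raw HTML with a regular expression. A minimal sketch under that assumption (the helper extract_js_var is hypothetical, and it only works when the assigned literal happens to be valid JSON; the real newsJason/arrNews values may not be):

```python
import json
import re

def extract_js_var(html, var_name):
    """Pull a JSON-like literal assigned to a JS variable out of raw HTML.

    Only works when the value is a valid JSON array/object literal
    terminated by a semicolon; returns None when the variable is absent.
    """
    pattern = r'var\s+%s\s*=\s*(\[.*?\]|\{.*?\})\s*;' % re.escape(var_name)
    match = re.search(pattern, html, re.S)
    if match is None:
        return None
    return json.loads(match.group(1))

html = '<script>var arrNews = [["1", "title", "url"]];</script>'
print(extract_js_var(html, 'arrNews'))
```

This avoids WebKit entirely, but breaks as soon as the variable is built up imperatively instead of assigned as one literal, which is when a real JS engine like Ghost becomes necessary.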

scrapy itself cannot act as a JS engine, so data on JS-generated pages cannot be crawled directly. A common workaround is to use WebKit or a WebKit-based library.

Ghost is a Python WebKit client built on the WebKit core, using the WebKit implementation from PyQt or PySide.

Installation:

1. Install sip (a PyQt dependency):
wget http://sourceforge.net/projects/pyqt/files/sip/sip-4.14.6/sip-4.14.6.tar.gz
tar zxvf sip-4.14.6.tar.gz
cd sip-4.14.6
python configure.py
make
sudo make install
2. Install PyQt:
wget http://sourceforge.net/projects/pyqt/files/PyQt4/PyQt-4.10.1/PyQt-x11-gpl-4.10.1.tar.gz
tar zxvf PyQt-x11-gpl-4.10.1.tar.gz
cd PyQt-x11-gpl-4.10.1
python configure.py
make
sudo make install
3. Install Ghost:
git clone git://github.com/carrerasrodrigo/Ghost.py.git
cd Ghost.py
sudo python setup.py install


Using Ghost in scrapy:

1. Write a downloader middleware (webkit_js.py):

from scrapy.http import Request, FormRequest, HtmlResponse
import JsSpider.settings
from ghost import Ghost

class WebkitDownloader(object):
    def process_request(self, request, spider):
        if spider.name in JsSpider.settings.WEBKIT_DOWNLOADER:
            if type(request) is not FormRequest:
                ghost = Ghost()
                session = ghost.start()
                session.open(request.url)
                result, resources = session.evaluate('document.documentElement.innerHTML')
                # keep the session on the spider so the spider can run JS later
                spider.webkit_session = session
                renderedBody = str(result.toUtf8())
                # renderedBody is the page after the JS has been executed
                return HtmlResponse(request.url, body=renderedBody)


2. scrapy configuration

Add the following to scrapy's settings.py:

# which spiders should use webkit
WEBKIT_DOWNLOADER = ['spider_name']
DOWNLOADER_MIDDLEWARES = {
    #'JsSpider.middleware.rotate_useragent.RotateUserAgentMiddleware': 533,
    'JsSpider.middleware.webkit_js.WebkitDownloader': 543,
}
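The numbers in DOWNLOADER_MIDDLEWARES are priorities: scrapy sorts middlewares by value, and their process_request hooks run in ascending order, so a lower number runs earlier on the way out. A small sketch of that ordering, using the two middleware paths from the settings above:

```python
# Scrapy runs downloader middlewares' process_request hooks in
# ascending order of their priority value.
DOWNLOADER_MIDDLEWARES = {
    'JsSpider.middleware.rotate_useragent.RotateUserAgentMiddleware': 533,
    'JsSpider.middleware.webkit_js.WebkitDownloader': 543,
}

ordered = sorted(DOWNLOADER_MIDDLEWARES, key=DOWNLOADER_MIDDLEWARES.get)
# the user-agent middleware (533) sees each request before the
# webkit middleware (543) renders it
print(ordered)
```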


Here, JsSpider.middleware.rotate_useragent.RotateUserAgentMiddleware is a user-agent pool:

# -*-coding:utf-8-*-
"""One strategy to avoid getting banned: use a pool of user agents.
Note: the corresponding entry must be added in settings.py.
"""
import random
from scrapy import log
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            # show the user agent currently in use
            #print "********Current UserAgent:%s************" % ua
            # log it
            log.msg('Current UserAgent: ' + ua, level=log.INFO)
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list contains Chrome, IE, Firefox, Mozilla,
    # Opera and Netscape strings; more user agent strings can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
"(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
"(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
"(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
"(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
"(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
"(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
"(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
"(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
"(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
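The rotation itself is just random.choice plus setdefault on the request headers; the same idea can be sketched without scrapy (a minimal illustration, using a plain dict in place of scrapy's Headers object and a shortened UA list):

```python
import random

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

def rotate_user_agent(headers):
    """Pick a random UA; setdefault keeps any UA already set on the request."""
    ua = random.choice(user_agent_list)
    headers.setdefault('User-Agent', ua)
    return headers

headers = rotate_user_agent({})
print(headers['User-Agent'])
```

Because setdefault is used rather than plain assignment, a spider that deliberately sets its own User-Agent on a request is not overridden by the pool.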


3. Parsing JS variables in the spider:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from JsSpider.items import JsSpiderItem

#import sys
#reload(sys)
#sys.setdefaultencoding('utf-8')

class JsSpider(Spider):
    name = "js"
    #download_delay = 3
    allowed_domains = ["news.sohu.com"]
    start_urls = [
        "http://news.sohu.com/scroll/"
    ]

    def __init__(self):
        self.webkit_session = None

    def parse(self, response):
        items = []
        newsJason = self.webkit_session.evaluate('newsJason')  # fetch the JS object
        arrNews = self.webkit_session.evaluate('arrNews')      # fetch the JS object
        print type(newsJason)
        print type(arrNews)

        newsJason = newsJason[0]
        arrNews = arrNews[0]

        category = [v for k, v in newsJason.iteritems()][0]

        for i in range(len(category)):
            for j in range(len(category[i])):
                category[i][j] = str(category[i][j])

        for i in range(len(arrNews)):
            for j in range(len(arrNews[i])):
                arrNews[i][j] = str(arrNews[i][j])
        # The results Ghost returns from JS are usually QString objects, which are
        # awkward to convert: in complex nested structures every level is a QString,
        # and there are encoding issues too. Proper formatting of the data is left
        # for another time.
        print category
        print arrNews

        return items
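The nested loops above only handle two levels of nesting; the QString conversion pain mentioned in the comment can be handled more generally with a recursive converter. A hedged sketch (to_plain is a hypothetical helper, and plain unicode strings stand in here for the QString instances Ghost would actually return):

```python
def to_plain(value):
    """Recursively convert nested lists/dicts of string-like values
    (e.g. QString results from Ghost's evaluate) into plain str."""
    if isinstance(value, list):
        return [to_plain(v) for v in value]
    if isinstance(value, dict):
        return dict((to_plain(k), to_plain(v)) for k, v in value.items())
    # QString, unicode, etc. all stringify via str()
    return str(value)

print(to_plain({u'news': [[u'1', u'title'], [u'2', u'other']]}))
```

A converter like this replaces both pairs of loops in parse and keeps working if the page ever nests the data one level deeper.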


Notes:

The page requested by scrapy passes through the middleware, which executes the JS and returns the response to the spider; at that point the JS variables in the response hold the data we need. The spider then uses the Ghost session (webkit_session) saved onto it by the middleware to evaluate JS variables with the evaluate(javascript) function: to get the value of the JS variable arrNews, just call self.webkit_session.evaluate('arrNews').