(Parsing JS with Python) Scraping JS-generated pages with Scrapy plus Ghost, and parsing JS variables
2016-04-10 16:26
More and more pages use Ajax these days, and a lot of content only appears on the page as the result of executing JavaScript. For example, on http://news.sohu.com/scroll/ (Sohu's scrolling news), the news list is rendered by the backend into the front-end JS variables newsJason and arrNews in one shot when the page is requested, and JS then generates the div and li elements, so getting at the data requires parsing and executing the JS. During a Scrapy crawl, therefore, we need a middleware that executes this JS code.
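When the data is embedded in the raw HTML as a JS variable assignment (as on the Sohu page above), it can sometimes be recovered without a JS engine at all, using a regex plus the json module. A minimal sketch — the HTML fragment here is made up for illustration, and the real page's variable layout may differ:

```python
import json
import re

# Made-up HTML imitating data rendered into a JS variable by the backend
html = """
<script>
var arrNews = [["09:15", "Headline one"], ["09:20", "Headline two"]];
</script>
"""

# Grab the array literal assigned to arrNews; this only works when the
# literal happens to also be valid JSON (no functions, no comments)
match = re.search(r'var\s+arrNews\s*=\s*(\[.*?\]);', html, re.S)
arr_news = json.loads(match.group(1))
print(arr_news[0][1])  # -> Headline one
```

This shortcut breaks as soon as the variable is built up imperatively rather than assigned a literal, which is when a real JS engine like Ghost becomes necessary.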
Scrapy itself cannot act as a JS engine, which means data on many JS-generated pages cannot be scraped directly; the common approach is to use WebKit or a WebKit-based library.
Ghost is a Python WebKit client built on the WebKit core, using the WebKit implementation from PyQt or PySide.
Installation:

# 1. Install sip (a PyQt dependency)
wget http://sourceforge.net/projects/pyqt/files/sip/sip-4.14.6/sip-4.14.6.tar.gz
tar zxvf sip-4.14.6.tar.gz
cd sip-4.14.6
python configure.py
make
sudo make install

# 2. Install PyQt
wget http://sourceforge.net/projects/pyqt/files/PyQt4/PyQt-4.10.1/PyQt-x11-gpl-4.10.1.tar.gz
tar zxvf PyQt-x11-gpl-4.10.1.tar.gz
cd PyQt-x11-gpl-4.10.1
python configure.py
make
sudo make install

# 3. Install Ghost
git clone git://github.com/carrerasrodrigo/Ghost.py.git
cd Ghost.py
sudo python setup.py install
Using Ghost with Scrapy:

1. Write a downloader middleware (webkit_js.py):

from scrapy.http import Request, FormRequest, HtmlResponse
import JsSpider.settings
from ghost import Ghost

class WebkitDownloader(object):
    def process_request(self, request, spider):
        if spider.name in JsSpider.settings.WEBKIT_DOWNLOADER:
            if type(request) is not FormRequest:
                ghost = Ghost()
                session = ghost.start()
                session.open(request.url)
                result, resources = session.evaluate('document.documentElement.innerHTML')
                # Keep the session on the spider so the spider can run JS later
                spider.webkit_session = session
                renderedBody = str(result.toUtf8())
                # renderedBody is the page after the JS has been executed
                return HtmlResponse(request.url, body=renderedBody)
2. Scrapy configuration
Add the following to Scrapy's settings.py:
# which spiders should use webkit
WEBKIT_DOWNLOADER = ['spider_name']
DOWNLOADER_MIDDLEWARES = {
    #'JsSpider.middleware.rotate_useragent.RotateUserAgentMiddleware': 533,
    'JsSpider.middleware.webkit_js.WebkitDownloader': 543,
}
Here JsSpider.middleware.rotate_useragent.RotateUserAgentMiddleware is a user-agent pool:
# -*- coding: utf-8 -*-
"""One anti-ban strategy: use a pool of user agents.
Note: the corresponding settings must be added in settings.py.
"""
import random
from scrapy import log
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            # Log the user agent currently in use
            #print "********Current UserAgent:%s************" % ua
            log.msg('Current UserAgent: ' + ua, level=log.INFO)
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list contains Chrome user agents; for more
    # strings see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
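The rotation logic above boils down to random.choice over the list plus setdefault on the headers. A standalone sketch with a plain dict standing in for Scrapy's headers object (the short UA strings are placeholders, not real user agents):

```python
import random

random.seed(0)  # deterministic for the demo

# Placeholder strings standing in for real user agents
user_agent_list = [
    "UA-chrome-22",
    "UA-chrome-20",
    "UA-chrome-19",
]

headers = {}
ua = random.choice(user_agent_list)
if ua:
    headers.setdefault('User-Agent', ua)

# setdefault leaves an already-set header alone, so a spider that sets
# its own User-Agent explicitly would not be overridden by the pool
headers.setdefault('User-Agent', 'UA-explicit')
print(headers['User-Agent'])
```

The use of setdefault rather than plain assignment is why per-request User-Agent overrides still work with this middleware installed.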
3. Parsing JS in the spider
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from JsSpider.items import JsSpiderItem
#import sys
#reload(sys)
#sys.setdefaultencoding('utf-8')

class JsSpider(Spider):
    name = "js"
    #download_delay = 3
    allowed_domains = ["news.sohu.com"]
    start_urls = [
        "http://news.sohu.com/scroll/"
    ]

    def __init__(self):
        self.webkit_session = None

    def parse(self, response):
        items = []
        newsJason = self.webkit_session.evaluate('newsJason')  # get the JS object
        arrNews = self.webkit_session.evaluate('arrNews')      # get the JS object
        print type(newsJason)
        print type(arrNews)
        newsJason = newsJason[0]
        arrNews = arrNews[0]
        category = [v for k, v in newsJason.iteritems()][0]
        for i in range(len(category)):
            for j in range(len(category[i])):
                category[i][j] = str(category[i][j])
        for i in range(len(arrNews)):
            for j in range(len(arrNews[i])):
                arrNews[i][j] = str(arrNews[i][j])
        # The results Ghost returns from evaluating JS are usually QString
        # objects, which are awkward to convert -- in nested structures every
        # level is a QString -- and there are encoding issues on top. Proper
        # formatting of the data is left for later.
        print category
        print arrNews
        return items
Notes:
The page Scrapy requests goes through the middleware, which executes the JS and returns the response to the spider; at that point the response's JS variables hold the data we need. The Ghost session webkit_session saved onto the spider is then used to evaluate JS variables via the evaluate(javascript) function: to get the value of the JS variable arrNews, just call self.webkit_session.evaluate('arrNews').
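The spider above converts values with str() element by element; the nested-QString problem its comments mention could instead be handled by a small recursive converter. A hedged sketch — QStringLike below is a stub standing in for PyQt's QString (the real class lives in PyQt4.QtCore), since the exact types depend on the PyQt/PySide build Ghost runs on:

```python
def to_python(value):
    """Recursively convert Ghost/PyQt evaluate() results to plain Python types."""
    # QString-like objects expose toUtf8(); decode them to text
    if hasattr(value, 'toUtf8'):
        return bytes(value.toUtf8()).decode('utf-8')
    if isinstance(value, dict):
        return {to_python(k): to_python(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_python(v) for v in value]
    return value


# Stub imitating PyQt's QString, for demonstration only
class QStringLike(object):
    def __init__(self, text):
        self._text = text

    def toUtf8(self):
        return self._text.encode('utf-8')


nested = [QStringLike(u"09:15"), [QStringLike(u"新闻")]]
print(to_python(nested))  # -> ['09:15', ['新闻']]
```

Because the converter recurses through lists and dicts, arbitrarily nested structures like newsJason come back as ordinary Python values in one call, which also sidesteps the encoding issues of calling str() on a QString directly.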