
Parsing Web Pages That Rely on JavaScript with Selenium and PhantomJS

2015-05-06 15:32

Reposted from: http://smilejay.com/2013/12/try-phantomjs-with-selenium/

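Some web pages cannot be fetched in the form actually shown to the user with commands such as wget or curl, or with a Python library such as urllib, because they contain JavaScript that takes part in generating the page data. Such pages have to be rendered by a browser such as Firefox or Chrome before the content you want appears.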
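To illustrate why a plain fetch falls short, here is a minimal Python 3 sketch that uses only the standard library. The `PAGE` string is a made-up stand-in for a downloaded page (it is not the markup of any real site), and `ResultText` and `static_result_text` are hypothetical helper names. It shows that the text a script would write into the page is absent from the static HTML that wget, curl, or urllib would receive.

```python
from html.parser import HTMLParser

# Hypothetical page: the visible text is only filled in by a script at load time.
PAGE = """
<html><body>
<div id="result"></div>
<script>
document.getElementById("result").textContent = "filled in by JavaScript";
</script>
</body></html>
"""


class ResultText(HTMLParser):
    """Collects the text found inside <div id="result"> in static HTML."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "result":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def static_result_text(html):
    """Return the text of the result div as seen without running any script."""
    parser = ResultText()
    parser.feed(html)
    return "".join(parser.chunks).strip()


print(repr(static_result_text(PAGE)))  # prints '' because no script has run
```

A static parser sees an empty element; only something that executes the script, such as a browser engine, would see the filled-in text.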

For example, here is a page I wanted to query, which maps an IP address to a geographic location: http://www.ip.cn/125.95.26.81

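I wanted a program to fetch the data automatically, for instance the text “广东省佛山市电信” (Guangdong Province, Foshan City, China Telecom) shown on http://www.ip.cn/125.95.26.81. Generally speaking, there are two approaches: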

1. Write a Web UI automation script: use Selenium to launch a real browser (e.g. IE or Firefox) to open the page, then call WebDriver to get the page elements you want.

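2. Find a browser rendering engine that can parse the page and run the JavaScript the page needs for initialization, then output the HTML as it stands after the JS and CSS have been applied.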

Launching a real browser can bring two problems: it takes a relatively long time, and UI automation is easily disturbed and not very stable.

As for the second approach, I could not find a particularly good library at first (I am using Python for now).

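After looking at some solutions online and asking colleagues, I finally found a driver in Selenium WebDriver named “PhantomJS”, which is based on the WebKit engine but does not launch a browser window. I later found that well-known Internet companies such as LinkedIn and Twitter also use PhantomJS for testing.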

For the benefits of PhantomJS, see http://phantomjs.org/ (Headless Website Testing, Screen Capture, Page Automation, Network Monitoring).

For cases where PhantomJS is not suitable and a real browser should be used instead, see http://www.chrisle.me/2013/08/5-reasons-i-chose-selenium-over-phantomjs/

I won't go into the pros and cons of PhantomJS here; in any case, it solves my current problem.

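First download the PhantomJS executable from the official website; after that, write the script just as you would any ordinary Selenium automation script.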
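Since the driver has to be told where the downloaded executable lives, one possible way to locate it is sketched below in Python 3. The function name `find_phantomjs` and its defaults are hypothetical; the sketch assumes the binary is either at `./phantomjs` (the path the sample script in this post passes) or somewhere on `PATH`.

```python
import os
import shutil


def find_phantomjs(local_path="./phantomjs", name="phantomjs"):
    """Return a path to the PhantomJS binary, or None if it cannot be found."""
    # Prefer a binary unpacked next to the script.
    if os.path.isfile(local_path) and os.access(local_path, os.X_OK):
        return os.path.abspath(local_path)
    # Otherwise fall back to whatever is on PATH (None when absent).
    return shutil.which(name)


if __name__ == "__main__":
    path = find_phantomjs()
    print(path if path else "phantomjs not found; download it from phantomjs.org")
```

The returned path can then be passed as `executable_path` when the driver is created.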

My sample program is as follows:


#!/usr/bin/python
# -*- coding: utf-8 -*-

'''
Created on Dec 6, 2013

@author: Jay <smile665@gmail.com>
@description: use PhantomJS to parse a web page to get the geo info of an IP
'''

from selenium import webdriver

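# Python 2 only: make UTF-8 the default encoding so the non-ASCII
# string literal below can be used to split the unicode element text.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# You may need to specify the location of the phantomjs executable here.
driver = webdriver.PhantomJS(executable_path='./phantomjs')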
driver.get("http://www.ip.cn/125.95.26.81")
#print driver.current_url
#print driver.page_source
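# Take the first line of the #result element and keep what follows '来自：' ("from:").
print driver.find_element_by_id('result').text.split('\n')[0].split('来自：')[1]
driver.quit()  # shut down the PhantomJS process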
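
The script above targets Python 2. As a rough Python 3 sketch of the same idea, the version below separates the text parsing from the browser work and always shuts the driver down. The names `extract_location` and `lookup_ip` are hypothetical, the `sample` string is an invented stand-in for the element text (not captured from the real page), and `lookup_ip` assumes a Selenium release that still provides `webdriver.PhantomJS` and `find_element_by_id`, as the 2.x line used in this post does; newer releases have dropped them.

```python
def extract_location(result_text, marker="来自："):
    """Return the text after `marker` on the first line, or None if absent."""
    lines = result_text.splitlines()
    first_line = lines[0] if lines else ""
    _, found, location = first_line.partition(marker)
    return location.strip() if found else None


def lookup_ip(ip, phantomjs_path="./phantomjs"):
    """Render the lookup page in PhantomJS and return the location string."""
    # Imported lazily so extract_location() can be used without Selenium.
    from selenium import webdriver

    driver = webdriver.PhantomJS(executable_path=phantomjs_path)
    try:
        driver.get("http://www.ip.cn/" + ip)
        return extract_location(driver.find_element_by_id("result").text)
    finally:
        driver.quit()  # always shut the PhantomJS process down


# Hypothetical sample of the element text, for illustration only.
sample = "IP: 125.95.26.81 来自：广东省佛山市电信\nother details"
print(extract_location(sample))  # prints 广东省佛山市电信
```

Returning `None` when the marker is missing avoids the `IndexError` that indexing the result of `split` would raise if the page layout changed.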


jay@jay-linux:~/workspace/python_test$ python try_phantomjs.py
广东省佛山市电信

Note that the current Selenium (2.38.2) and PhantomJS (1.9.2) happen to have a bug when used together; see my other article, “A bug when using Selenium 2.38.2 and PhantomJS 1.9.2 together”.

References:

A very good introductory guide: http://www.realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/

Official documentation:
https://github.com/detro/ghostdriver
http://phantomjs.org/
http://phantomjs.org/users.html

Something similar to PhantomJS, but based on Gecko rather than WebKit: http://slimerjs.org/

Someone else here also uses PhantomJS to scrape data; worth a look: http://blog.chinaunix.net/uid-22414998-id-3692113.html
