
Scraping JS-generated pages with phantomjs

2013-07-25 14:02

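I have recently been looking into how to raise the success rate of getting code merged in the openstack community, and I plan to scrape the reviews other people have submitted and run some statistics on them.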

The URL pattern of the review info page is simple enough: https://review.openstack.org/#/c/{id}/. Unfortunately the page is generated by js, so wget cannot handle it.
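The URL pattern above can be expanded into a list of pages to fetch. The following is only a minimal sketch: `reviewUrl`, `reviewUrls` and `requestedPart` are hypothetical helper names made up for illustration, not part of phantomjs or Gerrit. It also hints at why wget gets nowhere: everything after `#` is a URL fragment, which the client never sends to the server, so every review id requests the same base page and the actual content is filled in by js afterwards.

```javascript
// Hypothetical helpers for the pattern https://review.openstack.org/#/c/{id}/
var BASE = 'https://review.openstack.org/';

// Build the review page URL for a single review id.
function reviewUrl(id) {
    return BASE + '#/c/' + id + '/';
}

// Build the URLs for all ids from first to last, inclusive.
function reviewUrls(first, last) {
    var urls = [];
    for (var id = first; id <= last; id++) {
        urls.push(reviewUrl(id));
    }
    return urls;
}

// The part of a URL that is actually sent in the HTTP request:
// everything before the '#' fragment.
function requestedPart(url) {
    var hash = url.indexOf('#');
    return hash === -1 ? url : url.slice(0, hash);
}

console.log(reviewUrl(38576));                 // https://review.openstack.org/#/c/38576/
console.log(requestedPart(reviewUrl(38576)));  // https://review.openstack.org/
```

The code is plain ES5, so the same helpers could be pasted into a phantomjs script to loop over a range of review ids.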

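I googled around. The "test framework + js engine" solutions are certainly comprehensive, but for my purposes they are like using a cannon to swat a mosquito: pywebkit alone cost me ages and I still could not get it installed. Then I found phantomjs, which turned out to fit my needs very well and is extremely simple to use: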

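1. Download the archive for your platform from the download page: http://phantomjs.org/download.html
2. Taking the linux build as an example, unpacking it gives a bin and an examples directory; bin/phantomjs can be executed directly (run bin/phantomjs --help for the available options)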

3. Under examples/ there is a phantomwebintro.js, which fetches the intro text from the http://www.phantomjs.org home page. Copy it and make a few small changes:

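// Read the text of the '#gerrit_body' element using jQuery and "includeJs"
var page = require('webpage').create();

// Forward console messages from the page context to phantomjs's own output
page.onConsoleMessage = function(msg) {
    console.log(msg);
};

page.open("https://review.openstack.org/#/c/38576/", function(status) {
    if ( status === "success" ) {
        // Inject jQuery into the loaded page, then query it from the page context
        page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
            page.evaluate(function() {
                console.log("$(\"#gerrit_body\").text() -> " + $("#gerrit_body").text());
            });
            phantom.exit();
        });
    } else {
        // Exit on failure as well; otherwise the script never terminates
        console.log("Failed to load the page");
        phantom.exit(1);
    }
});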

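4. Open the page you want to scrape in chrome. Inspect Element shows that the main content is all inside the div with id="gerrit_body". Once the js has been adjusted accordingly (saved here as test.js), it can be used directly: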

bin/phantomjs test.js >test_result

5. To keep the html tags, change $("#gerrit_body").text() to $("#gerrit_body").html()
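Steps 3 to 5 can be folded into one script that takes the review id and the output mode from the command line instead of editing the file each time. This is a rough sketch under a few assumptions: the file name scrape.js and the helper `parseOptions` are made up for illustration, and it assumes a phantomjs version in which `page.evaluate` accepts extra arguments and the `system` module is available. The scraping part is guarded so that it only runs under phantomjs.

```javascript
// Assumed usage: bin/phantomjs scrape.js 38576 html > test_result

// Hypothetical helper: turn the argument list (without the script name)
// into options, or null when no valid review id was given.
function parseOptions(args) {
    var id = parseInt(args[0], 10);
    if (isNaN(id)) {
        return null;
    }
    return { id: id, mode: args[1] === 'html' ? 'html' : 'text' };
}

// Only scrape when running under phantomjs, where `phantom` is defined.
if (typeof phantom !== 'undefined') {
    // system.args[0] is the script name, so drop it.
    var options = parseOptions(require('system').args.slice(1));
    if (!options) {
        console.log('usage: phantomjs scrape.js <review-id> [text|html]');
        phantom.exit(1);
    } else {
        var page = require('webpage').create();
        page.open('https://review.openstack.org/#/c/' + options.id + '/', function (status) {
            if (status !== 'success') {
                console.log('Failed to load the page');
                phantom.exit(1);
                return;
            }
            page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js', function () {
                // The mode is passed into the page context as an argument;
                // the returned string is handed back to the phantomjs side.
                var content = page.evaluate(function (mode) {
                    var body = $('#gerrit_body');
                    return mode === 'html' ? body.html() : body.text();
                }, options.mode);
                console.log(content);
                phantom.exit();
            });
        });
    }
}
```

Returning the content from `page.evaluate` rather than logging it inside the page means no `onConsoleMessage` handler is needed, and the output file contains only the scraped content.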

More examples can be found at:
https://github.com/ariya/phantomjs/wiki/Examples
