pyspider Crawler Learning - Tutorial 3 - Render-with-PhantomJS.md
2017-09-07 00:00
Summary: pyspider crawler learning, Tutorial 3: Render with PhantomJS
Level 3: Render with PhantomJS
==============================

Sometimes a web page is too complex for you to find the underlying API request. It's time to meet the power of [PhantomJS].

To use PhantomJS, you should have PhantomJS [installed](http://phantomjs.org/download.html). If you are running pyspider in `all` mode, PhantomJS is enabled when its executable is in the `PATH`.

Make sure phantomjs is working by running

```
$ pyspider phantomjs
```

Continue with the rest of the tutorial if the output is

```
Web server running on port 25555
```

Use PhantomJS
-------------

When pyspider is connected to PhantomJS, you can enable this feature by adding the parameter `fetch_type='js'` to `self.crawl`. We use PhantomJS to scrape the channel list of [http://www.twitch.tv/directory/game/Dota%202](http://www.twitch.tv/directory/game/Dota%202), which is loaded with AJAX as discussed in [Level 2](tutorial/AJAX-and-more-HTTP#ajax):

```
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # fetch_type='js' asks the fetcher to render the page with PhantomJS
        self.crawl('http://www.twitch.tv/directory/game/Dota%202',
                   fetch_type='js', callback=self.index_page)

    def index_page(self, response):
        # response.doc is a PyQuery object built from the rendered page
        return {
            "url": response.url,
            "channels": [{
                "title": x('.title').text(),
                "viewers": x('.info').contents()[2],
                "name": x('.info a').text(),
            } for x in response.doc('.stream.item').items()]
        }
```

> I used some PyQuery APIs to handle the list of streams. You can find the complete reference in the [PyQuery complete API](https://pythonhosted.org/pyquery/api.html).

Running JavaScript on Page
--------------------------

In this section we will try to scrape images from [http://www.pinterest.com/categories/popular/](http://www.pinterest.com/categories/popular/). Only 25 images are shown at the beginning; more images are loaded when you scroll to the bottom of the page.

To scrape as many images as possible, we can use the [`js_script` parameter](/apis/self.crawl/#enable-javascript-fetcher-need-support-by-fetcher) to pass some JavaScript code, wrapped in a function, that simulates the scroll action:

```
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # js_script is a JavaScript function run by PhantomJS inside the page
        self.crawl('http://www.pinterest.com/categories/popular/',
                   fetch_type='js', js_script="""
                   function() {
                       window.scrollTo(0,document.body.scrollHeight);
                   }
                   """, callback=self.index_page)

    def index_page(self, response):
        # Keep only the items that actually contain a pin image
        return {
            "url": response.url,
            "images": [{
                "title": x('.richPinGridTitle').text(),
                "img": x('.pinImg').attr('src'),
                "author": x('.creditName').text(),
            } for x in response.doc('.item').items() if x('.pinImg')]
        }
```

> * The script is executed after the page has loaded (this can be changed via the [`js_run_at` parameter](/apis/self.crawl/#enable-javascript-fetcher-need-support-by-fetcher)).
> * We scroll once after the page has loaded. You can scroll multiple times using [`setTimeout`](https://developer.mozilla.org/en-US/docs/Web/API/WindowTimers.setTimeout). PhantomJS will fetch as many items as possible before the timeout is reached.

Online demo: [http://demo.pyspider.org/debug/tutorial_pinterest](http://demo.pyspider.org/debug/tutorial_pinterest)

[PhantomJS]: http://phantomjs.org/
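
The note above about scrolling multiple times with `setTimeout` could look roughly like the sketch below. It is only an illustration: the helper `build_scroll_script`, the `CRAWL_KWARGS` dictionary, and the default scroll count and interval are hypothetical names and values chosen for this example, not part of pyspider. The helper only builds the JavaScript string, so it runs without pyspider installed.

```python
def build_scroll_script(times=5, interval_ms=1000):
    """Return a function-wrapped JavaScript snippet for `js_script`.

    Hypothetical helper (not a pyspider API): the generated script scrolls
    to the bottom of the page `times` times, waiting `interval_ms`
    milliseconds between scrolls via setTimeout.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    if interval_ms < 0:
        raise ValueError("interval_ms must not be negative")
    return """
function() {
    var remaining = %d;
    function scrollOnce() {
        window.scrollTo(0, document.body.scrollHeight);
        remaining -= 1;
        if (remaining > 0) {
            setTimeout(scrollOnce, %d);
        }
    }
    scrollOnce();
}
""" % (times, interval_ms)


# Assumed values for illustration; tune them to the target page and keep the
# total scrolling time below the fetch timeout.
CRAWL_KWARGS = {
    "fetch_type": "js",
    "js_script": build_scroll_script(times=5, interval_ms=1000),
}

if __name__ == "__main__":
    print(CRAWL_KWARGS["js_script"])
```

Inside a handler these keyword arguments would be passed along with the callback, for example `self.crawl(url, callback=self.index_page, **CRAWL_KWARGS)`. As the tutorial notes, PhantomJS only collects what has loaded before the timeout, so a large scroll count with a long interval may be cut short.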