Building a powerful crawler with selenium + BeautifulSoup
2017-09-18 00:00
Runs under Sublime Text.
1 Download and install the required packages
BeautifulSoup
selenium
PhantomJS
Each can be downloaded and installed manually; this article uses pip. Note that the current BeautifulSoup package on PyPI is beautifulsoup4, and PhantomJS is distributed as a standalone binary (pip does not install the browser itself), so download it from phantomjs.org. The lxml parser used below is a separate install:
pip install beautifulsoup4
pip install selenium
pip install lxml
2 Core code
Open a PhantomJS driver with a spoofed user agent:
def driver_open():
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36")
    driver = webdriver.PhantomJS(
        executable_path=r'C:\Users\Administrator\AppData\Roaming\Sublime Text 3\Packages\Anaconda\phantomjs.exe',
        desired_capabilities=dcap)
    return driver
Fetch the page and hand the rendered HTML to BeautifulSoup:
def get_content(driver, url):
    driver.get(url)
    time.sleep(30)  # crude wait for JavaScript-rendered content to load
    content = driver.page_source.encode('utf-8')
    driver.close()
    soup = BeautifulSoup(content, 'lxml')
    return soup
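The fixed time.sleep(30) always waits the full 30 seconds even when the page loads much sooner. A generic polling helper (pure Python, hypothetical names; selenium's own WebDriverWait implements the same idea) returns as soon as a condition holds:

```python
import time


def wait_until(condition, timeout=30, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %s seconds" % timeout)


# Usage sketch: wait until a known marker appears in the page source.
# wait_until(lambda: 'baseInfo_model2017' in driver.page_source)
```

This keeps the crawler responsive on fast pages while still bounding the worst-case wait.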
3 Full source code
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def driver_open():
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36")
    driver = webdriver.PhantomJS(
        executable_path=r'C:\Users\Administrator\AppData\Roaming\Sublime Text 3\Packages\Anaconda\phantomjs.exe',
        desired_capabilities=dcap)
    return driver

def get_content(driver, url):
    driver.get(url)
    time.sleep(30)  # crude wait for JavaScript-rendered content to load
    content = driver.page_source.encode('utf-8')
    driver.close()
    soup = BeautifulSoup(content, 'lxml')
    return soup

def get_basic_info(soup):
    basic_info = soup.select('.baseInfo_model2017')
    zt = soup.select('.td-regStatus-value > p')[0].text.replace("\n", "").replace(" ", "")
    basics = soup.select('.basic-td > .c8 > .ng-binding')
    zzjgdm = basics[3].text
    tyshxydm = basics[7].text
    # print(u'Company name: ' + company)  # 'company' was never defined in the original source
    print(u'Company status: ' + zt)
    print(u'Organization code: ' + zzjgdm)
    print(u'Unified social credit code: ' + tyshxydm)

if __name__ == '__main__':
    url = "http://www.tianyancha.com/company/2310290454"
    driver = driver_open()
    soup = get_content(driver, url)
    print(soup.body.text)
    print('----Fetching basic info----')
    get_basic_info(soup)
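The CSS selectors in get_basic_info can be exercised offline on a small HTML fragment. The class names below are copied from the code above, but the structure is invented for illustration; it is not tianyancha's real markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment reusing the selectors from get_basic_info.
html = """
<div class="td-regStatus-value"><p>
 Active </p></div>
<div class="basic-td"><div class="c8"><span class="ng-binding">X123</span></div></div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Same cleanup as the real code: strip newlines and spaces from the status text.
status = soup.select('.td-regStatus-value > p')[0].text.replace("\n", "").replace(" ", "")
code = soup.select('.basic-td > .c8 > .ng-binding')[0].text
print(status, code)  # Active X123
```

Testing selectors this way avoids a 30-second page load on every iteration while you work out the right CSS paths.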