[Python] Simulating login and scraping Lagou job listings (selenium + phantomjs)
2017-10-15 18:27
Environment
python3.5
pip install selenium
phantomjs-2.1.1
pip install pyquery
Code
# -*- coding:utf-8 -*-
import time
import sys
import io

# Re-wrap stdout so that printing Chinese text does not raise an encoding error
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

from pyquery import PyQuery as pq
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Set the request header (User-Agent) for phantomjs
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36"
)
driver = webdriver.PhantomJS(
    desired_capabilities=dcap,
    executable_path=r"C:\Users\DELL\Desktop\Scrapy\phantomjs-2.1.1-windows\bin\phantomjs.exe",
)
driver.set_window_size(400, 100)


# Simulate login
def login(login_url, username, password):
    print("begin login...")
    try:
        driver.get(login_url)
        driver.find_element_by_css_selector(
            ".input_item.clearfix[data-propertyname='username'] input"
        ).send_keys(username)
        driver.find_element_by_css_selector(
            ".input_item.clearfix[data-propertyname='password'] input"
        ).send_keys(password)
        driver.find_element_by_css_selector(
            ".input_item.btn_group.clearfix[data-propertyname='submit'] input"
        ).click()
    except Exception as e:
        # Report the cause instead of silently swallowing it
        print("login wrong...", str(e))


# Simulate a search
def search_position(position_name):
    print("search position {}".format(position_name))
    try:
        search_input = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "search_input"))
        )
        search_input.send_keys(position_name)
        search_btn = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "search_button"))
        )
        search_btn.click()
    except Exception as e:
        print("search wrong...", str(e))


# Recursively parse the result pages one at a time
def parse_html():
    print("begin parse html...")
    try:
        # The last <span> in the pager is the "next page" control
        next_page_label = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, ".item_con_pager .pager_container span:last-child")
            )
        )
        html = pq(driver.page_source)
        items = html("#s_position_list .item_con_list li.con_list_item.default_list").items()
        for item in items:
            print(item.attr("data-company"))
            print(item.attr("data-positionname"))
            print(item.attr("data-salary"))
            print(item("a.position_link").attr("href"))
            print("\n")
        next_page_label.click()
        time.sleep(3)  # give the next page time to load
        parse_html()
    except Exception as e:
        print(str(e))


if __name__ == "__main__":
    login_url = "https://passport.lagou.com/login/login.html?ts=1508055021059&serviceId=lagou&service=https%253A%252F%252Fwww.lagou.com%252F&action=login&signature=101A9F09764AD83E3E2A035A1506AF7A"
    username = "your username"
    password = "your password"
    login(login_url, username, password)
    search_position("python")
    parse_html()
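The extraction step in parse_html() can be tried without a browser. The sketch below is only an illustration: it swaps pyquery for the standard-library html.parser so it runs with no extra installs, and it returns a list of dicts instead of printing. The sample markup, company names, and example.com links are made up and only modelled on the selectors used above; the real page may be structured differently. The names PositionParser and extract_positions are hypothetical helpers, not part of the original script.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the selectors in parse_html(); not real Lagou data.
SAMPLE_HTML = """
<div id="s_position_list">
  <ul class="item_con_list">
    <li class="con_list_item default_list" data-company="ExampleCo"
        data-positionname="Python Engineer" data-salary="15k-25k">
      <a class="position_link" href="https://example.com/jobs/1.html">Python Engineer</a>
    </li>
    <li class="con_list_item default_list" data-company="SampleSoft"
        data-positionname="Backend Developer" data-salary="10k-20k">
      <a class="position_link" href="https://example.com/jobs/2.html">Backend Developer</a>
    </li>
  </ul>
</div>
"""


class PositionParser(HTMLParser):
    """Collect one dict per <li class="con_list_item ..."> element."""

    def __init__(self):
        super().__init__()
        self.positions = []
        self._current = None  # record for the <li> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "con_list_item" in classes:
            # Same data-* attributes that the pyquery version prints
            self._current = {
                "company": attrs.get("data-company"),
                "position": attrs.get("data-positionname"),
                "salary": attrs.get("data-salary"),
                "link": None,
            }
        elif tag == "a" and "position_link" in classes and self._current is not None:
            self._current["link"] = attrs.get("href")

    def handle_endtag(self, tag):
        # Closing </li> finishes the current record
        if tag == "li" and self._current is not None:
            self.positions.append(self._current)
            self._current = None


def extract_positions(page_source):
    """Return a list of position dicts found in the given HTML string."""
    parser = PositionParser()
    parser.feed(page_source)
    return parser.positions


if __name__ == "__main__":
    for position in extract_positions(SAMPLE_HTML):
        print(position)
```

Returning records rather than printing them makes it easier to write the results to a file or database later; in the script above, driver.page_source could be passed in place of SAMPLE_HTML, assuming the live markup matches.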
Result