[Python Scraping in Practice] Scraping Lagou.com with Selenium
2020-01-11 17:41
I've been learning Python web scraping lately and found Selenium quite pleasant to work with, so I wrote up this hands-on post. If anything is wrong, please point it out!
Let's use Python engineer postings as the example. This spider drives a real Chrome browser through chromedriver (it's a browser driver, not an emulator), and uses the following imports:
```python
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from lxml import etree
import time
import re
import csv
```
Install the third-party packages with pip first (e.g. `pip install selenium lxml`; `time`, `re`, and `csv` ship with the standard library).
Once the imports are in place, we also need to tell Selenium where the chromedriver executable lives; you'll see that path defined as a class attribute in the initialization step below.
Now for the code itself. The Lagou spider breaks down into five parts:
1. Initialization:
```python
class LagouSpider(object):
    # NOTE: the original post never shows the class declaration; "LagouSpider"
    # is an assumed name. driver_path must be a class attribute, since
    # __init__ reads it through self.
    # Set this to wherever you placed the chromedriver executable:
    driver_path = r'C:\Users\professor\AppData\Local\Google\Chrome\Application\chromedriver.exe'

    def __init__(self):
        # Create a driver and point it at the chromedriver path
        self.driver = webdriver.Chrome(executable_path=self.driver_path)
        self.positions = []
        self.fp = open('lagou.csv', 'a', encoding='utf-8', newline='')
        # Field names must match the keys of the position dict built in
        # parse_detail_page ('acquire' in the original was a stray field and
        # 'company' was missing, which would crash DictWriter.writerows)
        self.writer = csv.DictWriter(self.fp, ['title', 'salary', 'city',
                                               'work_years', 'education',
                                               'company', 'company_website',
                                               'desc', 'origin_url'])
        self.writer.writeheader()
```
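A heads-up if you're on Selenium 4 or newer: the `executable_path` argument was removed, and the driver path now goes through a `Service` object instead. A minimal equivalent, assuming the same `driver_path` as above:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4+ style: wrap the driver path in a Service object
driver = webdriver.Chrome(service=Service(driver_path))
```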
2. Crawling the Python listing pages:
```python
    def run(self):
        url = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
        self.driver.get(url)
        while True:
            # Wait until the "next page" button has loaded
            WebDriverWait(driver=self.driver, timeout=10).until(
                EC.presence_of_element_located((By.XPATH, "//span[contains(@class,'pager_next')]"))
            )
            resource = self.driver.page_source
            self.parse_list_page(resource)
            next_btn = self.driver.find_element_by_xpath("//span[contains(@class,'pager_next')]")
            # Lagou tags the button with pager_next_disabled on the last page
            if "pager_next_disabled" in next_btn.get_attribute('class'):
                break
            else:
                next_btn.click()
            time.sleep(1)
```
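Two notes on this loop. First, `presence_of_element_located` only guarantees the element exists in the DOM, not that it's ready to click, so if the click occasionally flakes, `element_to_be_clickable` is the usual substitute. Second, the `find_element_by_xpath` helpers were removed in Selenium 4 in favor of `find_element(By.XPATH, ...)`. A sketch combining both:

```python
# Wait until the button is actually clickable, not merely present;
# the wait returns the element, so no separate find_element call is needed
next_btn = WebDriverWait(self.driver, timeout=10).until(
    EC.element_to_be_clickable((By.XPATH, "//span[contains(@class,'pager_next')]"))
)
```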
3. Extracting the detail-page links:
```python
    def parse_list_page(self, resource):
        html = etree.HTML(resource)
        # XPath pulls the href of every job posting on the list page
        # (for the syntax, see my earlier post on XPath:
        # https://editor.csdn.net/md/?articleId=103486019)
        links = html.xpath("//a[@class='position_link']/@href")
        for link in links:
            self.parse_detail_page(link)
            time.sleep(1)
```
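If you want to sanity-check that XPath on its own before running the whole spider, here's a tiny standalone snippet; the HTML string is invented for illustration:

```python
from lxml import etree

# Minimal stand-in for a Lagou list page
sample = '<div><a class="position_link" href="https://www.lagou.com/jobs/123.html">Python</a></div>'
html = etree.HTML(sample)
print(html.xpath("//a[@class='position_link']/@href"))
# ['https://www.lagou.com/jobs/123.html']
```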
4. Scraping the detail-page content:
```python
    def parse_detail_page(self, url):  # the def line was missing from the original post
        # Open the detail page in a new tab and switch the driver to it
        self.driver.execute_script("window.open('" + url + "')")
        self.driver.switch_to.window(self.driver.window_handles[1])
        WebDriverWait(self.driver, timeout=10).until(
            EC.presence_of_element_located((By.XPATH, "//dd[@class='job_bt']"))
        )
        resource = self.driver.page_source
        html = etree.HTML(resource)
        title = html.xpath("//span[@class='name']/text()")[0]
        company = html.xpath("//em[@class='fl-cn']/text()")[0].strip()
        job_request_span = html.xpath("//dd[@class='job_request']//span")
        salary = job_request_span[0].xpath(".//text()")[0].strip()
        # Lagou separates these fields with "/" and whitespace; strip both
        city = re.sub(r"[/\s]", "", job_request_span[1].xpath(".//text()")[0])
        work_years = re.sub(r"[/\s]", "", job_request_span[2].xpath(".//text()")[0])
        education = re.sub(r"[/\s]", "", job_request_span[3].xpath(".//text()")[0])
        company_website = html.xpath("//ul[@class='c_feature']/li[last()]/a/@href")[0]
        position_desc = "".join(html.xpath("//div[@class='job-detail']//text()"))
        position = {
            'title': title,
            'city': city,
            'salary': salary,
            'company': company,
            'company_website': company_website,
            'education': education,
            'work_years': work_years,
            'desc': position_desc,
            'origin_url': url,
        }
        # Close the detail tab and hand control back to the list page
        self.driver.close()
        self.driver.switch_to.window(self.driver.window_handles[0])
        self.write_position(position)
```
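One fragility worth flagging: every bare `[0]` above raises `IndexError` the moment Lagou tweaks its markup. A small defensive helper (my own addition, not part of the original code) keeps one bad page from killing the crawl:

```python
def first_or_default(results, default=''):
    # Return the first XPath match, or a default when nothing matched
    return results[0] if results else default

# e.g. inside parse_detail_page:
#   title = first_or_default(html.xpath("//span[@class='name']/text()"))
```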
5. Writing to the CSV file:
```python
    def write_position(self, position):
        # Buffer rows and flush them to the CSV in batches of 100
        if len(self.positions) >= 100:
            self.writer.writerows(self.positions)
            self.positions.clear()
        self.positions.append(position)
        print(position)
```
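Note that with this batching, anything still in the buffer when the crawl ends never reaches the file, and neither the file nor the browser is ever closed. A minimal cleanup sketch plus an entry point; the `close` method and the `__main__` block are my additions (and `LagouSpider` is the class name assumed earlier):

```python
    def close(self):
        # Flush whatever is left in the buffer, then release the file and browser
        if self.positions:
            self.writer.writerows(self.positions)
            self.positions.clear()
        self.fp.close()
        self.driver.quit()


if __name__ == '__main__':
    spider = LagouSpider()
    try:
        spider.run()
    finally:
        spider.close()
```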
The code above can only pull a small amount of data before Lagou's anti-scraping measures kick in; I haven't worked around them yet!
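If you want to push a little further, the usual first steps are hiding the obvious automation fingerprints and randomizing the request rhythm. A sketch under those assumptions; these flags make the session look less automated, but this is not a tested bypass of Lagou's defenses:

```python
import random

options = webdriver.ChromeOptions()
# Drop the "Chrome is being controlled by automated software" banner
options.add_experimental_option('excludeSwitches', ['enable-automation'])
# Disable the navigator.webdriver automation flag
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(executable_path=driver_path, options=options)

# Randomize pauses between requests instead of a fixed one second
time.sleep(random.uniform(2, 5))
```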