
[Python Scraping in Practice] Building a Lagou.com Scraper with Selenium

2020-01-11 17:41

I've been studying Python web scraping lately, and Selenium has turned out to be quite handy for it, so I wrote up this hands-on post. If you spot any mistakes, please let me know!
We'll take scraping Python engineer job postings as our running example. The scraper drives Chrome through chromedriver, and we'll also need these libraries:

from selenium import webdriver
from selenium.webdriver.support.ui import Select,WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from lxml import etree
import time
import re
import csv

We install these libraries with pip, as shown below.
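Only selenium and lxml are third-party packages (time, re, and csv ship with Python), so a single pip command covers the setup:

pip install selenium lxml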
With the imports in place, we also need to define the path to the chromedriver executable; that constant sits at the top of the spider class in part 1 below.

Now we can start writing the code. The scraper is one class (called LagouSpider here), and building it breaks down into five parts:

1. Initialization:

class LagouSpider(object):
    # Path to your chromedriver executable -- adjust this to wherever you installed it
    driver_path = r'C:\Users\professor\AppData\Local\Google\Chrome\Application\chromedriver.exe'

    def __init__(self):
        # Create a driver, pointing it at the chromedriver executable
        self.driver = webdriver.Chrome(executable_path=self.driver_path)
        self.positions = []
        self.fp = open('lagou.csv', 'a', encoding='utf-8', newline='')
        # The field names must match the keys of the position dict built in
        # parse_detail_page, otherwise DictWriter raises a ValueError on write
        self.writer = csv.DictWriter(self.fp, ['title', 'salary', 'city', 'work_years', 'education', 'company', 'company_website', 'desc', 'origin_url'])
        self.writer.writeheader()
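A side note on the driver setup: if Lagou starts flagging the browser, one common mitigation is to pass a ChromeOptions object when constructing the driver. The following is only a sketch of that idea, not part of the original code, and the user-agent string is a placeholder:

options = webdriver.ChromeOptions()
# Present a normal browser user agent instead of the default automation one
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)')
# Hide the "Chrome is being controlled by automated test software" banner
options.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = webdriver.Chrome(executable_path=driver_path, options=options)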

2. Load the Python job listings and page through the results:

def run(self):
    url = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
    self.driver.get(url)
    while True:
        # Wait until the "next page" button has rendered before grabbing the source
        WebDriverWait(driver=self.driver, timeout=10).until(
            EC.presence_of_element_located((By.XPATH, "//span[contains(@class,'pager_next')]"))
        )
        resource = self.driver.page_source
        self.parse_list_page(resource)
        next_btn = self.driver.find_element_by_xpath("//span[contains(@class,'pager_next')]")
        # On the last page the button gains the pager_next_disabled class
        if "pager_next_disabled" in next_btn.get_attribute('class'):
            break
        else:
            next_btn.click()
        time.sleep(1)
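One caveat: if the page is slow or Lagou blocks us, the WebDriverWait above raises a TimeoutException after 10 seconds and the loop crashes. As a sketch of a gentler failure mode (my addition, not in the original code), the wait inside the while loop could be wrapped like this:

from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver=self.driver, timeout=10).until(
        EC.presence_of_element_located((By.XPATH, "//span[contains(@class,'pager_next')]"))
    )
except TimeoutException:
    # The pager never appeared -- we are probably blocked or the page failed to load
    print('timed out waiting for the pager, stopping')
    break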

3. Collect the detail-page links from the list page:

def parse_list_page(self, resource):
    html = etree.HTML(resource)
    # This XPath pulls every detail-page link off the list page; for the syntax,
    # see my earlier post on XPath: https://editor.csdn.net/md/?articleId=103486019
    links = html.xpath("//a[@class='position_link']/@href")
    for link in links:
        self.parse_detail_page(link)
        time.sleep(1)
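To see what that XPath expression does, here is a tiny self-contained example against a made-up snippet of list-page HTML:

from lxml import etree

sample = '''
<div class="list">
  <a class="position_link" href="https://www.lagou.com/jobs/1.html">Python Engineer</a>
  <a class="position_link" href="https://www.lagou.com/jobs/2.html">Python Backend</a>
</div>
'''
html = etree.HTML(sample)
print(html.xpath("//a[@class='position_link']/@href"))
# ['https://www.lagou.com/jobs/1.html', 'https://www.lagou.com/jobs/2.html']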

4. Scrape the detail-page content:

def parse_detail_page(self, url):
    # Open the detail page in a new tab so we keep our place on the list page
    self.driver.execute_script("window.open('" + url + "')")
    self.driver.switch_to.window(self.driver.window_handles[1])
    WebDriverWait(self.driver, timeout=10).until(
        EC.presence_of_element_located((By.XPATH, "//dd[@class='job_bt']"))
    )
    resource = self.driver.page_source
    html = etree.HTML(resource)
    title = html.xpath("//span[@class='name']/text()")[0]
    company = html.xpath("//em[@class='fl-cn']/text()")[0].strip()
    job_request_span = html.xpath("//dd[@class='job_request']//span")
    salary = job_request_span[0].xpath(".//text()")[0]
    salary = salary.strip()
    city = job_request_span[1].xpath(".//text()")[0]
    city = re.sub(r"[/\s]", "", city)
    work_years = job_request_span[2].xpath(".//text()")[0]
    work_years = re.sub(r"[/\s]", "", work_years)
    education = job_request_span[3].xpath(".//text()")[0]
    education = re.sub(r"[/\s]", "", education)
    company_website = html.xpath("//ul[@class='c_feature']/li[last()]/a/@href")[0]
    position_desc = "".join(html.xpath("//div[@class='job-detail']//text()"))
    position = {
        'title': title,
        'city': city,
        'salary': salary,
        'company': company,
        'company_website': company_website,
        'education': education,
        'work_years': work_years,
        'desc': position_desc,
        'origin_url': url
    }
    # Close the detail tab and switch back to the list page
    self.driver.close()
    self.driver.switch_to.window(self.driver.window_handles[0])
    self.write_position(position)
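The re.sub(r"[/\s]", "", ...) calls above strip the slashes and whitespace that Lagou pads its requirement fields with. A quick illustration on a made-up field value:

import re

raw = ' /3-5年 / '
print(re.sub(r"[/\s]", "", raw))   # prints: 3-5年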

5. Write the results to the CSV file:

def write_position(self, position):
    # Buffer rows in memory and flush to the CSV in batches of 100 to cut disk I/O
    if len(self.positions) >= 100:
        self.writer.writerows(self.positions)
        self.positions.clear()
    self.positions.append(position)
    print(position)
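Note one gap in this batching scheme: whatever is still in the buffer when the program exits (fewer than 100 rows) never reaches the CSV. A small close method, my addition rather than part of the original post, can flush the remainder and release resources:

def close(self):
    # Flush any rows still buffered, then release the file and the browser
    if self.positions:
        self.writer.writerows(self.positions)
        self.positions.clear()
    self.fp.close()
    self.driver.quit()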

The code above can only pull a modest amount of data; I haven't yet worked around Lagou's anti-scraping measures!
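For completeness, a minimal entry point (using the LagouSpider class name assumed above) would look like:

if __name__ == '__main__':
    spider = LagouSpider()
    spider.run()
    spider.close()   # flush the last buffered rows -- see the sketch in part 5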
