Crawler in Practice 15: Scraping Lagou Python job listings with Selenium and saving them to MySQL
2019-06-09 20:56
```python
import time

import pymysql
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def get_page():
    """Open the Lagou homepage, search for "Python", and collect the job-detail URLs."""
    url = "https://www.lagou.com/"
    browser = webdriver.Chrome()
    browser.get(url)
    # Close the nationwide city-selection popup that covers the page
    browser.find_element_by_id("cboxClose").click()
    # Explicit wait for the search box instead of a fixed time.sleep(5)
    wait = WebDriverWait(browser, 30)
    search_input = wait.until(EC.presence_of_element_located((By.ID, "search_input")))
    search_input.send_keys("Python")
    time.sleep(1)
    browser.find_element_by_id("search_button").click()
    # Parse the rendered result page and pull every job-detail link
    html = etree.HTML(browser.page_source)
    url_list = html.xpath('//a[@class="position_link"]/@href')
    browser.quit()
    return url_list


def parse_page(position_url):
    """Parse one job-detail page and write the extracted fields to MySQL."""
    db = pymysql.connect(host='localhost', user='root', password='123456',
                         port=3306, db='mysql', charset='utf8mb4')
    cursor = db.cursor()
    driver = webdriver.Chrome()
    driver.get(position_url)
    html = etree.HTML(driver.page_source)

    # Note the trailing space in "position-content " — it is part of the class attribute
    name = html.xpath('//div[@class="position-content "]//span[@class="name"]/text()')[0]
    salary = html.xpath('//div[@class="position-content "]//span[@class="salary"]/text()')[0].strip()
    drrs = html.xpath('//div[@class="position-content "]//dd[@class="job_request"]/p/span[2]/text()')[0].split('/')[1].strip()
    years = html.xpath('//div[@class="position-content "]//dd[@class="job_request"]/p/span[3]/text()')[0].split('/')[0].strip()
    jingyan = html.xpath('//div[@class="position-content "]//dd[@class="job_request"]/p/span[4]/text()')[0].split('/')[0].strip()
    zhiye = html.xpath('//div[@class="position-content "]//dd[@class="job_request"]/p/span[5]/text()')[0]
    company = html.xpath('//div[@class="job_company_content"]//em[@class="fl-cn"]/text()')[0].strip()
    infos = ''.join(html.xpath('//div[@class="job-detail"]/p/text()'))

    job_info = {
        'position': name,
        'salary': salary,
        'address': drrs,
        'working years': years,
        'experience': jingyan,
        'job type': zhiye,        # full-time or not
        'company': company,
        'description': infos,
    }

    # Parameterized query: let the driver escape values instead of
    # building the SQL with string formatting (avoids injection and quoting bugs)
    sql = '''INSERT INTO func(names, salarys, drrss, yearss, jingyans,
                              zhiyes, companys, infoss)
             VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'''
    cursor.execute(sql, (name, salary, drrs, years, jingyan, zhiye, company, infos))
    db.commit()
    db.close()
    print(job_info)
    driver.quit()


def main():
    for position_url in get_page():
        parse_page(position_url)


if __name__ == '__main__':
    main()
```
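The `INSERT` above assumes a table named `func` already exists in the target database; the post never shows its schema. A minimal sketch of a compatible table (the column types and lengths are my assumptions, not from the original):

```sql
-- Hypothetical schema matching the INSERT column list above
CREATE TABLE IF NOT EXISTS func (
    names    VARCHAR(100),   -- job title
    salarys  VARCHAR(50),    -- salary range, e.g. "15k-25k"
    drrss    VARCHAR(50),    -- city/address
    yearss   VARCHAR(50),    -- required working years
    jingyans VARCHAR(50),    -- education/experience requirement
    zhiyes   VARCHAR(20),    -- full-time / part-time
    companys VARCHAR(100),   -- company name
    infoss   TEXT            -- full job description
) DEFAULT CHARSET = utf8mb4;
```

Using `utf8mb4` keeps the Chinese job descriptions intact; you would also normally create this table in a dedicated database rather than the built-in `mysql` one the connection string points at.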