
Scrapy in practice: crawling all Tencent recruitment postings with CrawlSpider (a crawl that follows links to some depth)

2018-01-30 16:29
After some basic study of Scrapy, we can build the following crawler: it scrapes every posting from Tencent's recruitment site, saves the rough summary information to tencent.json, and saves the more detailed information for each position (duties and requirements) to positiondescribe.json.

That is, we need two items to hold the page data, and the spider has to inherit from CrawlSpider so that the relevant links can be extracted from each page.

The project layout is as follows (the project is named TencentSpider):

TencentSpider
│  items.py
│  middlewares.py
│  pipelines.py
│  settings.py
│  __init__.py
│
├─spiders
│  │  tencent.py
│  │  __init__.py
│  │
│  └─__pycache__
│          tencent.cpython-36.pyc
│          __init__.cpython-36.pyc
│
└─__pycache__
        items.cpython-36.pyc
        pipelines.cpython-36.pyc
        settings.cpython-36.pyc
        __init__.cpython-36.pyc
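
This layout is what Scrapy's project and CrawlSpider templates produce. Assuming the standard CLI was used, the project and spider would have been created with something like:

scrapy startproject TencentSpider
cd TencentSpider
scrapy genspider -t crawl tencent hr.tencent.com

(The scrapy.cfg file that startproject also creates one level up is left out of the tree above.)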


The main difficulties are:

1. Handling multiple items: in the pipelines file, check which kind of item has been passed in. The item's class name can be read via item.__class__.__name__ (an isinstance() check works just as well; see the sketch after this list).

2. Writing the rules in the spider: be clear about exactly which links or pages need to be matched and followed.

3. Writing the parsing callbacks in the spider file: remember that once you inherit from CrawlSpider you must not override the parse method; instead you write your own callbacks. Since the extracted list links are followed into the detail pages, we end up writing two callbacks, one for the list pages and one for the detail pages.
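
For point 1, here is a minimal sketch of the isinstance() alternative (only the dispatch logic is shown; the class names match the items defined in items.py below):

# Sketch only: dispatch on the item type with isinstance() instead of the class name.
from TencentSpider.items import TencentspiderItem, PositionDescribe

def process_item(self, item, spider):
    if isinstance(item, TencentspiderItem):
        pass  # write the summary record to tencent.json
    elif isinstance(item, PositionDescribe):
        pass  # write the detail record to positiondescribe.json
    return item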

items.py is as follows:

# items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html 
import scrapy

class TencentspiderItem(scrapy.Item):
    # position name
    positionName = scrapy.Field()
    # detail-page link
    positionLink = scrapy.Field()
    # position category
    positionType = scrapy.Field()
    # number of openings
    peopleNum = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # publish date
    publishTime = scrapy.Field()


class PositionDescribe(scrapy.Item):
    # position name
    positionName = scrapy.Field()
    # position category
    positionType = scrapy.Field()
    # number of openings
    peopleNum = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # duties
    duty = scrapy.Field()
    # requirements
    requirement = scrapy.Field()
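
These Item classes behave like dicts: fields are written and read with item['field'], and dict(item), which the pipeline below relies on, turns a filled item into a plain dict that json.dumps can serialize. A tiny illustration (the value is made up, purely for demonstration):

item = TencentspiderItem()
item['positionName'] = 'example position'   # hypothetical value
dict(item)   # -> {'positionName': 'example position'}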


tencent.py is as follows:

# tencent.py
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from TencentSpider.items import TencentspiderItem
from TencentSpider.items import PositionDescribe

class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?&start=0#a']

    rules = (
        # list pages: position.php?&start=10, 20, ...
        Rule(LinkExtractor(allow=r'&start=\d+'), callback='tencentParse', follow=True),
        # detail pages: position_detail.php?id=... (note the escaped "." and "?")
        Rule(LinkExtractor(allow=r'position_detail\.php\?'), callback='positionParse', follow=True),
    )

    def tencentParse(self, response):
        jobs_list = response.xpath('//tr[@class="even" or @class="odd"]')

        for node in jobs_list:
            item = TencentspiderItem()
            name = node.xpath('./td[1]/a/text()').extract()[0]
            link = node.xpath('./td[1]/a/@href').extract()[0]
            type = ''.join(node.xpath('./td[2]/text()').extract())
            num = node.xpath('./td[3]/text()').extract()[0]
            location = node.xpath('./td[4]/text()').extract()[0]
            date = node.xpath('./td[5]/text()').extract()[0]

            item['positionName'] = name
            item['positionLink'] = 'https://hr.tencent.com/' + str(link)
            item['positionType'] = type
            item['peopleNum'] = num
            item['workLocation'] = location
            item['publishTime'] = date

            yield item

    def positionParse(self, response):
        item = PositionDescribe()
        # extract_first() returns a single string (the original code stored whole lists)
        name = response.xpath('//td[@id="sharetitle"]/text()').extract_first()
        location = response.xpath('//tr[@class="c bottomline"]/td[1]/text()').extract_first()
        type = response.xpath('//tr[@class="c bottomline"]/td[2]/text()').extract_first()
        num = response.xpath('//tr[@class="c bottomline"]/td[3]/text()').extract_first()

        # join the <li> lines of the duty section into one string
        duties = response.xpath('//table//tr[3]//ul/li/text()').extract()
        s = ''
        for duty in duties:
            s += duty

        # join the <li> lines of the requirement section into one string
        requirements = response.xpath('//table//tr[4]//ul/li/text()').extract()
        q = ''
        for require in requirements:
            q += require

        # position name
        item['positionName'] = name
        # position category
        item['positionType'] = type
        # number of openings
        item['peopleNum'] = num
        # work location
        item['workLocation'] = location
        # duties
        item['duty'] = s
        # requirements
        item['requirement'] = q

        yield item
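
Before running the full crawl, the XPath expressions above can be checked interactively with Scrapy's shell (the URL is the first list page from start_urls; the selectors are the same ones used in tencentParse):

scrapy shell "https://hr.tencent.com/position.php?&start=0#a"

# then, inside the shell:
rows = response.xpath('//tr[@class="even" or @class="odd"]')
len(rows)                                           # number of postings on this page
rows[0].xpath('./td[1]/a/text()').extract_first()   # first position name
rows[0].xpath('./td[1]/a/@href').extract_first()    # its relative detail link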


pipelines.py is as follows:

# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentspiderPipeline(object):

    def __init__(self):
        self.file = open('tencent.json', 'a', encoding='utf-8')
        self.file2 = open('positiondescribe.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # dispatch on the item's class name (an isinstance() check would also work)
        if item.__class__.__name__ == 'TencentspiderItem':
            jsontext = json.dumps(dict(item), ensure_ascii=False) + ',\n'
            self.file.write(jsontext)
        else:
            jsontext = json.dumps(dict(item), ensure_ascii=False) + ',\n'
            self.file2.write(jsontext)

        return item

    def close_spider(self, spider):
        self.file.close()
        self.file2.close()


These three files are the heart of the project; after the crawl finishes, the two JSON files are generated. Note that each line in them is one JSON object followed by a comma, so a file is not itself a single valid JSON document; strip the trailing commas or wrap the content in [] if you want to load it in one go.
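
One step the comment in pipelines.py hints at: the pipeline must be registered in settings.py before it receives any items. A minimal sketch (the priority 300 is just a common default, nothing specific to this project):

# settings.py (excerpt)
ITEM_PIPELINES = {
    'TencentSpider.pipelines.TencentspiderPipeline': 300,
}

The crawl is then started from the project directory with:

scrapy crawl tencent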