Scrapy Framework: Crawling a Job-Listing Site with CrawlSpider

2018-05-12 10:50

CrawlSpider

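The two most commonly used spider classes in Scrapy are Spider and CrawlSpider.

CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages listed in start_urls,

whereas CrawlSpider defines a set of rules (Rule objects) that give it a convenient mechanism for following links. This makes it the better fit when the job is to extract links from crawled pages and keep crawling them.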

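Command to create the project:

scrapy startproject tenCent

Command to generate the CrawlSpider:

scrapy genspider -t crawl crawl_tencent "hr.tencent.com"

CrawlSpider inherits from Spider. In addition to the inherited attributes (name, allowed_domains), it provides new attributes and methods: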

LinkExtractor

from scrapy.linkextractors import LinkExtractor

LinkExtractor(allow=r'start=\d+')

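Links are extracted by instantiating a LinkExtractor.

Main parameters

allow: URLs that match this regular expression (or list of regular expressions) are extracted. If it is empty, every link matches.

deny: URLs that match this regular expression (or list of regular expressions) are never extracted. deny takes precedence over allow.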
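The allow/deny semantics can be modeled with the standard library alone. The following is a minimal sketch, not Scrapy's implementation: filter_links is a hypothetical helper, the URLs are made up for illustration, and it assumes patterns are applied as a search anywhere in the URL.

```python
import re


def filter_links(urls, allow=(), deny=()):
    """Keep URLs that match any `allow` pattern and no `deny` pattern.

    An empty `allow` accepts every URL; `deny` always wins over `allow`.
    """
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    kept = []
    for url in urls:
        if allow_res and not any(r.search(url) for r in allow_res):
            continue  # an allow list exists and this URL is not on it
        if any(r.search(url) for r in deny_res):
            continue  # denied URLs are dropped even if they were allowed
        kept.append(url)
    return kept


# Hypothetical URLs for illustration
urls = [
    "https://hr.tencent.com/position.php?start=10",
    "https://hr.tencent.com/position_detail.php?id=42",
    "https://hr.tencent.com/about.php",
]

paging = filter_links(urls, allow=[r"start=\d+"])
not_detail = filter_links(urls, deny=[r"position_detail\.php"])
print(paging)
print(not_detail)
```

Here paging keeps only the pagination URL, while not_detail keeps everything except the detail page.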

rules

Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_tencent', follow=True),

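The rules attribute contains one or more Rule objects. Each Rule defines a specific behavior for crawling the site.

If several rules match the same link, the first one, in the order the rules are defined in this attribute, is used.

Main parameters

link_extractor: a LinkExtractor object that defines which links to extract.
callback: the callback invoked for each link obtained from link_extractor. It can be a callable or the name of a spider method given as a string, and it receives a response as its first argument.

Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the crawl spider will stop working.

follow: a boolean that specifies whether links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.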
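The two behaviors described above, first matching rule wins and the follow default, can be illustrated with a small stand-in. This is only a sketch under stated assumptions: SketchRule and first_matching_rule are hypothetical names, not Scrapy classes, and the second and third patterns are invented to create an overlap.

```python
import re


class SketchRule:
    """Tiny stand-in for a crawl rule: a URL pattern, a callback name, a follow flag."""

    def __init__(self, allow, callback=None, follow=None):
        self.pattern = re.compile(allow)
        self.callback = callback
        # Default described in the text: follow only when there is no callback
        self.follow = follow if follow is not None else callback is None


def first_matching_rule(url, rules):
    """Return the first rule, in definition order, whose pattern matches the URL."""
    for rule in rules:
        if rule.pattern.search(url):
            return rule
    return None


rules = (
    SketchRule(r"start=\d+", callback="parse_tencent", follow=True),
    SketchRule(r"position", callback="parse_other"),  # overlaps with the first rule
    SketchRule(r"about"),  # no callback, so follow defaults to True
)

chosen = first_matching_rule("https://hr.tencent.com/position.php?start=10", rules)
print(chosen.callback, rules[1].follow, rules[2].follow)
```

The URL matches both of the first two patterns, but only the rule defined first is selected.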

Crawling data with CrawlSpider

1. Write the items file

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # Position title
    position_name = scrapy.Field()
    # Link to the detail page
    position_link = scrapy.Field()
    # Position category
    position_type = scrapy.Field()
    # Number of openings
    position_number = scrapy.Field()
    # Work location
    work_location = scrapy.Field()
    # Publication date
    publish_times = scrapy.Field()
    # Job responsibilities
    position_duty = scrapy.Field()
    # Job requirements
    position_require = scrapy.Field()


class DetailItem(scrapy.Item):
    # Job responsibilities
    position_duty = scrapy.Field()
    # Job requirements
    position_require = scrapy.Field()

2. Write the CrawlSpider file

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from tenCent.items import DetailItem, TencentItem


class CrawlTencentSpider(CrawlSpider):
    name = 'crawl_tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    # Rule / LinkExtractor arguments:
    #   allow: regular expression that links must match
    #   callback: name of the method that handles each matched response
    #   follow: whether to keep extracting links from matched pages (links within links)
    rules = (
        # Listing pages: parse each one and keep following the pagination links
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_tencent', follow=True),
        # Detail pages linked from the listing pages: parse them, but do not follow further
        Rule(LinkExtractor(allow=r'position_detail\.php\?id=\d+'), callback='parse_detail', follow=False),
    )

    def parse_tencent(self, response):
        print('start……')
        # Select every tr element whose class attribute is "even" or "odd"
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for node in node_list:
            item = TencentItem()
            item['position_name'] = node.xpath('./td[1]/a/text()').extract_first()  # text of the a tag in the 1st td
            item['position_link'] = node.xpath('./td[1]/a/@href').extract_first()  # href of the a tag in the 1st td
            item['position_type'] = node.xpath('./td[2]/text()').extract_first()  # text of the 2nd td
            item['position_number'] = node.xpath('./td[3]/text()').extract_first()  # text of the 3rd td
            item['work_location'] = node.xpath('./td[4]/text()').extract_first()  # text of the 4th td
            item['publish_times'] = node.xpath('./td[5]/text()').extract_first()  # text of the 5th td

            yield item

    def parse_detail(self, response):
        # The first "squareli" list holds the responsibilities, the second the requirements
        sections = response.xpath('//ul[@class="squareli"]')
        item = DetailItem()
        item['position_duty'] = ''.join(sections[0].xpath('./li/text()').extract())  # join into one string
        item['position_require'] = ''.join(sections[1].xpath('./li/text()').extract())  # join into one string

        yield item
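The spider stores position_link exactly as it appears in the href attribute. If the site serves relative hrefs (an assumption; the article does not show the page markup), the link has to be resolved against the page URL before it can be used on its own. A minimal sketch with the standard library, using a hypothetical page URL and href:

```python
from urllib.parse import urljoin

# Hypothetical values: a listing page URL and a relative href taken from it
page_url = "https://hr.tencent.com/position.php?start=10"
relative_href = "position_detail.php?id=42"

# Resolve the href against the page it was found on
absolute_link = urljoin(page_url, relative_href)
print(absolute_link)
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the response URL.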

3. Create the pipeline file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from .items import DetailItem, TencentItem


class TencentPipeline(object):
    def open_spider(self, spider):
        """
        Optional hook, called when the spider is opened.

        :param spider: the Spider object that was opened
        """
        self.file = open('tencent.json', 'w', encoding='utf-8')
        json_header = '{ "tencent_info":['
        self.count = 0
        self.file.write(json_header)  # write the opening of the JSON document

    def close_spider(self, spider):
        """
        Optional hook, called when the spider is closed.

        :param spider: the Spider object that was closed
        """
        json_tail = '] }'
        if self.count:  # strip a trailing comma only if an item was written
            self.file.seek(self.file.tell() - 1)  # move back to the last comma
            self.file.truncate()  # cut off everything from that position
        self.file.write(json_tail)  # write the closing of the JSON document
        self.file.close()

    def process_item(self, item, spider):
        """
        Required method, called for every item by each pipeline component.
        It must return an item (or raise DropItem); dropped items are not
        processed by later pipeline components.

        :param item: the scraped item
        :param spider: the Spider object that scraped the item
        """
        if isinstance(item, TencentItem):
            print('--' * 20)
            content = json.dumps(dict(item), ensure_ascii=False, indent=2) + ","  # dict to JSON string
            self.count += 1
            print('content', self.count)
            self.file.write(content)  # save to file
        # Returning the item passes it on, by priority, to the next pipeline
        # (DetailPipeline). Items that are not TencentItem instances skip the
        # JSON write above and are passed straight through. The return sits
        # outside the if block: placed inside it, items of any other type
        # would stop here and the detail data would be lost.
        return item


class DetailPipeline(object):
    def open_spider(self, spider):
        """
        Optional hook, called when the spider is opened.

        :param spider: the Spider object that was opened
        """
        self.file = open('detail.json', 'w', encoding='utf-8')
        json_header = '{ "detail_info":['
        self.count = 0
        self.file.write(json_header)  # write the opening of the JSON document

    def close_spider(self, spider):
        """
        Optional hook, called when the spider is closed.

        :param spider: the Spider object that was closed
        """
        json_tail = '] }'
        if self.count:  # strip a trailing comma only if an item was written
            self.file.seek(self.file.tell() - 1)  # move back to the last comma
            self.file.truncate()  # cut off everything from that position
        self.file.write(json_tail)  # write the closing of the JSON document
        self.file.close()

    def process_item(self, item, spider):
        """
        Required method, called for every item by each pipeline component.
        It must return an item (or raise DropItem); dropped items are not
        processed by later pipeline components.

        :param item: the scraped item
        :param spider: the Spider object that scraped the item
        """
        if isinstance(item, DetailItem):
            # DetailItem instances are written to the JSON file;
            # anything else is returned unchanged to the next pipeline
            print('**' * 30)
            content = json.dumps(dict(item), ensure_ascii=False, indent=2) + ","  # dict to JSON string
            self.count += 1
            print('content', self.count)
            self.file.write(content)  # save to file
        return item
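The pipelines above append a comma after every item and remove the last one with seek and truncate when the spider closes. One possible alternative, shown here only as a sketch, is to write the separator before every item except the first, so the output needs no fix-up at the end. JsonArrayWriter is a hypothetical helper, and the sample items are invented; an in-memory buffer stands in for the output file.

```python
import io
import json


class JsonArrayWriter:
    """Write items as a JSON array under a single key, one item at a time."""

    def __init__(self, stream, key):
        self.stream = stream
        self.count = 0
        self.stream.write('{ ' + json.dumps(key) + ':[')

    def write_item(self, item):
        if self.count:  # the separator goes before every item except the first
            self.stream.write(",")
        self.stream.write(json.dumps(item, ensure_ascii=False, indent=2))
        self.count += 1

    def close(self):
        self.stream.write("] }")


# Hypothetical items, written to an in-memory buffer instead of a file
buffer = io.StringIO()
writer = JsonArrayWriter(buffer, "tencent_info")
writer.write_item({"position_name": "Engineer"})
writer.write_item({"position_name": "Designer"})
writer.close()

parsed = json.loads(buffer.getvalue())
print(len(parsed["tencent_info"]))
```

Because nothing is removed after the fact, the output is also valid JSON when no item is written at all.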

4. Configure settings

# 1. Project name; the default USER_AGENT is built from it, and it is also used as the logger name
BOT_NAME = 'tenCent'
# 2. Spider module paths
SPIDER_MODULES = ['tenCent.spiders']
NEWSPIDER_MODULE = 'tenCent.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# A browser-like value to get past basic anti-crawler checks; set only the header value, not the "User-Agent:" name
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Logging
LOG_FILE = 'tencent.log'
LOG_LEVEL = 'DEBUG'
LOG_ENCODING = 'utf-8'
LOG_DATEFORMAT = '%m/%d/%Y %H:%M:%S %p'

# Pipelines run in ascending order of these values
ITEM_PIPELINES = {
    'tenCent.pipelines.TencentPipeline': 300,
    'tenCent.pipelines.DetailPipeline': 400,
}

5. Run the spider

scrapy crawl crawl_tencent

With the settings and pipelines above, the run writes three files: tencent.log, tencent.json, and detail.json.
