
Scrapy framework crawler: a worked example scraping Tencent job postings

2018-08-10 13:56

1. Create a project: scrapy startproject tencent (the last argument is the project name).
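For reference, startproject generates a skeleton like this (layout from a standard Scrapy install; exact files may vary slightly by version):

tencent/
    scrapy.cfg            # deployment configuration
    tencent/
        __init__.py
        items.py          # item definitions (step 2)
        pipelines.py      # item pipelines (step 5)
        settings.py       # project settings (step 4)
        spiders/          # spider modules (step 3)
            __init__.py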

2. Write the items file; it defines the fields used to store the scraped data.

import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # position name
    positionname = scrapy.Field()
    # detail page link
    positionlink = scrapy.Field()
    # position category
    positionType = scrapy.Field()
    # number of openings
    peopleNum = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # publish date
    publishTime = scrapy.Field()
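An Item behaves like a dict with a fixed set of allowed keys. A minimal sketch (with made-up values) of how the spider in the next step fills one in:

item = TencentItem()
item['positionname'] = "Example Engineer"   # hypothetical value
item['workLocation'] = "Shenzhen"           # hypothetical value
print(dict(item))  # {'positionname': 'Example Engineer', 'workLocation': 'Shenzhen'}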

3. In the spiders directory, generate the spider: scrapy genspider tencent "tencent.com" (tencent is the spider name, "tencent.com" restricts the crawl scope).

# -*- coding: utf-8 -*-
import scrapy
from tencent.items import TencentItem

class TencentpositionSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    #start_urls = ['http://tencent.com/']
    url = "https://hr.tencent.com/position.php?&start="
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # initialise the item
            item = TencentItem()
            # position name
            item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
            # detail page link
            item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
            # position category (some rows leave this cell empty)
            try:
                item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
            except IndexError:
                item['positionType'] = ""
            # number of openings
            item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
            # work location
            item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
            # publish date
            item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]
            # hand the item to the pipeline
            yield item

        # after finishing one page, request the next one: increase
        # self.offset by 10, append it to the base url, and register
        # self.parse as the callback for the new response
        if self.offset < 3000:
            self.offset += 10
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
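A side note on the extract()[0] calls above: they raise IndexError whenever a cell is empty, which is why positionType needs the try/except. A safer variant is extract_first() with a default, which never raises. A sketch of the same extraction:

# sketch: the same fields via extract_first(), which returns the
# default instead of raising when the cell is empty
item['positionname'] = each.xpath("./td[1]/a/text()").extract_first(default="")
item['positionType'] = each.xpath("./td[2]/text()").extract_first(default="")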
4. Edit settings.py and set a default request header:

DEFAULT_REQUEST_HEADERS = {
   'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0',
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

It is best to throttle requests a little as well: DOWNLOAD_DELAY = 3
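One more setting this tutorial depends on: the pipeline from step 5 only runs if it is registered in ITEM_PIPELINES. Assuming the module path that startproject generates, the entry looks like this:

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,  # 300 is the run order (0-1000, lower runs first)
}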

5. Write the pipeline file, which processes the scraped items:

import json

class TencentPipeline(object):
    def __init__(self):
        # runs automatically when the pipeline is instantiated;
        # open the file with utf-8 so Chinese text is written correctly
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # json.dumps escapes Chinese text to ASCII by default;
        # ensure_ascii=False keeps the real characters in the output
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
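Each line of tencent.json is a separate JSON object (JSON Lines rather than one JSON array), so it can be read back line by line. A quick sanity-check script, assuming the file produced above:

import json

with open("tencent.json", encoding="utf-8") as f:
    for line in f:
        job = json.loads(line)
        print(job["positionname"], job["workLocation"])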

6. Finally, run the spider: scrapy crawl tencent
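As an aside, Scrapy's built-in feed exports can write items to disk without a custom pipeline; for example, scrapy crawl tencent -o tencent.jl produces a JSON Lines file directly.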

The crawl takes time to download each page (especially with the 3-second delay), so be patient while it runs.

The output looks something like this:

"positionname": "25928-腾讯游戏云前端开发高级工程师(深圳)", "publishTime": "2018-08-10", "positionlink": "position_detail.php?id=43288&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "深圳"}
{"positionname": "25664-智慧城市解决方案销售顾问(深圳)", "publishTime": "2018-08-10", "positionlink": "position_detail.php?id=43282&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "深圳"}
{"positionname": "TEG02-海外网络运营工程师(深圳)", "publishTime": "2018-08-10", "positionlink": "position_detail.php?id=43284&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "深圳"}
{"positionname": "SNG03-直播用户运营产品经理(深圳)", "publishTime": "2018-08-10", "positionlink": "position_detail.php?id=43286&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "
