Scrapy framework crawler: an example of scraping Tencent job postings
1. Create a project with: scrapy startproject tencent (the last argument is the project name)
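This generates a standard project skeleton, roughly like the following (the exact set of files depends on your Scrapy version):
tencent/
    scrapy.cfg
    tencent/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py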
2. Write the items file (items.py), which defines the fields used to store the scraped data:
import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # job title
    positionname = scrapy.Field()
    # detail page link
    positionlink = scrapy.Field()
    # job category
    positionType = scrapy.Field()
    # number of openings
    peopleNum = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # publish date
    publishTime = scrapy.Field()
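As a quick sanity check, a scrapy.Item behaves like a dict: fields are assigned and read by key, and dict(item) gives a plain dictionary. A minimal sketch using the fields defined above:
item = TencentItem()
item['positionname'] = "test"
print(dict(item))  # {'positionname': 'test'}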
3. Generate the spider in the spiders directory with: scrapy genspider tencent "tencent.com" (tencent is the spider name, "tencent.com" is the allowed domain):
# -*- coding: utf-8 -*-
import scrapy

from tencent.items import TencentItem


class TencentpositionSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    # start_urls = ['http://tencent.com/']

    url = "https://hr.tencent.com/position.php?&start="
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # initialise an item object for this row
            item = TencentItem()
            # job title
            item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
            # detail page link
            item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
            # job category (some rows leave this cell empty)
            try:
                item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
            except IndexError:
                item['positionType'] = ""
            # number of openings
            item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
            # work location
            item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
            # publish date
            item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]
            # hand the item over to the pipeline
            yield item

        if self.offset < 3000:
            # after each page is processed, request the next page:
            # offset grows by 10, is appended to the base url, and the new
            # response is handled by self.parse again
            self.offset += 10
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
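The hard-coded offset < 3000 cap is arbitrary. A common alternative is to stop paging as soon as a response contains no job rows; a minimal sketch of the same parse method with that change (assuming the row selector above still matches the page):
    def parse(self, response):
        rows = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        if not rows:
            # no job rows on this page: we have run past the last page, stop
            return
        for each in rows:
            ...  # build and yield the item exactly as above
        # request the next page; the empty-page check above ends the recursion
        self.offset += 10
        yield scrapy.Request(self.url + str(self.offset), callback=self.parse)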
4. Edit settings.py and add a default request header:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
It is also best to throttle requests a little: DOWNLOAD_DELAY = 3
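One more settings.py detail worth checking: projects generated by recent Scrapy versions ship with ROBOTSTXT_OBEY = True, and if the site's robots.txt disallows these pages the requests will be dropped, so you may need to turn it off:
ROBOTSTXT_OBEY = False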
5. Write the pipeline file (pipelines.py), which processes and saves the items:
import json


class TencentPipeline(object):
    def __init__(self):
        # runs once when the pipeline is instantiated
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # json.dumps escapes non-ASCII characters by default;
        # ensure_ascii=False keeps the Chinese text readable in the output
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
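For this pipeline to actually run, it must be enabled in settings.py. Assuming the project and class names above, the entry looks like this (300 is just an ordering priority):
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}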
6. Finally, run the spider: scrapy crawl tencent
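As an alternative to the custom pipeline, Scrapy's built-in feed export can dump the items directly from the command line (note it writes a single JSON array rather than one object per line):
scrapy crawl tencent -o tencent.json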
Crawling takes a while, so give it some time to finish.
The results look something like this:
"positionname": "25928-腾讯游戏云前端开发高级工程师(深圳)", "publishTime": "2018-08-10", "positionlink": "position_detail.php?id=43288&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "深圳"}
{"positionname": "25664-智慧城市解决方案销售顾问(深圳)", "publishTime": "2018-08-10", "positionlink": "position_detail.php?id=43282&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "深圳"}
{"positionname": "TEG02-海外网络运营工程师(深圳)", "publishTime": "2018-08-10", "positionlink": "position_detail.php?id=43284&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "深圳"}
{"positionname": "SNG03-直播用户运营产品经理(深圳)", "publishTime": "2018-08-10", "positionlink": "position_detail.php?id=43286&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "