
Learning Hadoop 2.5.2, Part 10: MapReduce, Finding the Three Highest Temperatures of Each Month (01)

2017-02-08 23:18

1. Hadoop InputFormat

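The previous article contained the line

job.setInputFormatClass(KeyValueTextInputFormat.class);

which sets the job's input format. This class also allows the key/value separator to be configured.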
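As a rough illustration of what the separator controls, the plain-Java sketch below mimics how a line can be split into a key and a value at the first occurrence of the separator. It does not use the Hadoop API; the class name, method name and sample line are hypothetical. In Hadoop 2.x the separator is, as far as I know, read from the configuration property mapreduce.input.keyvaluelinerecordreader.key.value.separator and defaults to a tab.

```java
// Hypothetical stand-alone sketch; not part of the Hadoop API.
class KeyValueLineSketch {

    // Splits a line at the FIRST occurrence of the separator.
    // Returns {key, value}. If the separator is absent, the whole line
    // becomes the key and the value is the empty string.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        // Made-up sample line: key and value separated by a tab.
        String[] kv = split("key1\tvalue1", '\t');
        System.out.println("key=" + kv[0] + ", value=" + kv[1]);
    }
}
```

Only the first separator counts in this sketch, so a value may itself contain the separator character.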

2. Splits and records

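Data uploaded to HDFS is stored as blocks. In MapReduce, the source data is divided into splits, and each split is processed by one map task. Each split is in turn cut, according to the specified input format, into a number of key-value pairs (records) that serve as the input to map. The map task loops over these records and processes them one by one.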

Split and record are both logical concepts.
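To make the "logical" nature of a split concrete, here is a simplified sketch that divides a file of a given length into (offset, length) ranges. No data is copied; each range only describes which bytes a map task would read. The class name, method name and sizes are assumptions for illustration, and Hadoop's own FileInputFormat applies additional rules that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Hadoop API.
class LogicalSplitSketch {

    // Returns {offset, length} pairs that together cover [0, fileLength).
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileLength) {
            // The last split may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] { offset, length });
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed sizes: a 300 MB file and a 128 MB split size.
        long mb = 1024L * 1024L;
        for (long[] s : computeSplits(300 * mb, 128 * mb)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```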

First, let's look at InputSplit.

InputSplit is an abstract class. An instance is called a split and represents the input data for one mapper.

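An InputSplit carries a length in bytes and a set of storage locations. A split does not contain the data itself; it is a reference to the data. The MapReduce system uses the storage locations to place map tasks as close to the split's data as possible, and uses the split sizes to order the splits so that the largest are processed first, which minimizes the job's running time.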

Methods of InputSplit:


Return type	Method	Description
abstract long	getLength()	Get the size of the split, so that the input splits can be sorted by size.
SplitLocationInfo[]	getLocationInfo()	Gets info about which nodes the input split is stored on and how it is stored at each location
abstract String[]	getLocations()	Get the list of nodes by name where the data for the split would be local.
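The following sketch shows how a length and a list of locations are enough to order splits largest-first without touching the data. SplitRefSketch is a hypothetical stand-in that only borrows the method names getLength() and getLocations() from the table above; it is not Hadoop's InputSplit, and the node names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a split: a reference to data, not the data itself.
class SplitRefSketch {
    private final long length;        // size of the split in bytes
    private final String[] locations; // names of nodes holding the data

    SplitRefSketch(long length, String... locations) {
        this.length = length;
        this.locations = locations;
    }

    long getLength() { return length; }

    String[] getLocations() { return locations; }

    // Returns a new list ordered so that the largest split comes first.
    static List<SplitRefSketch> largestFirst(List<SplitRefSketch> splits) {
        List<SplitRefSketch> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparingLong(SplitRefSketch::getLength).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<SplitRefSketch> splits = Arrays.asList(
                new SplitRefSketch(64, "node1", "node2"),
                new SplitRefSketch(128, "node2", "node3"),
                new SplitRefSketch(32, "node1"));
        for (SplitRefSketch s : largestFirst(splits)) {
            System.out.println(s.getLength() + " " + Arrays.toString(s.getLocations()));
        }
    }
}
```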

InputFormat

InputFormat is responsible for creating the InputSplits and for dividing them into key-value pairs (records).
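These two responsibilities can be modelled in a few lines of plain Java. In Hadoop's org.apache.hadoop.mapreduce.InputFormat they correspond to the methods getSplits and createRecordReader; the sketch below is not that API. It is a simplified, hypothetical model in which the input is an in-memory list of lines, a split is a range of line indexes, and a record is a (line index, line text) pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not the Hadoop API: the input is a list of lines in memory.
class LineInputFormatSketch {

    // Role 1: describe the input as splits, here {fromLine, toLineExclusive}.
    static List<int[]> getSplits(List<String> lines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<int[]> splits = new ArrayList<>();
        for (int from = 0; from < lines.size(); from += linesPerSplit) {
            int to = Math.min(from + linesPerSplit, lines.size());
            splits.add(new int[] { from, to });
        }
        return splits;
    }

    // Role 2: turn one split into records, here {lineIndex, lineText}.
    static List<String[]> readRecords(List<String> lines, int[] split) {
        List<String[]> records = new ArrayList<>();
        for (int i = split[0]; i < split[1]; i++) {
            records.add(new String[] { String.valueOf(i), lines.get(i) });
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        for (int[] split : getSplits(lines, 2)) {
            // Each outer iteration stands in for one map task.
            for (String[] record : readRecords(lines, split)) {
                System.out.println(record[0] + " -> " + record[1]);
            }
        }
    }
}
```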
