Running Your First MapReduce Program on CentOS

2014-11-26 14:14

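Before following the steps in this article, set up a Hadoop environment. For experimentation, a single-node deployment is sufficient; for details, see: Creating a Hadoop 1.2.1 Single-Node Environment on CentOS 6.5

Writing the Source Code

The goal is a program that parses weather data and finds the highest temperature recorded in each year of the data files. The project is built with Maven. Only the Mapper, Reducer, and Main code are shown below. For the complete project, see
https://github.com/Eric-aihua/practise/tree/master/hadoop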

Mapper

package com.eric.hadoop.map;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    // Sentinel value used in the data for a missing temperature reading.
    private static final int MISSING = 9999;

    public void map(LongWritable fileOffset, Text lineRecord,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        System.out.println("##Processing Record:" + lineRecord.toString());
        String line = lineRecord.toString();
        // The year occupies columns 15-18 of the fixed-width record.
        String year = line.substring(15, 19);
        int temperature;
        // Column 87 holds the sign; skip a leading '+' before parsing.
        if (line.charAt(87) == '+') {
            temperature = Integer.parseInt(line.substring(88, 92));
        } else {
            temperature = Integer.parseInt(line.substring(87, 92));
        }
        // Emit only readings that are present and pass the quality check.
        String quality = line.substring(92, 93);
        if (temperature != MISSING && quality.matches("[01459]")) {
            output.collect(new Text(year), new IntWritable(temperature));
        }
    }

}
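The parsing logic in the mapper can be tried without a Hadoop cluster. The following is a minimal sketch that copies the same column offsets into plain Java. The class name RecordParseSketch and the helper syntheticRecord are hypothetical, and the record it builds is synthetic filler rather than real NCDC data.

```java
import java.util.Arrays;

/** Standalone sketch of the fixed-width parsing done in MaxTemperatureMapper. */
final class RecordParseSketch {

    private static final int MISSING = 9999;

    /** The year sits in columns 15-18 (the substring end index is exclusive). */
    static String year(String line) {
        return line.substring(15, 19);
    }

    /** Returns the raw temperature, or null if it is missing or fails the quality check. */
    static Integer validTemperature(String line) {
        // Column 87 holds the sign; skip a leading '+' as the mapper does.
        int start = (line.charAt(87) == '+') ? 88 : 87;
        int temperature = Integer.parseInt(line.substring(start, 92));
        String quality = line.substring(92, 93);
        if (temperature != MISSING && quality.matches("[01459]")) {
            return temperature;
        }
        return null;
    }

    /** Hypothetical helper: builds a 93-character record whose other columns are filler. */
    static String syntheticRecord(String year, String signedTemp, char quality) {
        char[] chars = new char[93];
        Arrays.fill(chars, '0');
        year.getChars(0, 4, chars, 15);       // columns 15-18
        signedTemp.getChars(0, 5, chars, 87); // columns 87-91: sign plus four digits
        chars[92] = quality;                  // column 92: quality code
        return new String(chars);
    }

    public static void main(String[] args) {
        String line = syntheticRecord("1902", "-0011", '1');
        System.out.println(year(line) + " -> " + validTemperature(line));
    }
}
```

A record shorter than 93 characters would make substring throw, so a production mapper would probably want a length check before parsing.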

Reducer

package com.eric.hadoop.reduce;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text year, Iterator<IntWritable> temperatures,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int maxTemperature = Integer.MIN_VALUE;
        // Log the key being reduced; printing the iterator would only show an object reference.
        System.out.println("##Processing temperatures for year:" + year);
        // Keep the largest value seen for this year.
        while (temperatures.hasNext()) {
            maxTemperature = Math.max(maxTemperature, temperatures.next().get());
        }
        output.collect(year, new IntWritable(maxTemperature));
    }

}
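To see what the framework does between the mapper and the reducer, the grouping step can be imitated in memory. This is only a sketch under simplifying assumptions: a TreeMap stands in for the shuffle, the class name GroupAndReduceSketch is hypothetical, and the (year, temperature) pairs are made-up values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of "group by key, then reduce each group to its maximum". */
final class GroupAndReduceSketch {

    /** Same loop as MaxTemperatureReduce.reduce, without the Hadoop types. */
    static int max(Iterator<Integer> temperatures) {
        int maxTemperature = Integer.MIN_VALUE;
        while (temperatures.hasNext()) {
            maxTemperature = Math.max(maxTemperature, temperatures.next());
        }
        return maxTemperature;
    }

    /** Groups (year, temperature) pairs by year, then reduces every group. */
    static Map<String, Integer> maxByYear(List<String[]> pairs) {
        // Stand-in for the shuffle: collect all values that share a key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }
        // Stand-in for the reduce phase: one call per key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), max(entry.getValue().iterator()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up mapper output, not taken from the real data set.
        List<String[]> pairs = Arrays.asList(
                new String[] {"1901", "-5"},
                new String[] {"1902", "12"},
                new String[] {"1901", "7"},
                new String[] {"1902", "-3"});
        System.out.println(maxByYear(pairs));
    }
}
```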

Main

package com.eric.hadoop.jobconfig;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

import com.eric.hadoop.map.MaxTemperatureMapper;
import com.eric.hadoop.reduce.MaxTemperatureReduce;

public class MaxTemperature {
    public static void main(String[] args) throws IOException {
        // Validate the arguments before configuring the job.
        if (args.length != 2) {
            System.err.println("Must contain 2 params: inputPath outputPath");
            // Exit with a non-zero status so callers can detect the failure.
            System.exit(1);
        }

        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Get Max Temperature!");

        FileInputFormat.addInputPaths(conf, args[0]);
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReduce.class);

        // Key and value types emitted by the reducer.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
    }
}
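The driver above sets no combiner, which is consistent with the Combine input records=0 counter in the job output shown later. Because the maximum of partial maxima equals the overall maximum, the reducer class could presumably be reused as a combiner by calling conf.setCombinerClass(MaxTemperatureReduce.class); that is a suggestion, not something the original project does. The plain-Java sketch below only illustrates the property that makes this safe. The class name CombinerSketch and the values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Shows that taking the max of per-task maxima gives the same answer as a single pass. */
final class CombinerSketch {

    static int max(List<Integer> values) {
        int result = Integer.MIN_VALUE;
        for (int value : values) {
            result = Math.max(result, value);
        }
        return result;
    }

    /** Result when the reducer receives every value from both map tasks. */
    static int withoutCombiner(List<Integer> task1, List<Integer> task2) {
        return Math.max(max(task1), max(task2)) == Integer.MIN_VALUE
                ? Integer.MIN_VALUE
                : max(concat(task1, task2));
    }

    /** Result when each map task forwards only its local maximum. */
    static int withCombiner(List<Integer> task1, List<Integer> task2) {
        return max(Arrays.asList(max(task1), max(task2)));
    }

    private static List<Integer> concat(List<Integer> a, List<Integer> b) {
        Integer[] all = new Integer[a.size() + b.size()];
        for (int i = 0; i < a.size(); i++) all[i] = a.get(i);
        for (int i = 0; i < b.size(); i++) all[a.size() + i] = b.get(i);
        return Arrays.asList(all);
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two map tasks for the same year.
        List<Integer> task1 = Arrays.asList(-11, 28, 5);
        List<Integer> task2 = Arrays.asList(17, -3);
        System.out.println(withoutCombiner(task1, task2) == withCombiner(task1, task2));
    }
}
```

The same shortcut would not be valid for a function such as the mean, where combining partial results changes the answer.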

Building the JAR File

Change to the project directory and run:

mvn install

On success, this produces a JAR file named hadoop-0.0.1-SNAPSHOT.jar.

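Obtaining Test Data

You can use the data from the GitHub repository above, or download it from: https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all

Assume the downloaded data file is named 1902 and is stored in the testdata directory on HDFS. Do not create the output directory beforehand: the job refuses to start if its output path already exists.

hadoop dfs -mkdir testdata
hadoop dfs -put 1902 testdata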

Running the Job

hadoop jar hadoop-0.0.1-SNAPSHOT.jar testdata/1902 output

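Observing the Results

Monitoring through the web console (in Hadoop 1.x, the JobTracker web UI is served on port 50030 by default).

Monitoring through the command-line output: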

[hadoop@localhost ~]$ hadoop jar hadoop-0.0.1-SNAPSHOT.jar testdata/1902 output
Warning: $HADOOP_HOME is deprecated.
14/11/26 13:33:39 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/11/26 13:33:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/11/26 13:33:39 WARN snappy.LoadSnappy: Snappy native library not loaded
14/11/26 13:33:39 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/26 13:33:40 INFO mapred.JobClient: Running job: job_201411261331_0002 # job ID
14/11/26 13:33:41 INFO mapred.JobClient: map 0% reduce 0%
14/11/26 13:33:47 INFO mapred.JobClient: map 100% reduce 0% # map progress
14/11/26 13:33:54 INFO mapred.JobClient: map 100% reduce 33%
14/11/26 13:33:56 INFO mapred.JobClient: map 100% reduce 100% # reduce progress
14/11/26 13:33:57 INFO mapred.JobClient: Job complete: job_201411261331_0002
14/11/26 13:33:57 INFO mapred.JobClient: Counters: 30
14/11/26 13:33:57 INFO mapred.JobClient: Job Counters
14/11/26 13:33:57 INFO mapred.JobClient: Launched reduce tasks=1
14/11/26 13:33:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=7744
14/11/26 13:33:57 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/11/26 13:33:57 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/11/26 13:33:57 INFO mapred.JobClient: Launched map tasks=2
14/11/26 13:33:57 INFO mapred.JobClient: Data-local map tasks=2
14/11/26 13:33:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=9008
14/11/26 13:33:57 INFO mapred.JobClient: File Input Format Counters
14/11/26 13:33:57 INFO mapred.JobClient: Bytes Read=890953
14/11/26 13:33:57 INFO mapred.JobClient: File Output Format Counters
14/11/26 13:33:57 INFO mapred.JobClient: Bytes Written=9
14/11/26 13:33:57 INFO mapred.JobClient: FileSystemCounters
14/11/26 13:33:57 INFO mapred.JobClient: FILE_BYTES_READ=72221
14/11/26 13:33:57 INFO mapred.JobClient: HDFS_BYTES_READ=891143
14/11/26 13:33:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=309368
14/11/26 13:33:57 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=9
14/11/26 13:33:57 INFO mapred.JobClient: Map-Reduce Framework
14/11/26 13:33:57 INFO mapred.JobClient: Map output materialized bytes=72227
14/11/26 13:33:57 INFO mapred.JobClient: Map input records=6565 # mapper input records
14/11/26 13:33:57 INFO mapred.JobClient: Reduce shuffle bytes=72227
14/11/26 13:33:57 INFO mapred.JobClient: Spilled Records=13130
14/11/26 13:33:57 INFO mapred.JobClient: Map output bytes=59085
14/11/26 13:33:57 INFO mapred.JobClient: Total committed heap usage (bytes)=478543872
14/11/26 13:33:57 INFO mapred.JobClient: CPU time spent (ms)=4400 # CPU time
14/11/26 13:33:57 INFO mapred.JobClient: Map input bytes=888978
14/11/26 13:33:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=190
14/11/26 13:33:57 INFO mapred.JobClient: Combine input records=0
14/11/26 13:33:57 INFO mapred.JobClient: Reduce input records=6565 # reducer input records
14/11/26 13:33:57 INFO mapred.JobClient: Reduce input groups=1
14/11/26 13:33:57 INFO mapred.JobClient: Combine output records=0
14/11/26 13:33:57 INFO mapred.JobClient: Physical memory (bytes) snapshot=501690368
14/11/26 13:33:57 INFO mapred.JobClient: Reduce output records=1 # reducer output records
14/11/26 13:33:57 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2167922688
14/11/26 13:33:57 INFO mapred.JobClient: Map output records=6565 # mapper output records

Checking the Output
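With the old mapred API and a single reduce task, the result is normally written to a file named part-00000 under the output directory, one key and value per line separated by a tab, so hadoop dfs -cat output/part-00000 should display it. NCDC temperature readings are stored in tenths of a degree Celsius, so the value needs dividing by 10 to be read as degrees. The sketch below shows that conversion on a made-up line; the class name OutputLineSketch and the sample values are hypothetical and are not the actual result of this job.

```java
/** Interprets one "year<TAB>temperature" line as written by the job's text output. */
final class OutputLineSketch {

    /** Returns the key part of the line. */
    static String year(String outputLine) {
        return outputLine.split("\t")[0];
    }

    /** Converts the raw reading (assumed to be tenths of a degree Celsius) to degrees. */
    static double toCelsius(String outputLine) {
        String[] fields = outputLine.split("\t");
        return Integer.parseInt(fields[1].trim()) / 10.0;
    }

    public static void main(String[] args) {
        // Made-up sample line, not the real output for 1902.
        String sample = "1950\t22";
        System.out.println(year(sample) + ": " + toCelsius(sample) + " degrees Celsius");
    }
}
```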

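Troubleshooting

Problem: the map phase completes normally, but the reduce phase hangs at 0% and makes no progress even after a long wait.

Log error: 2011-10-03 09:46:13,349 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task attempt_201110022127_0003_m_000000_0

Solution: make the hostname in /etc/hosts match the HOSTNAME value in /etc/sysconfig/network, then reboot the system after editing the files.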
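The consistency check between the two files can be expressed as a small helper. This is a rough sketch that works on the text of the files rather than reading them from disk. The class name HostnameCheckSketch, the host name hadoop-node1, and the IP address are hypothetical, and quoted HOSTNAME values are not handled.

```java
/** Checks that the HOSTNAME from /etc/sysconfig/network also appears in /etc/hosts. */
final class HostnameCheckSketch {

    /** Extracts the HOSTNAME value from the text of /etc/sysconfig/network, or null. */
    static String configuredHostname(String networkText) {
        for (String line : networkText.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("HOSTNAME=")) {
                return trimmed.substring("HOSTNAME=".length()).trim();
            }
        }
        return null;
    }

    /** True if any non-comment line of the /etc/hosts text lists the hostname. */
    static boolean hostsMentions(String hostsText, String hostname) {
        for (String line : hostsText.split("\n")) {
            // Drop everything after '#', which starts a comment.
            String content = line.split("#", 2)[0].trim();
            if (content.isEmpty()) {
                continue;
            }
            String[] fields = content.split("\\s+");
            // fields[0] is the IP address; the rest are names and aliases.
            for (int i = 1; i < fields.length; i++) {
                if (fields[i].equals(hostname)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical file contents for illustration.
        String network = "NETWORKING=yes\nHOSTNAME=hadoop-node1\n";
        String hosts = "127.0.0.1 localhost\n192.168.1.10 hadoop-node1\n";
        String name = configuredHostname(network);
        System.out.println(name + " listed in hosts: " + hostsMentions(hosts, name));
    }
}
```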
