Getting Started with Apache Hadoop MapReduce: The WordCount Example
2019-03-13 20:38
I. Introduction to MapReduce
MapReduce is one of Hadoop's three core components (divided by function) and provides the distributed-computing layer of the big data platform. Although it is fairly heavyweight and suited only to offline (batch) processing, understanding it is a great help when studying the principles and architecture of frameworks such as Spark.
II. Writing the WordCount Example
For convenience, this example is run and tested locally on Windows 10.
1. Preparation
1) Data preparation
Extract wordCountdemo.rar into a folder; in this example it is extracted to D:\mktest.
2) Jar dependencies (Maven configuration)
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>2.7.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-yarn</artifactId>
        <version>2.7.6</version>
    </dependency>
</dependencies>
That is: hadoop-common, hadoop-hdfs, hadoop-mapreduce-client-core, hadoop-mapreduce-client-common, and hadoop-yarn (when using Maven, the jars they depend on are downloaded automatically).
3) Add a log4j.properties logging configuration file (under src)
### set log levels ###
log4j.rootLogger=info, stdout
### output to the console ###
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d{dd/MM/yy HH:mm:ss:SSS z}] %t %5p %c{2}: %m%n
2. WordCount Implementation
Structure: a custom Mapper, a custom Reducer, and a Driver.
1) Custom Mapper class
Create a WordCountMapper class that extends Mapper.
The words in each of the five demo files are separated by tab characters, so the mapper splits each line on tabs.
package com.mycat.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused as the key of each map output record
    Text sendKey = new Text();
    // Reused as the value of each map output record
    IntWritable sendValue = new IntWritable();

    /**
     * map() is called once per input line, so the logic below operates on a single line.
     *
     * Parameter 1 (key): the byte offset of the line within the file.
     * Parameter 2 (value): the content of the line.
     * Parameter 3 (context): the context object; it receives input from the
     * framework above and passes output on to the next stage below.
     *
     * LongWritable and Text are serializable types (corresponding to Java's
     * long and String). Serializable types are required because distributed
     * computation ships data across the network. Why not Java's default
     * Serializable interface? It offers strong compatibility, but poor
     * serialization/deserialization performance, so Hadoop uses its own
     * Writable interface for object serialization instead.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words on tab characters
        String[] lines = value.toString().split("\t");
        // Tag each word with a count of 1, then pass it on via the context
        for (String word : lines) {
            sendKey.set(word);
            sendValue.set(1);
            context.write(sendKey, sendValue);
        }
    }
}
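The split-and-emit logic of the mapper can be exercised without a Hadoop cluster. Below is a minimal plain-Java sketch of the same per-line logic (the class and method names, and the sample line, are made up for illustration; they are not part of the Hadoop API):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class MapPhaseSketch {
    // Mirrors WordCountMapper.map(): split one line on tabs, emit (word, 1)
    static List<Entry<String, Integer>> mapLine(String line) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\t")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // A made-up input line; the real input files are tab-separated too
        for (Entry<String, Integer> kv : mapLine("hello\tworld\thello")) {
            System.out.println(kv.getKey() + "\t" + kv.getValue());
        }
        // prints: hello 1, world 1, hello 1 (tab-separated, one pair per line)
    }
}
```

Note that duplicates are emitted as-is here ("hello" appears twice); collapsing them into a count is deliberately left to the reduce side, just as in the real job.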
2) Custom Reducer class
Create a WordCountReducer class that extends Reducer.
Note: the Text type used here lives in the package
org.apache.hadoop.io.Text
package com.mycat.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * reduce() is called once per group (grouping is by key).
 *
 * Parameter 1 (key): must have the same type as the mapper's output key,
 * and its value corresponds to the mapper's key.
 * Parameter 2 (values): an iterator over the mapper output values after
 * they have been sorted and grouped by key.
 * Parameter 3 (context): the context object; it hands the result on to the
 * output stage below.
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * @param key     e.g. "hello"
     * @param values  e.g. 1, 1, 1, 1, 1
     * @param context the context object
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Accumulates the values of one group
        int sum = 0;
        // Iterate over the values and sum them
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
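The shuffle's sort-and-group step plus the reducer's summation can likewise be sketched in plain Java. Here a TreeMap stands in for the framework's sort-by-key behavior, and merge() plays the role of summing the 1s in each group (class and method names are made up for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReducePhaseSketch {
    // Mirrors the shuffle + WordCountReducer.reduce(): group the emitted
    // words by key (TreeMap keeps them sorted, like the framework does)
    // and sum one 1 per occurrence.
    static Map<String, Integer> reduceAll(List<String> emittedWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : emittedWords) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The map phase emitted ("hello",1), ("world",1), ("hello",1)
        System.out.println(reduceAll(List.of("hello", "world", "hello")));
        // prints: {hello=2, world=1}
    }
}
```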
3) Create the Driver class
package com.mycat.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class Driver {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        // Create the job object from a Configuration
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // The entry-point class once the job is packaged as a jar
        job.setJarByClass(Driver.class);

        // Register the Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Output key/value types of the custom mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Output key/value types of the custom reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("D:\\mktest\\wordCountdemo"));

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("D:\\mktest\\wordcount");
        // The output directory must not already exist, or the job fails.
        // For easy re-testing, delete it recursively here if it exists.
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        // Submit the job and wait for it to complete
        job.waitForCompletion(true);
    }
}
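The exists-then-delete guard at the end of the driver is the same idea as recursively deleting a directory on an ordinary filesystem. A plain java.nio analogy, runnable without Hadoop (OutputDirGuard and deleteIfExists are made-up names for illustration; a temp directory stands in for D:\mktest\wordcount):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class OutputDirGuard {
    // Same idea as fs.exists(path) followed by fs.delete(path, true):
    // remove the output directory recursively if it already exists.
    static void deleteIfExists(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return; // nothing to do, like the driver's fs.exists() check
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Delete children before their parents (reverse depth order)
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wordcount-out");
        Files.writeString(dir.resolve("part-r-00000"), "hello\t2\n");
        deleteIfExists(dir);
        System.out.println(Files.exists(dir)); // prints: false
    }
}
```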
3. Results
1) Console output
[13/03/19 20:14:30:415 CST] main INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local965530918_0001
.................................................
.................................................
[13/03/19 20:14:31:562 CST] main INFO mapreduce.Job: map 100% reduce 100%
[13/03/19 20:14:31:563 CST] main INFO mapreduce.Job: Job job_local965530918_0001 completed successfully
[13/03/19 20:14:31:573 CST] main INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=14818
		FILE: Number of bytes written=1758716
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=33
		Map output records=77
		Map output bytes=723
		Map output materialized bytes=907
		Input split bytes=500
		Combine input records=0
		Combine output records=0
		Reduce input groups=13
		Reduce shuffle bytes=907
		Reduce input records=77
		Reduce output records=13
		Spilled Records=154
		Shuffled Maps =5
		Failed Shuffles=0
		Merged Map outputs=5
		GC time elapsed (ms)=0
		Total committed heap usage (bytes)=3019898880
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=438
	File Output Format Counters
		Bytes Written=117
2) Output folder
# Directory of D:\mktest\wordcount
2019/03/13  20:14    12 .part-r-00000.crc
2019/03/13  20:14     8 ._SUCCESS.crc
2019/03/13  20:14   105 part-r-00000
2019/03/13  20:14     0 _SUCCESS
3) Output files explained
.part-r-00000.crc: checksum file for the result file
._SUCCESS.crc: checksum file for the success marker file
part-r-00000: the output result file (there is only one because there is only one reduce task by default)
_SUCCESS: success marker file
4) Viewing the result file (the odd number format on the first line is a value that was accidentally saved wrong while preparing the test data, but the test works all the same)
Output format: word <tab> occurrence count
00:0c:29:16:90	1
fer	4
fhieu	4
fjeir	4
fjir	4
fre	8
hdf	8
hdfs	4
hds	4
hello	16
hfureh	4
word	4
world	12
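Since part-r-00000 is plain tab-separated text (word, then count, one pair per line), it can be read back with a few lines of standard Java. A small sketch, with made-up class/method names and a made-up sample string standing in for the file contents:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResultFileParser {
    // Parses the "word<TAB>count" lines of a part-r-00000 file,
    // preserving the file's line order.
    static Map<String, Integer> parse(String content) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : content.split("\n")) {
            if (line.isBlank()) {
                continue; // skip empty lines defensively
            }
            String[] parts = line.split("\t");
            counts.put(parts[0], Integer.parseInt(parts[1]));
        }
        return counts;
    }

    public static void main(String[] args) {
        // In practice the string would come from reading the result file,
        // e.g. Files.readString(Path.of("D:/mktest/wordcount/part-r-00000"))
        String sample = "hello\t16\nworld\t12\nhdfs\t4\n";
        System.out.println(parse(sample));
        // prints: {hello=16, world=12, hdfs=4}
    }
}
```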