MapReduce WordCount: sorting by value in descending order, lowercasing, and handling punctuation
2016-05-24 13:24
This post is a short record of a simple experiment I ran while supervising an undergraduate lab.
Experiment requirements:
Tokenize the input file on spaces, commas, periods, double quotes, and similar punctuation
Convert all uppercase letters in the input to lowercase
Sort the output by value in descending order
Hadoop's WordCount example code, explained
My understanding of MapReduce word count: the framework splits the input into several pieces, and each split is processed by its own map task (not a separate job). The map function processes its split line by line and emits intermediate key-value pairs; a shuffle step then groups those pairs by key, and the reduce step aggregates each group to produce the final output, which is sorted by key in ascending order.
| Input line | Map output | After shuffle | Reduce output | Output file |
| --- | --- | --- | --- | --- |
| Hello,Two | <Hello,1> <Two,1> | <Hello,1,1> | <Hello,2> | <Hello,2> |
| Hello one | <Hello,1> <one,1> | <Two,1> <one,1> | <Two,1> <one,1> | <Two,1> <one,1> |
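The trace in the table can be reproduced with a small plain-JDK sketch (no Hadoop required). The class and method names here are illustrative only; the `TreeMap` stands in for the shuffle's ascending key sort, and the `merge` call folds the reduce step's summation into the grouping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSim {
    // Map phase: emit a (token, 1) pair for every token on a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.split("[ ,]+")) {
            out.add(Map.entry(token, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] lines = {"Hello,Two", "Hello one"};
        // Shuffle: group values by key. A TreeMap keeps keys in ascending
        // order, mirroring MapReduce's default key sort.
        Map<String, Integer> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                // Reduce folded in: sum the 1s emitted for each key.
                grouped.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
        }
        System.out.println(grouped); // {Hello=2, Two=1, one=1}
    }
}
```

Note that "Hello" and "Two" sort before "one" because uppercase letters precede lowercase ones in ASCII order, which is exactly why the lowercasing requirement matters for sensible output.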
The example source ships with Hadoop (download: http://mirror.bit.edu.cn/apache/hadoop/common/); inside the distribution it is at share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-sources.jar, class org.apache.hadoop.examples.WordCount.
The WordCount example and the three changes
Tokenize on spaces, commas, periods, double quotes, etc. In the map function: StringTokenizer itr = new StringTokenizer(value.toString(), " ,.\":\t\n");
Convert all uppercase letters to lowercase. In the map function: word.set(itr.nextToken().toLowerCase());
Sort the output by value in descending order. Add a second MapReduce job whose map step swaps each key and value; since the framework sorts keys in ascending order by default, the sort comparator must also be overridden.
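The first two changes can be tried in isolation with plain JDK classes, since StringTokenizer is java.util, not Hadoop. This sketch (class name is mine, not from the example) uses the same delimiter string as the modified mapper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Same delimiter set as the mapper: space, comma, period,
    // double quote, colon, tab, and newline.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line, " ,.\":\t\n");
        while (itr.hasMoreTokens()) {
            // Lowercase each token before emitting, as the mapper does.
            tokens.add(itr.nextToken().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("He said: \"Hello, World.\""));
        // [he, said, hello, world]
    }
}
```

Because every delimiter character is stripped, "Hello," and "hello." both count toward the same word, which is the point of the first two requirements.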
The complete code:
```java
package org.apache.hadoop.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split on spaces, commas, periods, double quotes, colons, tabs, newlines.
      StringTokenizer itr = new StringTokenizer(value.toString(), " ,.\":\t\n");
      while (itr.hasMoreTokens()) {
        // Lowercase every token before emitting it.
        word.set(itr.nextToken().toLowerCase());
        context.write(word, one);
      }
    }
  }

  /* For reference, InverseMapper is implemented as:
  public class InverseMapper<K, V> extends Mapper<K, V, V, K> {
    // The inverse function. Input keys and values are swapped.
    @Override
    public void map(K key, V value, Context context)
        throws IOException, InterruptedException {
      context.write(value, key);
    }
  }
  */

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Sorts IntWritable keys in decreasing order by negating the default comparison.
  private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
    public int compare(WritableComparable a, WritableComparable b) {
      return -super.compare(a, b);
    }
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      return -super.compare(b1, s1, l1, b2, s2, l2);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Path tempDir = new Path("wordcount-temp-output");
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }

    // Job 1: the usual word count, writing (word, count) pairs to a temporary
    // SequenceFile so the second job can read them back with their types intact.
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job, tempDir);
    job.waitForCompletion(true);

    // Job 2: swap keys and values with InverseMapper, then sort the counts in
    // decreasing order with the custom comparator. A single reducer guarantees
    // one globally sorted output file.
    Job sortJob = new Job(conf, "sort");
    FileInputFormat.addInputPath(sortJob, tempDir);
    sortJob.setInputFormatClass(SequenceFileInputFormat.class);
    sortJob.setMapperClass(InverseMapper.class);
    sortJob.setNumReduceTasks(1);
    FileOutputFormat.setOutputPath(sortJob, new Path(otherArgs[otherArgs.length - 1]));
    sortJob.setOutputKeyClass(IntWritable.class);
    sortJob.setOutputValueClass(Text.class);
    sortJob.setSortComparatorClass(IntWritableDecreasingComparator.class);
    sortJob.waitForCompletion(true);

    // Clean up the intermediate directory (recursive delete).
    FileSystem.get(conf).delete(tempDir, true);
    System.exit(0);
  }
}
```
Further reading
Resources and experiment data for Hadoop: The Definitive Guide:
First three chapters on Shiyanlou: https://www.shiyanlou.com/courses/222
Code for the book: http://git.shiyanlou.com/shiyanlou/hadoop-book/src/master
Full weather dataset: ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/
Book website: http://hadoopbook.com/