
Big Data - Hadoop MapReduce (2): WordCount

2017-10-29 12:39



In one sentence: iterate over a large collection of text files and count how many times each word appears.

- Further reading: TF-IDF (term frequency - inverse document frequency)
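For reference, TF-IDF is usually defined as tfidf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the number of times term t appears in document d, df(t) is the number of documents containing t, and N is the total number of documents; the per-document term counts tf are exactly what a WordCount-style job produces.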


WordCount: counting words

Given the following input text:
a b c
b b c
c d c
the expected output is:
a 1
b 3
c 4
d 1
How it works (the original figure is omitted here; the steps below walk through it):



1) Map: for each line, emit every word with a count of 1, i.e. <word, 1>

// Input arrives one line at a time: the LongWritable key is the byte offset of the line,
// and the Text value is the line itself, e.g. "b c d e e e e"
public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String lineStr = value.toString(); // the current line of text
        // split on whitespace (StringTokenizer's default delimiters)
        StringTokenizer words = new StringTokenizer(lineStr);
        while (words.hasMoreTokens()) {
            String word = words.nextToken(); // the next word
            // emit <word, 1> for every occurrence
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
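To see just the splitting step in isolation, here is a minimal standalone sketch (TokenizerDemo is a made-up class name, not part of the Hadoop job) that prints the same <word, 1> pairs the mapper would emit for a sample line:

import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        // a hypothetical sample line, mirroring the mapper's per-line input
        StringTokenizer words = new StringTokenizer("b c d e e e e");
        while (words.hasMoreTokens()) {
            // prints "b 1", "c 1", "d 1", and "e 1" four times, one pair per token
            System.out.println(words.nextToken() + "\t1");
        }
    }
}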


2) Shuffle: group the mapper output by word, producing <word, [1, 1, ...]>
3) Reduce: sum the counts for each word, e.g. count = 1 + 1 + ...

// input:  <e, [1, 1, 1, 1]>
// output: <e, 4>
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable intWritable : values) {
            // accumulate all the 1s emitted for this word
            count += intWritable.get();
        }
        // emit <word, total count>
        context.write(key, new IntWritable(count));
    }
}
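As a rough local analogy for what Shuffle plus Reduce compute (a sketch only; in the real job the grouping and summing happen inside the Hadoop framework, and LocalWordCount is a made-up name), the sample text from above can be counted with a plain HashMap:

import java.util.HashMap;
import java.util.Map;

public class LocalWordCount {
    public static void main(String[] args) {
        String[] lines = { "a b c", "b b c", "c d c" }; // the sample text from above
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                // merge plays the role of shuffle (group by word) + reduce (sum the 1s)
                counts.merge(word, 1, Integer::sum);
            }
        }
        // prints a 1, b 3, c 4, d 1 (a HashMap's iteration order is unspecified)
        counts.forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}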

4) Driver: configure and run the job
// Imports for the complete WordCount.java; TokenizerMapper and IntSumReducer above
// are nested inside this class.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        String inputPath = "input/wordcount";
        String outputPath = "output/wordcount";
        // String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
        String[] otherArgs = new String[] { inputPath, outputPath }; /* set the paths directly */

        // delete the output directory if it already exists
        Path outputDir = new Path(outputPath);
        outputDir.getFileSystem(conf).delete(outputDir, true);

        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // job.setNumReduceTasks(1); // number of reduce tasks, which controls the number of output files

        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
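Assuming the class is packaged into a jar (the jar name below is just an example), the job can be submitted with the standard hadoop jar command, e.g. hadoop jar wordcount.jar WordCount input/wordcount output/wordcount; when it completes, the word counts are written to part-r-00000 files (one per reduce task) under output/wordcount.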


Please credit the source when reposting. Thanks!
Tags: mapreduce, hadoop, big data