
Counting the occurrences of each English word in a document with Spark

2016-12-15 10:19
Sample English document:

My father was a self-taught mandolin player. He was one of the best string instrument players in our town. He could not read music, but if he heard a tune a few times, he could play it. When he was younger, he was a member of a small country music b
A A A A A A A A A  A A A A A A A A
B B B B B B B B BB B B BB B B B B
C C C C C C C C C C C
D D D D D D D D D D D


Counting program: count how many times each word occurs in the document

/**
* Created by hbin on 2016/12/9.
*/
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.*;
import scala.Tuple2;

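/**
* The RDD (Resilient Distributed Dataset) is Spark's core data abstraction.
* An RDD is a distributed collection of elements. Every operation on data in Spark
* comes down to creating RDDs, transforming existing RDDs, or calling RDD actions
* to compute a result. Spark automatically distributes the data in an RDD across
* the cluster and parallelizes the operations performed on it.
*/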
public class BasicMap {
    public static void main(String[] args) throws Exception {
        // No master URL is set here, so it must be supplied at launch (for example via spark-submit --master).
        SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        JavaRDD<String> input = jsc.textFile("E:\\sparkProject\\log.txt");
        // flatMap: split each line into words.
        // This is the Spark 1.x signature; from Spark 2.0 on, call() returns Iterator<String>.
        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String s) throws Exception {
                return Arrays.asList(s.split(" "));
            }
        });

        // mapToPair: emit a (word, 1) pair for every word.
        JavaPairRDD<String, Integer> result = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<>(s, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() { // merge the values that share a key
            @Override
            public Integer call(Integer a, Integer b) throws Exception {
                return a + b; // same key: add the corresponding values
            }
        });

        // collect() is the action that triggers the computation and returns the pairs to the driver.
        System.out.println("result=" + result.collect());
        jsc.stop();
    }
}
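
To see what each stage of the program computes without starting Spark, the same flatMap, pair, and reduce-by-key steps can be imitated on an in-memory list with plain Java streams. This is only a rough local analogue, not Spark code: the names LocalWordCount and countWords are made up for this sketch, and a TreeMap is used just to make the printed order predictable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Spark-free analogue of the flatMap -> mapToPair -> reduceByKey pipeline.
// LocalWordCount and countWords are illustrative names, not part of the original program.
final class LocalWordCount {

    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // like flatMap: one line becomes many words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // like mapToPair + reduceByKey: key by word, start at 1, merge counts by addition
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("C C C", "D D", "C D");
        System.out.println(countWords(lines)); // prints {C=4, D=3}
    }
}
```

The merge function Integer::sum plays the role of the Function2 passed to reduceByKey: whenever two entries share a key, their values are added.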


Execution result:

result=[(tune,1), (play,1), (string,1), (younger,,1), (He,2), (not,1), (country,1), (few,1), (heard,1), (small,1), (players,1), (town.,1), (if,1), (it.,1), (B,14), (a,5), (BB,2), (was,4), (b,1), (one,1), (A,17), (When,1), (could,2), (our,1), (best,1), (,1),
(he,4), (in,1), (member,1), (music,,1), (self-taught,1), (of,2), (music,1), (father,1), (times,,1), (mandolin,1), (read,1), (C,11), (player.,1), (My,1), (but,1), (instrument,1), (D,11), (the,1)]
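
The result contains an empty-string key, `(,1)`, and keys with punctuation attached, such as `town.`, `it.` and `younger,`. Both follow from splitting on a single space: two consecutive spaces in the sample text produce an empty token, and punctuation stays glued to the neighbouring word. One possible cleanup, given here as a sketch under assumptions rather than as part of the original program, is to split on runs of whitespace and strip punctuation from both ends of each token. The class WordTokenizer and its method tokenize are hypothetical names; case is deliberately left unchanged, so `He` and `he` would still be counted separately.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original program.
final class WordTokenizer {

    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        // Split on runs of whitespace so that double spaces do not yield empty tokens.
        for (String raw : line.trim().split("\\s+")) {
            // Strip punctuation at both ends; inner characters such as the hyphen in "self-taught" are kept.
            String word = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("in our town. A  A self-taught"));
        // prints [in, our, town, A, A, self-taught]
    }
}
```

With the Spark 1.x signature used in the program, the body of the flatMap function could return `WordTokenizer.tokenize(s)` in place of `Arrays.asList(s.split(" "))`, since a List is an Iterable.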