3.3 Spark RDD 键值转换操作3-groupByKey、reduceByKey、reduceByKeyLocally
2017-10-28 00:14
513 查看
1 groupByKey
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
该函数用于将RDD[K,V]中每个K对应的V值,合并到一个集合Iterable[V]中,
参数numPartitions用于指定分区数;
参数partitioner用于指定分区函数;
例子:
2 reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
该函数用于将RDD[K,V]中每个K对应的V值根据映射函数来运算。
参数numPartitions用于指定分区数;
参数partitioner用于指定分区函数;
例子:
3 reduceByKeyLocally
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
该函数将RDD[K,V]中每个K对应的V值根据映射函数来运算,运算结果映射到一个Map[K,V]中,而不是RDD[K,V]。
例子:
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
该函数用于将RDD[K,V]中每个K对应的V值,合并到一个集合Iterable[V]中,
参数numPartitions用于指定分区数;
参数partitioner用于指定分区函数;
例子:
scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1))) rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[89] at makeRDD at :21 scala> rdd1.groupByKey().collect res81: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(0, 2)), (B,CompactBuffer(2, 1)), (C,CompactBuffer(1)))
2 reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
该函数用于将RDD[K,V]中每个K对应的V值根据映射函数来运算。
参数numPartitions用于指定分区数;
参数partitioner用于指定分区函数;
例子:
scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1))) rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at :21 scala> rdd1.partitions.size res82: Int = 15 scala> var rdd2 = rdd1.reduceByKey((x,y) => x + y) rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[94] at reduceByKey at :23 scala> rdd2.collect res85: Array[(String, Int)] = Array((A,2), (B,3), (C,1)) scala> rdd2.partitions.size res86: Int = 15 scala> var rdd2 = rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2),(x,y) => x + y) rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[95] at reduceByKey at :23 scala> rdd2.collect res87: Array[(String, Int)] = Array((B,3), (A,2), (C,1)) scala> rdd2.partitions.size res88: Int = 2
3 reduceByKeyLocally
def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
该函数将RDD[K,V]中每个K对应的V值根据映射函数来运算,运算结果映射到一个Map[K,V]中,而不是RDD[K,V]。
例子:
scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1))) rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at :21 scala> rdd1.reduceByKeyLocally((x,y) => x + y) res90: scala.collection.Map[String,Int] = Map(B -> 3, A -> 2, C -> 1)
相关文章推荐
- RDD键值转换操作(3)–groupByKey、reduceByKey、reduceByKeyLocally
- Spark算子:RDD键值转换操作(3)–groupByKey、reduceByKey、reduceByKeyLocally
- Spark算子:RDD键值转换操作(3)–groupByKey、reduceByKey、reduceByKeyLocally
- Spark算子:RDD键值转换操作(3)–groupBy、keyBy、groupByKey、reduceByKey、reduceByKeyLocally
- Spark算子:RDD键值转换操作(3)–groupByKey、reduceByKey、reduceByKeyLocally
- 3.3 Spark RDD 键值转换操作2-combineByKey、foldByKey
- 3.3 Spark RDD 键值转换操作4-cogroup、join
- spark RDD算子(七)之键值对分组操作 groupByKey,cogroup
- 3.3 Spark RDD键值转换操作5-leftOuterJoin、rightOuterJoin、subtractByKey
- Spark算子:RDD键值转换操作(2)–combineByKey、foldByKey
- spark RDD算子(六)之键值对聚合操作reduceByKey,foldByKey,排序操作sortByKey
- Spark算子:RDD键值转换操作(5)–leftOuterJoin、rightOuterJoin、subtractByKey
- spark RDD算子(五)之键值对聚合操作 combineByKey
- RDD键值转换操作(2)–combineByKey、foldByKey
- Spark算子:RDD键值转换操作(5)–leftOuterJoin、rightOuterJoin、subtractByKey
- 深入理解groupByKey、reduceByKey区别——本质就是一个local machine的reduce操作
- RDD键值转换操作(5)–leftOuterJoin、rightOuterJoin、subtractByKey
- Spark算子:RDD键值转换操作(2)–combineByKey、foldByKey
- spark RDD算子(八)之键值对关联操作 subtractByKey, join, rightOuterJoin, leftOuterJoin
- Spark API编程动手实战-04-以在Spark 1.2版本实现对union、groupByKey、join、reduce、lookup等操作实践