
[Spark Java API] Action (5): treeAggregate and treeReduce


treeAggregate

Official documentation:

Aggregates the elements of this RDD in a multi-level tree pattern.


Function signatures:

def treeAggregate[U](
    zeroValue: U,
    seqOp: JFunction2[U, T, U],
    combOp: JFunction2[U, U, U],
    depth: Int): U

def treeAggregate[U](
    zeroValue: U,
    seqOp: JFunction2[U, T, U],
    combOp: JFunction2[U, U, U]): U


This can be understood as a more elaborate, multi-level version of aggregate.

Source code analysis:

def treeAggregate[U: ClassTag](zeroValue: U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U = withScope {
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
  if (partitions.length == 0) {
    Utils.clone(zeroValue, context.env.closureSerializer.newInstance())
  } else {
    val cleanSeqOp = context.clean(seqOp)
    val cleanCombOp = context.clean(combOp)
    val aggregatePartition =
      (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
    var partiallyAggregated = mapPartitions(it => Iterator(aggregatePartition(it)))
    var numPartitions = partiallyAggregated.partitions.length
    val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
    // If creating an extra level doesn't help reduce
    // the wall-clock time, we stop tree aggregation.
    // Don't trigger TreeAggregation when it doesn't save wall-clock time
    while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {
      numPartitions /= scale
      val curNumPartitions = numPartitions
      partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {
        (i, iter) => iter.map((i % curNumPartitions, _))
      }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values
    }
    partiallyAggregated.reduce(cleanCombOp)
  }
}


From the source we can see that treeAggregate first performs a local aggregation inside each partition with Scala's aggregate, and computes a scale factor from the depth parameter. While the number of partition-level results is still large, each partial result is assigned a key via i % curNumPartitions, and the results are repartitioned and merged with reduceByKey; a final reduce then combines the remaining partial results. Increasing the depth therefore spreads the combine work over more tree levels and shrinks the cost of the final reduce.
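
To make the level-by-level collapse concrete, here is a minimal sketch in plain Java (no Spark required) that mirrors the loop above; the partition count of 100 and the depth of 2 are illustrative assumptions, not values from this article:

// Simulate how treeAggregate shrinks the number of partial results.
int numPartitions = 100;
int depth = 2;
// scale = max(ceil(100^(1/2)), 2) = 10
int scale = Math.max((int) Math.ceil(Math.pow(numPartitions, 1.0 / depth)), 2);
// Keep adding tree levels only while doing so saves work overall.
while (numPartitions > scale + Math.ceil((double) numPartitions / scale)) {
    numPartitions /= scale; // 100 -> 10; the loop condition then stops the collapse
    System.out.println("next level holds " + numPartitions + " partial results");
}
// The remaining 10 partial results are combined by one final reduce(combOp).

With these numbers, a single intermediate level of 10 partial results is created, after which the final reduce only has to merge 10 values instead of 100.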


Example:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
// Transformation: convert each integer to its string representation
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
    @Override
    public String call(Integer v1) throws Exception {
        return Integer.toString(v1);
    }
});

String result1 = javaRDD1.treeAggregate("0", new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        System.out.println(v1 + "=seq=" + v2);
        return v1 + "=seq=" + v2;
    }
}, new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        System.out.println(v1 + "<=comb=>" + v2);
        return v1 + "<=comb=>" + v2;
    }
});
System.out.println(result1);
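
Because the RDD above has three partitions, the zero value "0" is folded in once per partition and therefore shows up three times in the final string; unlike plain aggregate, treeAggregate does not merge the zero value in again on the driver. The four-argument overload additionally takes an explicit depth. A minimal sketch, naming the two operations so they can be reused (the depth value 3 is an arbitrary choice for illustration):

Function2<String, String, String> seqOp = new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + "=seq=" + v2;
    }
};
Function2<String, String, String> combOp = new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + "<=comb=>" + v2;
    }
};
// Same aggregation as above, but with an explicit tree depth of 3
// instead of the default depth of 2.
String result2 = javaRDD1.treeAggregate("0", seqOp, combOp, 3);
System.out.println(result2);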


treeReduce

Official documentation:

Reduces the elements of this RDD in a multi-level tree pattern.


Function signatures:

def treeReduce(f: JFunction2[T, T, T], depth: Int): T
def treeReduce(f: JFunction2[T, T, T]): T


treeReduce is similar to treeAggregate; it is effectively a treeAggregate whose seqOp and combOp are the same function.
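
As an illustrative sketch (not from the original article): for a commutative and associative operation such as integer addition, whose neutral element 0 is used as the zero value, treeReduce(f) and treeAggregate(0, f, f) produce the same result:

JavaRDD<Integer> nums = javaSparkContext.parallelize(Arrays.asList(1, 2, 3, 4), 2);
Function2<Integer, Integer, Integer> sum = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer a, Integer b) throws Exception {
        return a + b;
    }
};
Integer viaTreeReduce = nums.treeReduce(sum);               // 10
Integer viaTreeAggregate = nums.treeAggregate(0, sum, sum); // 10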

Source code analysis:

def treeReduce(f: (T, T) => T, depth: Int = 2): T = withScope {
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
  val cleanF = context.clean(f)
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  val partiallyReduced = mapPartitions(it => Iterator(reducePartition(it)))
  val op: (Option[T], Option[T]) => Option[T] = (c, x) => {
    if (c.isDefined && x.isDefined) {
      Some(cleanF(c.get, x.get))
    } else if (c.isDefined) {
      c
    } else if (x.isDefined) {
      x
    } else {
      None
    }
  }
  partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
    .getOrElse(throw new UnsupportedOperationException("empty collection"))
}


From the source we can see that treeReduce first reduces each partition locally with Scala's reduceLeft, then runs treeAggregate over the partially reduced results, with seqOp and combOp being the same function and Option.empty as the zero value (empty partitions simply contribute None). In practice, treeReduce can replace reduce when a single driver-side reduce step would be too expensive: adjusting the depth controls how much data each reduce level has to combine.
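
A minimal sketch of that substitution, under the assumption of a hypothetical wide RDD with many partitions (all names below are illustrative, and java.util.ArrayList is assumed to be imported):

// Build a hypothetical RDD with 100 partitions, where merging all partial
// results on the driver in a single step (as reduce does) is relatively costly.
List<Integer> many = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
    many.add(i);
}
JavaRDD<Integer> wideRDD = javaSparkContext.parallelize(many, 100);
Function2<Integer, Integer, Integer> add = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer a, Integer b) throws Exception {
        return a + b;
    }
};
Integer viaReduce = wideRDD.reduce(add);            // all 100 partial sums merged on the driver
Integer viaTreeReduce = wideRDD.treeReduce(add, 3); // merged in a 3-level tree of smaller combines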

Example:

List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 5);
// Transformation: convert each integer to its string representation
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
    @Override
    public String call(Integer v1) throws Exception {
        return Integer.toString(v1);
    }
});
String result = javaRDD1.treeReduce(new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        System.out.println(v1 + "=" + v2);
        return v1 + "=" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + result);
Tags: spark