您的位置：首页 > 其它

Spark RDD 窄依赖研究

2017-12-15 15:23 357 查看

1..简介

　　spark从RDD依赖上来说分为窄依赖和宽依赖。

其中可以这样区分是哪种依赖：当父RDD的一个partition被子RDD的多个partitions引用到的时候则说明是宽依赖，否则为窄依赖。

宽依赖会触发shuffe，宽依赖也是一个job钟不同stage的分界线。

本篇文章主要讨论一下窄依赖的场景。

2.依赖关系的建立

　　字RDD内部维护着父RDD的依赖关系，下列是依赖的抽象类，其中属性rdd就是父RDD

/** *::DeveloperApi:: *Baseclassfordependencies. */ @DeveloperApi abstractclassDependency[T]extendsSerializable{ defrdd:RDD[T] }

　　

3.窄依赖的三种形式：

　　窄依赖的抽象类如下：

/**
*::DeveloperApi::
*BaseclassfordependencieswhereeachpartitionofthechildRDDdependsonasmallnumber
*ofpartitionsoftheparentRDD.Narrowdependenciesallowforpipelinedexecution.
*/
@DeveloperApi
abstractclassNarrowDependency[T](_rdd:RDD[T])extendsDependency[T]{
/**
*Gettheparentpartitionsforachildpartition.
*@parampartitionIdapartitionofthechildRDD
*@returnthepartitionsoftheparentRDDthatthechildpartitiondependsupon
*/
defgetParents(partitionId:Int):Seq[Int]

overridedefrdd:RDD[T]=_rdd
}

　　窄依赖形式一：MAP，Filter....

如上两个RDD的转换时通过MAP或者Filter等转换的，RDD的各个partition都是一一对应的，从执行时可以并行化的。

子RDD的分区依赖的父RDD的分区ID是一样不会有变化，这样的窄依赖实现类如下：

/**
*::DeveloperApi::
*Representsaone-to-onedependencybetweenpartitionsoftheparentandchildRDDs.
*/
@DeveloperApi
classOneToOneDependency[T](rdd:RDD[T])extendsNarrowDependency[T](rdd){
overridedefgetParents(partitionId:Int):List[Int]=List(partitionId)//子RDD的某个分区ID是和父RDD的分区ID是一致的
}

　　

　　窄依赖方式二：UNION

先来看看其实现类：

/**
*::DeveloperApi::
*Representsaone-to-onedependencybetweenrangesofpartitionsintheparentandchildRDDs.
*@paramrddtheparentRDD
*@paraminStartthestartoftherangeintheparentRDD
*@paramoutStartthestartoftherangeinthechildRDD
*@paramlengththelengthoftherange
*/
@DeveloperApi
classRangeDependency[T](rdd:RDD[T],inStart:Int,outStart:Int,length:Int)
extendsNarrowDependency[T](rdd){

overridedefgetParents(partitionId:Int):List[Int]={
if(partitionId>=outStart&&partitionId<outStart+length){
List(partitionId-outStart+inStart)
}else{
Nil
}
}

一开始并不好理解上述代码，可参考下图，下图中将各个参数的意义图形化展示：

所以上述中子RDD分区中的位号（partitionid）和父RDD的位置号（partitionid）相对的差值（outStart-inStart）

if(partitionId>=outStart&&partitionId<outStart+length)这段代码的意义：检查当前子RDD分区ID是否在当前父RDD下的范围内

partitionId-outStart+inStart的意思是：当前子RDD分区id（位置号）与差值相减得出其在父RDD上的分区位置号（id）其实就是：partitionId-（outStart-inStart）

窄依赖方式三：joinwithinputsco-partitioned
此场景适用于窄依赖方式一。

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航