
Spark cluster setup and basic operations

2019-06-02 21:44
Copyright: @抛物线 https://blog.csdn.net/qq_28513801/article/details/90744008

Spark

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing.
It was open-sourced by UC Berkeley's AMP Lab (the Algorithms, Machines, and People Lab) as a general parallel framework in the style of Hadoop MapReduce.
Spark retains the advantages of Hadoop MapReduce, but unlike MapReduce, intermediate job results can be kept in memory, so they no longer need to be written to and read back from HDFS.
This makes Spark a much better fit for algorithms that iterate over the same data, such as those used in data mining and machine learning.

Spark is an open-source cluster computing environment similar to Hadoop,
but the differences between the two make Spark superior for certain workloads:
Spark works with in-memory distributed datasets, which, besides enabling interactive queries, allow it to optimize iterative workloads.

Spark is implemented in Scala and uses Scala as its application framework.
Unlike Hadoop, Spark and Scala are tightly integrated, so Scala code can manipulate distributed datasets as easily as local collection objects.

Although Spark was created to support iterative jobs over distributed datasets,
it is in practice a complement to Hadoop and can run in parallel on the Hadoop file system.
This deployment mode is supported through a third-party cluster framework called Mesos.

Spark was developed by the AMP Lab at UC Berkeley
and can be used to build large-scale, low-latency data analysis applications.
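The in-memory advantage described above matters most when a dataset is reused across iterations. A minimal spark-shell sketch of that pattern follows; the variable names and the loop are illustrative additions, not from the original post (`sc` is the SparkContext that spark-shell creates automatically):

```scala
// Hedged sketch: reusing a cached RDD across iterations in spark-shell.
val data = sc.parallelize(1 to 1000000)

// cache() marks the RDD to be kept in memory after the first action,
// so subsequent passes read it from RAM instead of recomputing it.
val cached = data.cache()

var total = 0L
for (i <- 1 to 10) {
  // Every pass after the first hits the in-memory copy of `cached`.
  total += cached.filter(_ % (i + 1) == 0).count()
}
```

With plain MapReduce, each of these ten passes would re-read the input from HDFS; with a cached RDD, only the first pass pays the materialization cost.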






After starting spark-shell, load the data 1,2,3,4,5,6,7,8,9,10 in Scala, find the numbers whose doubled value is divisible by 3, and use the toDebugString method to inspect the RDD lineage. Submit the commands and output as text in the answer box.

scala> val num = sc.parallelize(1 to 10)
num: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> val doublenum = num.map(_*2)
doublenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:29

scala> val threenum = doublenum.filter(_%3==0)
threenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:31
scala> threenum.collect
19/02/11 13:06:18 INFO SparkContext: Starting job: collect at <console>:34
19/02/11 13:06:18 INFO DAGScheduler: Got job 0 (collect at <console>:34) with 2 output partitions
19/02/11 13:06:18 INFO DAGScheduler: Final stage: ResultStage 0 (collect at <console>:34)
19/02/11 13:06:18 INFO DAGScheduler: Parents of final stage: List()
19/02/11 13:06:18 INFO DAGScheduler: Missing parents: List()
19/02/11 13:06:18 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:31), which has no missing parents
19/02/11 13:06:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.2 KB, free 511.1 MB)
19/02/11 13:06:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1333.0 B, free 511.1 MB)
19/02/11 13:06:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:58532 (size: 1333.0 B, free: 511.1 MB)
19/02/11 13:06:18 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
19/02/11 13:06:18 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:31)
19/02/11 13:06:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/02/11 13:06:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2078 bytes)
19/02/11 13:06:18 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2135 bytes)
19/02/11 13:06:18 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/02/11 13:06:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/11 13:06:18 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 902 bytes result sent to driver
19/02/11 13:06:18 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 906 bytes result sent to driver
19/02/11 13:06:18 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 207 ms on localhost (1/2)
19/02/11 13:06:18 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 172 ms on localhost (2/2)
19/02/11 13:06:18 INFO DAGScheduler: ResultStage 0 (collect at <console>:34) finished in 0.275 s
19/02/11 13:06:18 INFO DAGScheduler: Job 0 finished: collect at <console>:34, took 0.714041 s
19/02/11 13:06:18 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
res2: Array[Int] = Array(6, 12, 18)

scala> threenum.toDebugString
res3: String =
(2) MapPartitionsRDD[2] at filter at <console>:31 []
|  MapPartitionsRDD[1] at map at <console>:29 []
|  ParallelCollectionRDD[0] at parallelize at <console>:27 []

scala>
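The pipeline above can be sanity-checked without a cluster, since `map` and `filter` on a local Scala collection mirror the corresponding RDD transformations. This local check is an addition for illustration, not part of the original transcript:

```scala
// Local-collection mirror of the spark-shell pipeline:
// double 1..10, then keep the doubled values divisible by 3.
val num = (1 to 10).toList
val doublenum = num.map(_ * 2)
val threenum = doublenum.filter(_ % 3 == 0)

println(threenum)  // List(6, 12, 18) — matches the RDD collect result
```

The lineage shown by `toDebugString` reads bottom-up: the `ParallelCollectionRDD` feeds the `map`, which feeds the `filter`, exactly the order the local version applies them in.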

3. After starting spark-shell, load the Key-Value data ("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5) in Scala, sort the data in ascending order by key, and group it by key. Submit the commands and output as text in the answer box.

scala> val kv=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5)))
kv: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:27

scala> kv.sortByKey().collect
19/02/11 13:20:10 INFO SparkContext: Starting job: sortByKey at <console>:30
19/02/11 13:20:10 INFO DAGScheduler: Got job 1 (sortByKey at <console>:30) with 2 output partitions
19/02/11 13:20:10 INFO DAGScheduler: Final stage: ResultStage 1 (sortByKey at <console>:30)
19/02/11 13:20:10 INFO DAGScheduler: Parents of final stage: List()
19/02/11 13:20:10 INFO DAGScheduler: Missing parents: List()
19/02/11 13:20:10 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1398.0 B, free 511.1 MB)
19/02/11 13:20:10 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:58532 (size: 1398.0 B, free: 511.1 MB)
19/02/11 13:20:10 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008
19/02/11 13:20:10 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at sortByKey at <console>:30)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
19/02/11 13:20:10 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2261 bytes)
19/02/11 13:20:10 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2255 bytes)
19/02/11 13:20:10 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
19/02/11 13:20:10 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
19/02/11 13:20:10 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1179 bytes result sent to driver
19/02/11 13:20:10 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1178 bytes result sent to driver
19/02/11 13:20:10 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 71 ms on localhost (1/2)
19/02/11 13:20:10 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 61 ms on localhost (2/2)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
19/02/11 13:20:10 INFO DAGScheduler: ResultStage 1 (sortByKey at <console>:30) finished in 0.075 s
19/02/11 13:20:10 INFO DAGScheduler: Job 1 finished: sortByKey at <console>:30, took 0.095223 s
19/02/11 13:20:10 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:20:10 INFO DAGScheduler: Registering RDD 3 (parallelize at <console>:27)
19/02/11 13:20:10 INFO DAGScheduler: Got job 2 (collect at <console>:30) with 2 output partitions
19/02/11 13:20:10 INFO DAGScheduler: Final stage: ResultStage 3 (collect at <console>:30)
19/02/11 13:20:10 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
19/02/11 13:20:10 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 2)
19/02/11 13:20:10 INFO DAGScheduler: Submitting ShuffleMapStage 2 (ParallelCollectionRDD[3] at parallelize at <console>:27), which has no missing parents
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1427.0 B, free 511.1 MB)
19/02/11 13:20:10 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:58532 (size: 1427.0 B, free: 511.1 MB)
19/02/11 13:20:10 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008
19/02/11 13:20:10 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 2 (ParallelCollectionRDD[3] at parallelize at <console>:27)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
19/02/11 13:20:10 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, localhost, partition 0,PROCESS_LOCAL, 2250 bytes)
19/02/11 13:20:10 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5, localhost, partition 1,PROCESS_LOCAL, 2244 bytes)
19/02/11 13:20:10 INFO Executor: Running task 0.0 in stage 2.0 (TID 4)
19/02/11 13:20:10 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
19/02/11 13:20:10 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1159 bytes result sent to driver
19/02/11 13:20:10 INFO Executor: Finished task 1.0 in stage 2.0 (TID 5). 1159 bytes result sent to driver
19/02/11 13:20:10 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 70 ms on localhost (1/2)
19/02/11 13:20:10 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 69 ms on localhost (2/2)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
19/02/11 13:20:10 INFO DAGScheduler: ShuffleMapStage 2 (parallelize at <console>:27) finished in 0.074 s
19/02/11 13:20:10 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:20:10 INFO DAGScheduler: running: Set()
19/02/11 13:20:10 INFO DAGScheduler: waiting: Set(ResultStage 3)
19/02/11 13:20:10 INFO DAGScheduler: failed: Set()
19/02/11 13:20:10 INFO DAGScheduler: Submitting ResultStage 3 (ShuffledRDD[6] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.9 KB, free 511.1 MB)
19/02/11 13:20:10 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1770.0 B, free 511.1 MB)
19/02/11 13:20:10 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:58532 (size: 1770.0 B, free: 511.1 MB)
19/02/11 13:20:10 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1008
19/02/11 13:20:10 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 3 (ShuffledRDD[6] at sortByKey at <console>:30)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
19/02/11 13:20:10 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:20:10 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 7, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:20:10 INFO Executor: Running task 0.0 in stage 3.0 (TID 6)
19/02/11 13:20:10 INFO Executor: Running task 1.0 in stage 3.0 (TID 7)
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 18 ms
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:20:10 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 28 ms
19/02/11 13:20:10 INFO Executor: Finished task 0.0 in stage 3.0 (TID 6). 1433 bytes result sent to driver
19/02/11 13:20:10 INFO Executor: Finished task 1.0 in stage 3.0 (TID 7). 1347 bytes result sent to driver
19/02/11 13:20:10 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 6) in 295 ms on localhost (1/2)
19/02/11 13:20:10 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 7) in 295 ms on localhost (2/2)
19/02/11 13:20:10 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
19/02/11 13:20:10 INFO DAGScheduler: ResultStage 3 (collect at <console>:30) finished in 0.300 s
19/02/11 13:20:10 INFO DAGScheduler: Job 2 finished: collect at <console>:30, took 0.456128 s
res5: Array[(String, Int)] = Array((A,1), (A,4), (A,3), (A,9), (B,2), (B,5), (B,4), (C,3), (C,4), (D,5))

scala>
scala> kv.groupByKey().collect
19/02/11 13:21:15 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:21:15 INFO DAGScheduler: Registering RDD 3 (parallelize at <console>:27)
19/02/11 13:21:15 INFO DAGScheduler: Got job 3 (collect at <console>:30) with 2 output partitions
19/02/11 13:21:15 INFO DAGScheduler: Final stage: ResultStage 5 (collect at <console>:30)
19/02/11 13:21:15 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 4)
19/02/11 13:21:15 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 4)
19/02/11 13:21:15 INFO DAGScheduler: Submitting ShuffleMapStage 4 (ParallelCollectionRDD[3] at parallelize at <console>:27), which has no missing parents
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.9 KB, free 511.1 MB)
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1644.0 B, free 511.1 MB)
19/02/11 13:21:15 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:58532 (size: 1644.0 B, free: 511.1 MB)
19/02/11 13:21:15 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1008
19/02/11 13:21:15 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 4 (ParallelCollectionRDD[3] at parallelize at <console>:27)
19/02/11 13:21:15 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
19/02/11 13:21:15 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, localhost, partition 0,PROCESS_LOCAL, 2250 bytes)
19/02/11 13:21:15 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 9, localhost, partition 1,PROCESS_LOCAL, 2244 bytes)
19/02/11 13:21:15 INFO Executor: Running task 0.0 in stage 4.0 (TID 8)
19/02/11 13:21:15 INFO Executor: Finished task 0.0 in stage 4.0 (TID 8). 1159 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 52 ms on localhost (1/2)
19/02/11 13:21:15 INFO Executor: Running task 1.0 in stage 4.0 (TID 9)
19/02/11 13:21:15 INFO Executor: Finished task 1.0 in stage 4.0 (TID 9). 1159 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 9) in 75 ms on localhost (2/2)
19/02/11 13:21:15 INFO DAGScheduler: ShuffleMapStage 4 (parallelize at <console>:27) finished in 0.076 s
19/02/11 13:21:15 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
19/02/11 13:21:15 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:21:15 INFO DAGScheduler: running: Set()
19/02/11 13:21:15 INFO DAGScheduler: waiting: Set(ResultStage 5)
19/02/11 13:21:15 INFO DAGScheduler: failed: Set()
19/02/11 13:21:15 INFO DAGScheduler: Submitting ResultStage 5 (ShuffledRDD[7] at groupByKey at <console>:30), which has no missing parents
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.9 KB, free 511.1 MB)
19/02/11 13:21:15 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.1 KB, free 511.1 MB)
19/02/11 13:21:15 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:58532 (size: 2.1 KB, free: 511.1 MB)
19/02/11 13:21:15 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1008
19/02/11 13:21:15 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 5 (ShuffledRDD[7] at groupByKey at <console>:30)
19/02/11 13:21:15 INFO TaskSchedulerImpl: Adding task set 5.0 with 2 tasks
19/02/11 13:21:15 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 10, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:21:15 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 11, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:21:15 INFO Executor: Running task 0.0 in stage 5.0 (TID 10)
19/02/11 13:21:15 INFO Executor: Running task 1.0 in stage 5.0 (TID 11)
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:21:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:21:15 INFO Executor: Finished task 0.0 in stage 5.0 (TID 10). 1780 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 10) in 84 ms on localhost (1/2)
19/02/11 13:21:15 INFO Executor: Finished task 1.0 in stage 5.0 (TID 11). 1789 bytes result sent to driver
19/02/11 13:21:15 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 11) in 103 ms on localhost (2/2)
19/02/11 13:21:15 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
19/02/11 13:21:15 INFO DAGScheduler: ResultStage 5 (collect at <console>:30) finished in 0.113 s
19/02/11 13:21:15 INFO DAGScheduler: Job 3 finished: collect at <console>:30, took 0.252907 s
res6: Array[(String, Iterable[Int])] = Array((B,CompactBuffer(2, 5, 4)), (D,CompactBuffer(5)), (A,CompactBuffer(1, 4, 3, 9)), (C,CompactBuffer(3, 4)))

scala>
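The same results can be reproduced on a local Scala collection, where `sortBy` and `groupBy` play the roles of the RDD's `sortByKey()` and `groupByKey()`. This verification is an addition for illustration, not part of the original transcript:

```scala
// Local mirror of the key-value exercise.
val kv = List(("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),
              ("A",3),("A",9),("B",4),("D",5))

// sortBy on a stable local sort corresponds to the RDD's sortByKey():
// pairs with equal keys keep their original relative order.
val sorted = kv.sortBy(_._1)

// groupBy collects values per key, like the RDD's groupByKey()
// (a local groupBy returns a Map, so key ordering differs from the RDD output).
val grouped = kv.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }
```

Note that `groupByKey` makes no ordering guarantee across keys, which is why the transcript shows the groups in the order (B, D, A, C) rather than sorted.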

4. After starting spark-shell, load the Key-Value data ("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5) in Scala, sort the data in ascending order by key, and sum the values for each key. Submit the commands and output as text in the answer box.

scala> val kv1=sc.parallelize(List(("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5)))
kv1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> kv1.sortByKey().collect

19/02/11 13:31:01 INFO SparkContext: Starting job: sortByKey at <console>:30
19/02/11 13:31:01 INFO DAGScheduler: Got job 0 (sortByKey at <console>:30) with 2 output partitions
19/02/11 13:31:01 INFO DAGScheduler: Final stage: ResultStage 0 (sortByKey at <console>:30)
19/02/11 13:31:01 INFO DAGScheduler: Parents of final stage: List()
19/02/11 13:31:01 INFO DAGScheduler: Missing parents: List()
19/02/11 13:31:01 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:31:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:31:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1408.0 B, free 511.1 MB)
19/02/11 13:31:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:49365 (size: 1408.0 B, free: 511.1 MB)
19/02/11 13:31:03 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1008
19/02/11 13:31:03 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at sortByKey at <console>:30)
19/02/11 13:31:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/02/11 13:31:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2260 bytes)
19/02/11 13:31:03 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2249 bytes)
19/02/11 13:31:03 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/11 13:31:03 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/02/11 13:31:04 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1178 bytes result sent to driver
19/02/11 13:31:04 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1177 bytes result sent to driver
19/02/11 13:31:04 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 657 ms on localhost (1/2)
19/02/11 13:31:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 582 ms on localhost (2/2)
19/02/11 13:31:04 INFO DAGScheduler: ResultStage 0 (sortByKey at <console>:30) finished in 0.814 s
19/02/11 13:31:04 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/02/11 13:31:04 INFO DAGScheduler: Job 0 finished: sortByKey at <console>:30, took 2.947240 s
19/02/11 13:31:04 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:31:04 INFO DAGScheduler: Registering RDD 0 (parallelize at <console>:27)
19/02/11 13:31:04 INFO DAGScheduler: Got job 1 (collect at <console>:30) with 2 output partitions
19/02/11 13:31:04 INFO DAGScheduler: Final stage: ResultStage 2 (collect at <console>:30)
19/02/11 13:31:04 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
19/02/11 13:31:04 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
19/02/11 13:31:04 INFO DAGScheduler: Submitting ShuffleMapStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 511.1 MB)
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1424.0 B, free 511.1 MB)
19/02/11 13:31:04 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:49365 (size: 1424.0 B, free: 511.1 MB)
19/02/11 13:31:04 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1008
19/02/11 13:31:04 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 1 (ParallelCollectionRDD[0] at parallelize at <console>:27)
19/02/11 13:31:04 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
19/02/11 13:31:04 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
19/02/11 13:31:04 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2238 bytes)
19/02/11 13:31:04 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
19/02/11 13:31:04 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
19/02/11 13:31:04 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1159 bytes result sent to driver
19/02/11 13:31:04 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1159 bytes result sent to driver
19/02/11 13:31:04 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 176 ms on localhost (1/2)
19/02/11 13:31:04 INFO DAGScheduler: ShuffleMapStage 1 (parallelize at <console>:27) finished in 0.225 s
19/02/11 13:31:04 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:31:04 INFO DAGScheduler: running: Set()
19/02/11 13:31:04 INFO DAGScheduler: waiting: Set(ResultStage 2)
19/02/11 13:31:04 INFO DAGScheduler: failed: Set()
19/02/11 13:31:04 INFO DAGScheduler: Submitting ResultStage 2 (ShuffledRDD[3] at sortByKey at <console>:30), which has no missing parents
19/02/11 13:31:04 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 209 ms on localhost (2/2)
19/02/11 13:31:04 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 511.1 MB)
19/02/11 13:31:04 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1770.0 B, free 511.1 MB)
19/02/11 13:31:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:49365 (size: 1770.0 B, free: 511.1 MB)
19/02/11 13:31:04 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1008
19/02/11 13:31:04 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 2 (ShuffledRDD[3] at sortByKey at <console>:30)
19/02/11 13:31:04 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
19/02/11 13:31:04 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:31:04 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:31:04 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
19/02/11 13:31:04 INFO Executor: Running task 0.0 in stage 2.0 (TID 4)
19/02/11 13:31:04 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:31:04 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
19/02/11 13:31:05 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:31:05 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 17 ms
19/02/11 13:31:05 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1391 bytes result sent to driver
19/02/11 13:31:05 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 512 ms on localhost (1/2)
19/02/11 13:31:05 INFO Executor: Finished task 1.0 in stage 2.0 (TID 5). 1385 bytes result sent to driver
19/02/11 13:31:05 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 530 ms on localhost (2/2)
19/02/11 13:31:05 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
19/02/11 13:31:05 INFO DAGScheduler: ResultStage 2 (collect at <console>:30) finished in 0.541 s
19/02/11 13:31:05 INFO DAGScheduler: Job 1 finished: collect at <console>:30, took 0.940656 s
res0: Array[(String, Int)] = Array((A,1), (A,8), (B,3), (B,7), (B,4), (C,5), (C,4), (D,4), (D,5), (E,5))
scala> kv1.reduceByKey(_+_).collect
19/02/11 13:33:00 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:33:00 INFO DAGScheduler: Registering RDD 0 (parallelize at <console>:27)
19/02/11 13:33:00 INFO DAGScheduler: Got job 2 (collect at <console>:30) with 2 output partitions
19/02/11 13:33:00 INFO DAGScheduler: Final stage: ResultStage 4 (collect at <console>:30)
19/02/11 13:33:00 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3)
19/02/11 13:33:00 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 3)
19/02/11 13:33:00 INFO DAGScheduler: Submitting ShuffleMapStage 3 (ParallelCollectionRDD[0] at parallelize at <console>:27), which has no missing parents
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.0 KB, free 511.1 MB)
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1291.0 B, free 511.1 MB)
19/02/11 13:33:00 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:49365 (size: 1291.0 B, free: 511.1 MB)
19/02/11 13:33:00 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1008
19/02/11 13:33:00 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 3 (ParallelCollectionRDD[0] at parallelize at <console>:27)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
19/02/11 13:33:00 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
19/02/11 13:33:00 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 7, localhost, partition 1,PROCESS_LOCAL, 2238 bytes)
19/02/11 13:33:00 INFO Executor: Running task 0.0 in stage 3.0 (TID 6)
19/02/11 13:33:00 INFO Executor: Running task 1.0 in stage 3.0 (TID 7)
19/02/11 13:33:00 INFO Executor: Finished task 1.0 in stage 3.0 (TID 7). 1159 bytes result sent to driver
19/02/11 13:33:00 INFO Executor: Finished task 0.0 in stage 3.0 (TID 6). 1159 bytes result sent to driver
19/02/11 13:33:00 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 7) in 52 ms on localhost (1/2)
19/02/11 13:33:00 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 6) in 52 ms on localhost (2/2)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
19/02/11 13:33:00 INFO DAGScheduler: ShuffleMapStage 3 (parallelize at <console>:27) finished in 0.054 s
19/02/11 13:33:00 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:33:00 INFO DAGScheduler: running: Set()
19/02/11 13:33:00 INFO DAGScheduler: waiting: Set(ResultStage 4)
19/02/11 13:33:00 INFO DAGScheduler: failed: Set()
19/02/11 13:33:00 INFO DAGScheduler: Submitting ResultStage 4 (ShuffledRDD[4] at reduceByKey at <console>:30), which has no missing parents
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.7 KB, free 511.1 MB)
19/02/11 13:33:00 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1609.0 B, free 511.1 MB)
19/02/11 13:33:00 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:49365 (size: 1609.0 B, free: 511.1 MB)
19/02/11 13:33:00 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1008
19/02/11 13:33:00 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 4 (ShuffledRDD[4] at reduceByKey at <console>:30)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
19/02/11 13:33:00 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:33:00 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 9, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:33:00 INFO Executor: Running task 0.0 in stage 4.0 (TID 8)
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:33:00 INFO Executor: Finished task 0.0 in stage 4.0 (TID 8). 1327 bytes result sent to driver
19/02/11 13:33:00 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 29 ms on localhost (1/2)
19/02/11 13:33:00 INFO Executor: Running task 1.0 in stage 4.0 (TID 9)
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:33:00 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:33:00 INFO Executor: Finished task 1.0 in stage 4.0 (TID 9). 1342 bytes result sent to driver
19/02/11 13:33:00 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 9) in 43 ms on localhost (2/2)
19/02/11 13:33:00 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
19/02/11 13:33:00 INFO DAGScheduler: ResultStage 4 (collect at <console>:30) finished in 0.051 s
19/02/11 13:33:00 INFO DAGScheduler: Job 2 finished: collect at <console>:30, took 0.139948 s
res1: Array[(String, Int)] = Array((B,14), (D,9), (A,9), (C,9), (E,5))

scala>
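Here too the result can be checked against a local collection: summing the values grouped under each key mirrors the RDD's `reduceByKey(_+_)`. This check is an addition for illustration, not part of the original transcript:

```scala
// Local mirror of the reduceByKey exercise.
val kv1 = List(("A",1),("B",3),("C",5),("D",4),("B",7),
               ("C",4),("E",5),("A",8),("B",4),("D",5))

// Summing values per key reproduces reduceByKey(_ + _).
val sums = kv1.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2).sum) }

println(sums)  // e.g. A -> 9, B -> 14, C -> 9, D -> 9, E -> 5
```

On a cluster, `reduceByKey` is usually preferred over `groupByKey` followed by a sum, because it combines values on each partition before the shuffle and therefore moves far less data.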

5. After starting spark-shell, load the Key-Value data ("A",4),("A",2),("C",3),("A",4),("B",5),("C",3),("A",4) in Scala, deduplicate the data by key, and use the toDebugString method to inspect the RDD lineage. Submit the commands and output as text in the answer box.

scala> val kv2=sc.parallelize(List(("A",4),("A",2),("C",3),("A",4),("B",5),("C",3),("A",4)))
kv2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:27

scala> kv2.distinct.collect
19/02/11 13:42:15 INFO SparkContext: Starting job: collect at <console>:30
19/02/11 13:42:15 INFO DAGScheduler: Registering RDD 6 (distinct at <console>:30)
19/02/11 13:42:15 INFO DAGScheduler: Got job 3 (collect at <console>:30) with 2 output partitions
19/02/11 13:42:15 INFO DAGScheduler: Final stage: ResultStage 6 (collect at <console>:30)
19/02/11 13:42:15 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 5)
19/02/11 13:42:15 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 5)
19/02/11 13:42:15 INFO DAGScheduler: Submitting ShuffleMapStage 5 (MapPartitionsRDD[6] at distinct at <console>:30), which has no missing parents
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.7 KB, free 511.1 MB)
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1618.0 B, free 511.1 MB)
19/02/11 13:42:15 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:49365 (size: 1618.0 B, free: 511.1 MB)
19/02/11 13:42:15 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1008
19/02/11 13:42:15 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 5 (MapPartitionsRDD[6] at distinct at <console>:30)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Adding task set 5.0 with 2 tasks
19/02/11 13:42:15 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 10, localhost, partition 0,PROCESS_LOCAL, 2209 bytes)
19/02/11 13:42:15 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 11, localhost, partition 1,PROCESS_LOCAL, 2224 bytes)
19/02/11 13:42:15 INFO Executor: Running task 1.0 in stage 5.0 (TID 11)
19/02/11 13:42:15 INFO Executor: Running task 0.0 in stage 5.0 (TID 10)
19/02/11 13:42:15 INFO Executor: Finished task 1.0 in stage 5.0 (TID 11). 1159 bytes result sent to driver
19/02/11 13:42:15 INFO Executor: Finished task 0.0 in stage 5.0 (TID 10). 1159 bytes result sent to driver
19/02/11 13:42:15 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 11) in 93 ms on localhost (1/2)
19/02/11 13:42:15 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 10) in 97 ms on localhost (2/2)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
19/02/11 13:42:15 INFO DAGScheduler: ShuffleMapStage 5 (distinct at <console>:30) finished in 0.100 s
19/02/11 13:42:15 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:42:15 INFO DAGScheduler: running: Set()
19/02/11 13:42:15 INFO DAGScheduler: waiting: Set(ResultStage 6)
19/02/11 13:42:15 INFO DAGScheduler: failed: Set()
19/02/11 13:42:15 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[8] at distinct at <console>:30), which has no missing parents
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.3 KB, free 511.1 MB)
19/02/11 13:42:15 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 1930.0 B, free 511.1 MB)
19/02/11 13:42:15 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:49365 (size: 1930.0 B, free: 511.1 MB)
19/02/11 13:42:15 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1008
19/02/11 13:42:15 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 6 (MapPartitionsRDD[8] at distinct at <console>:30)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Adding task set 6.0 with 2 tasks
19/02/11 13:42:15 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 12, localhost, partition 0,NODE_LOCAL, 1894 bytes)
19/02/11 13:42:15 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 13, localhost, partition 1,NODE_LOCAL, 1894 bytes)
19/02/11 13:42:15 INFO Executor: Running task 0.0 in stage 6.0 (TID 12)
19/02/11 13:42:15 INFO Executor: Running task 1.0 in stage 6.0 (TID 13)
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:42:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:42:15 INFO Executor: Finished task 1.0 in stage 6.0 (TID 13). 1347 bytes result sent to driver
19/02/11 13:42:15 INFO Executor: Finished task 0.0 in stage 6.0 (TID 12). 1307 bytes result sent to driver
19/02/11 13:42:15 INFO TaskSetManager: Finished task 1.0 in stage 6.0 (TID 13) in 29 ms on localhost (1/2)
19/02/11 13:42:15 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 12) in 34 ms on localhost (2/2)
19/02/11 13:42:15 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
19/02/11 13:42:15 INFO DAGScheduler: ResultStage 6 (collect at <console>:30) finished in 0.037 s
19/02/11 13:42:15 INFO DAGScheduler: Job 3 finished: collect at <console>:30, took 0.220315 s
res2: Array[(String, Int)] = Array((A,4), (B,5), (C,3), (A,2))

scala>
scala> kv2.toDebugString
res3: String = (2) ParallelCollectionRDD[5] at parallelize at <console>:27 []

scala>
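Note that `toDebugString` above was called on the source RDD `kv2`, whose lineage is a single `ParallelCollectionRDD`; calling it on the `distinct` result would also show the shuffle stages visible in the log. As a side note, the deduplication itself can be modeled in plain Scala (outside Spark; Spark may return the elements in a different order):

```scala
// Plain-Scala model of the distinct operation on the same data.
val kv2   = List(("A", 4), ("A", 2), ("C", 3), ("A", 4), ("B", 5), ("C", 3), ("A", 4))
val dedup = kv2.distinct

println(dedup)
// The four unique pairs remain: (A,4), (A,2), (C,3), (B,5).
```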

6. After starting spark-shell, load two sets of key-value pairs in Scala: ("A",1),("B",2),("C",3),("A",4),("B",5) and ("A",1),("B",2),("C",3),("A",4),("B",5). JOIN the two datasets on their keys, then submit the above commands and the result as text in the answer box.

scala> val kv4=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))
kv4: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:27

scala> val kv5=sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))
kv5: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:27

scala> kv4.join(kv5).collect
19/02/11 13:47:38 INFO SparkContext: Starting job: collect at <console>:32
19/02/11 13:47:38 INFO DAGScheduler: Registering RDD 9 (parallelize at <console>:27)
19/02/11 13:47:38 INFO DAGScheduler: Registering RDD 10 (parallelize at <console>:27)
19/02/11 13:47:38 INFO DAGScheduler: Got job 4 (collect at <console>:32) with 2 output partitions
19/02/11 13:47:38 INFO DAGScheduler: Final stage: ResultStage 9 (collect at <console>:32)
19/02/11 13:47:38 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 7, ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 7, ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: Submitting ShuffleMapStage 7 (ParallelCollectionRDD[9] at parallelize at <console>:27), which has no missing parents
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 1864.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 1188.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:49365 (size: 1188.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 7 (ParallelCollectionRDD[9] at parallelize at <console>:27)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 7.0 with 2 tasks
19/02/11 13:47:38 INFO DAGScheduler: Submitting ShuffleMapStage 8 (ParallelCollectionRDD[10] at parallelize at <console>:27), which has no missing parents
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 1864.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 1188.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:49365 (size: 1188.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 8 (ParallelCollectionRDD[10] at parallelize at <console>:27)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 8.0 with 2 tasks
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 14, localhost, partition 0,PROCESS_LOCAL, 2188 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 15, localhost, partition 1,PROCESS_LOCAL, 2208 bytes)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 7.0 (TID 14)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 7.0 (TID 15)
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 7.0 (TID 15). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 7.0 (TID 14). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 16, localhost, partition 0,PROCESS_LOCAL, 2188 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 17, localhost, partition 1,PROCESS_LOCAL, 2208 bytes)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 8.0 (TID 16)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 8.0 (TID 17)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 14) in 133 ms on localhost (1/2)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 7.0 (TID 15) in 134 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool
19/02/11 13:47:38 INFO DAGScheduler: ShuffleMapStage 7 (parallelize at <console>:27) finished in 0.177 s
19/02/11 13:47:38 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:47:38 INFO DAGScheduler: running: Set(ShuffleMapStage 8)
19/02/11 13:47:38 INFO DAGScheduler: waiting: Set(ResultStage 9)
19/02/11 13:47:38 INFO DAGScheduler: failed: Set()
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 8.0 (TID 16). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 8.0 (TID 17). 1159 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 16) in 68 ms on localhost (1/2)
19/02/11 13:47:38 INFO DAGScheduler: ShuffleMapStage 8 (parallelize at <console>:27) finished in 0.166 s
19/02/11 13:47:38 INFO DAGScheduler: looking for newly runnable stages
19/02/11 13:47:38 INFO DAGScheduler: running: Set()
19/02/11 13:47:38 INFO DAGScheduler: waiting: Set(ResultStage 9)
19/02/11 13:47:38 INFO DAGScheduler: failed: Set()
19/02/11 13:47:38 INFO DAGScheduler: Submitting ResultStage 9 (MapPartitionsRDD[13] at join at <console>:32), which has no missing parents
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 8.0 (TID 17) in 65 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 3.2 KB, free 511.1 MB)
19/02/11 13:47:38 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 1814.0 B, free 511.1 MB)
19/02/11 13:47:38 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:49365 (size: 1814.0 B, free: 511.1 MB)
19/02/11 13:47:38 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1008
19/02/11 13:47:38 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 9 (MapPartitionsRDD[13] at join at <console>:32)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Adding task set 9.0 with 2 tasks
19/02/11 13:47:38 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID 18, localhost, partition 0,PROCESS_LOCAL, 1967 bytes)
19/02/11 13:47:38 INFO TaskSetManager: Starting task 1.0 in stage 9.0 (TID 19, localhost, partition 1,PROCESS_LOCAL, 1967 bytes)
19/02/11 13:47:38 INFO Executor: Running task 1.0 in stage 9.0 (TID 19)
19/02/11 13:47:38 INFO Executor: Running task 0.0 in stage 9.0 (TID 18)
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
19/02/11 13:47:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/02/11 13:47:38 INFO Executor: Finished task 1.0 in stage 9.0 (TID 19). 1453 bytes result sent to driver
19/02/11 13:47:38 INFO Executor: Finished task 0.0 in stage 9.0 (TID 18). 1417 bytes result sent to driver
19/02/11 13:47:38 INFO TaskSetManager: Finished task 1.0 in stage 9.0 (TID 19) in 184 ms on localhost (1/2)
19/02/11 13:47:38 INFO TaskSetManager: Finished task 0.0 in stage 9.0 (TID 18) in 207 ms on localhost (2/2)
19/02/11 13:47:38 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
19/02/11 13:47:38 INFO DAGScheduler: ResultStage 9 (collect at <console>:32) finished in 0.212 s
19/02/11 13:47:38 INFO DAGScheduler: Job 4 finished: collect at <console>:32, took 0.671400 s
res4: Array[(String, (Int, Int))] = Array((B,(2,2)), (B,(2,5)), (B,(5,2)), (B,(5,5)), (A,(1,1)), (A,(1,4)), (A,(4,1)), (A,(4,4)), (C,(3,3)))

scala>
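The join result above pairs every value of a key on the left with every value of the same key on the right, which is why key "A" (two occurrences per side) produces four pairs. A minimal plain-Scala sketch of this pair-wise join semantics (a non-Spark model; the data mirrors the transcript):

```scala
// Illustrative (non-Spark) model of pair-RDD join semantics:
// for each key, emit the cross product of left and right values.
val left  = List(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5))
val right = List(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5))

val joined = for {
  (k1, v1) <- left
  (k2, v2) <- right
  if k1 == k2
} yield (k1, (v1, v2))

println(joined)
// "A" and "B" each occur twice per side (2 x 2 = 4 pairs each),
// "C" once per side, for 9 pairs in total, matching res4 above.
```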

7. Log in to spark-shell, define i as 1 and sum as 0, then use a while loop to compute the sum of 1 through 100. Finally, print sum with Scala's standard output function. Submit all of the above commands and the returned results as text in the answer box.

scala> var i=1
i: Int = 1

scala> var sum=0
sum: Int = 0

scala> while(i<=100)
| {
| sum+=i
| i+=1
| }

scala> println(sum)
5050

8. Log in to spark-shell, define i as 1 and sum as 0, then use a for loop to compute the sum of 1 through 100. Finally, print sum with Scala's standard output function. Submit all of the above commands and the returned results as text in the answer box.

scala> var i=1
i: Int = 1

scala> var sum=0
sum: Int = 0

scala> for(i<- 1 to 100) sum+=i

scala> println(sum)
5050

scala>
9. Every functional language provides the two functions map and flatMap. map, as its name suggests, takes a function, applies it to every element of a collection, and returns the collection of processed results. flatMap differs from map only in that the function it receives must return a collection for each element; those per-element collections are then flattened into a single one.
(1) Log in to spark-shell, define a list, and use map to multiply every element of the list by 2. Submit all of the above commands and results as text in the answer box.
(2) Log in to spark-shell, define a list, and use flatMap to split the list into individual letters converted to uppercase. Submit all of the above commands and results as text in the answer box.
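No answer transcript is given for either sub-question; a minimal plain-Scala sketch of both operations follows (the same calls work on an RDD created with sc.parallelize in spark-shell; the list contents are illustrative):

```scala
// (1) map: multiply every element of the list by 2.
val nums    = List(1, 2, 3, 4, 5)
val doubled = nums.map(_ * 2)
println(doubled)            // List(2, 4, 6, 8, 10)

// (2) flatMap: the function returns a collection per element
// (here a String, i.e. a sequence of Char), and the results
// are flattened into a single list of uppercase letters.
val words   = List("spark", "hadoop")
val letters = words.flatMap(_.toUpperCase)
println(letters)            // List(S, P, A, R, K, H, A, D, O, O, P)
```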

10. Log in to the master node of the big-data cloud host and create a new file abc.txt under the root directory with the content: hadoop hive solr redis kafka hadoop storm flume sqoop docker spark spark hadoop spark elasticsearch hbase hadoop hive spark hive hadoop spark. Then log in to spark-shell: first count the number of lines in abc.txt, next count the occurrences of each word in the document and sort them by the first letter of each word in ascending order, and finally count the number of result rows. Submit the above commands and results as text in the answer box.
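No answer transcript follows. In spark-shell the usual chain would be sc.textFile("/root/abc.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortByKey(), with .count for the line and row counts. A plain-Scala sketch of the same word-count logic (the file contents are inlined so it runs without Spark):

```scala
// Inline the contents of abc.txt (a single line in the original file).
val text  = "hadoop hive solr redis kafka hadoop storm flume sqoop docker " +
            "spark spark hadoop spark elasticsearch hbase hadoop hive spark hive hadoop spark"
val lines = List(text)
println(lines.size)          // line count: 1

// Word count: split into words, group, count, and sort by key
// (mirrors flatMap / map / reduceByKey / sortByKey on an RDD).
val counts = lines
  .flatMap(_.split(" "))
  .groupBy(identity)
  .map { case (word, occs) => (word, occs.size) }
  .toList
  .sortBy(_._1)

counts.foreach(println)
println(counts.size)         // number of distinct words
```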

11. Log in to spark-shell, define a list, and use Spark's built-in function to remove duplicates from the list.