
Spark Shell: Running WordCount in Scala

2018-01-03 10:45
[root@master spark]#  cd /usr/spark
[root@master spark]# ./bin/spark-shell
18/01/03 11:38:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/01/03 11:38:40 INFO spark.SecurityManager: Changing view acls to: root
18/01/03 11:38:40 INFO spark.SecurityManager: Changing modify acls to: root
18/01/03 11:38:40 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
18/01/03 11:38:40 INFO spark.HttpServer: Starting HTTP Server
18/01/03 11:38:40 INFO server.Server: jetty-8.y.z-SNAPSHOT
18/01/03 11:38:40 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:55442
18/01/03 11:38:40 INFO util.Utils: Successfully started service 'HTTP class server' on port 55442.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.3
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.
......
Spark context available as sc.
The interpreter has already initialized a SparkContext instance for us, named sc.
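Before loading any data it is worth confirming the context works. A minimal check (no output is reproduced here; the comments describe what each call returns):

scala> sc.version   // the Spark version string, 1.6.3 on this cluster
scala> sc.master    // the master URL this shell is connected to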


-- Read from the local filesystem (the file must exist at this path on every worker node):
scala> val textFile = sc.textFile("file:///usr/spark/README.md")
-- Or read from HDFS (hdfs://master:9000/user/root/README.md). Upload the local file first with: hadoop fs -put /usr/spark/README.md /user/root
scala> val textFile = sc.textFile("README.md")
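Either way, a quick sanity check confirms the RDD can actually read the file (the exact values depend on your copy of README.md):

scala> textFile.count()   // number of lines in the file
scala> textFile.first()   // the first line of the file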
scala> val wordCounts = textFile.flatMap(line=>line.split(" ")).map(word=>(word,1)).reduceByKey((a,b)=>a+b)
18/01/03 10:43:04 INFO mapred.FileInputFormat: Total input paths to process : 1
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:29
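The one-liner chains three transformations. The same pipeline written step by step, with the intermediate names words and pairs chosen purely for illustration:

scala> val words = textFile.flatMap(line => line.split(" "))   // one element per word
scala> val pairs = words.map(word => (word, 1))                // pair each word with the count 1
scala> val counts = pairs.reduceByKey((a, b) => a + b)         // sum the counts for each word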


scala> wordCounts.collect()
18/01/03 10:43:16 INFO spark.SparkContext: Starting job: collect at <console>:32
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Registering RDD 3 (map at <console>:29)
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:32) with 2 output partitions
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at <console>:32)
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at <console>:29), which has no missing parents
18/01/03 10:43:16 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KB, free 517.3 MB)
18/01/03 10:43:16 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 517.3 MB)
18/01/03 10:43:16 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.50.131:33833 (size: 2.3 KB, free: 517.4 MB)
18/01/03 10:43:16 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at <console>:29)
18/01/03 10:43:16 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/01/03 10:43:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, slave2, partition 0,NODE_LOCAL, 2129 bytes)
18/01/03 10:43:16 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, slave3, partition 1,NODE_LOCAL, 2129 bytes)
18/01/03 10:43:17 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slave2:52849 (size: 2.3 KB, free: 146.2 MB)
18/01/03 10:43:17 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slave3:33238 (size: 2.3 KB, free: 146.2 MB)
18/01/03 10:43:17 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slave2:52849 (size: 12.8 KB, free: 146.2 MB)
18/01/03 10:43:17 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slave3:33238 (size: 12.8 KB, free: 146.2 MB)
18/01/03 10:43:19 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3138 ms on slave3 (1/2)
18/01/03 10:43:20 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3497 ms on slave2 (2/2)
18/01/03 10:43:20 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (map at <console>:29) finished in 3.507 s
18/01/03 10:43:20 INFO scheduler.DAGScheduler: looking for newly runnable stages
18/01/03 10:43:20 INFO scheduler.DAGScheduler: running: Set()
18/01/03 10:43:20 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
18/01/03 10:43:20 INFO scheduler.DAGScheduler: failed: Set()
18/01/03 10:43:20 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at <console>:29), which has no missing parents
18/01/03 10:43:20 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/01/03 10:43:20 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 517.3 MB)
18/01/03 10:43:20 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1589.0 B, free 517.3 MB)
18/01/03 10:43:20 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.50.131:33833 (size: 1589.0 B, free: 517.4 MB)
18/01/03 10:43:20 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
18/01/03 10:43:20 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at <console>:29)
18/01/03 10:43:20 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
18/01/03 10:43:20 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, slave2, partition 0,NODE_LOCAL, 1894 bytes)
18/01/03 10:43:20 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, slave3, partition 1,NODE_LOCAL, 1894 bytes)
18/01/03 10:43:20 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on slave2:52849 (size: 1589.0 B, free: 146.2 MB)
18/01/03 10:43:20 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to slave2:40754
18/01/03 10:43:20 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 156 bytes
18/01/03 10:43:20 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on slave3:33238 (size: 1589.0 B, free: 146.2 MB)
18/01/03 10:43:20 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to slave3:36989
18/01/03 10:43:20 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 485 ms on slave2 (1/2)
18/01/03 10:43:20 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 539 ms on slave3 (2/2)
18/01/03 10:43:20 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/01/03 10:43:20 INFO scheduler.DAGScheduler: ResultStage 1 (collect at <console>:32) finished in 0.556 s
18/01/03 10:43:20 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:32, took 4.393500 s
res3: Array[(String, Int)] = Array((package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (cluster.,1), (its,1), ([run,1), (general,2), (have,1), (pre-built,1), (YARN,,1), (locally,2), (changed,1), (locally.,1), (sc.parallelize(1,1), (only,1), (several,1), (This,2), (basic,1), (Configuration,1), (learning,,1), (documentation,3), (first,1), (graph,1), (Hive,2), (["Specifying,1), ("yarn",1), (page](http://spark.apache.org/documentation.html),1), ([params]`.,1), ([project,2), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (engine,1), (version,1), (file,1), (documentation,,1), (MASTER,1), (example,3), (are,1), (systems.,1), (params,1), (scala>,1), (DataFrames,,1), (provides,1), (refer,2)...
scala>
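collect() returns the pairs in no particular order. To see the most frequent words, or to persist the result instead of printing it, something along these lines should work (the output path "wordcount-output" is only an example; on a cluster it resolves against HDFS):

scala> wordCounts.takeOrdered(10)(Ordering.by(-_._2))   // top 10 words by descending count
scala> wordCounts.sortBy(_._2, ascending = false).saveAsTextFile("wordcount-output")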