Spark Shell: Running WordCount in Scala
2018-01-03 10:45
[root@master spark]# cd /usr/spark
[root@master spark]# ./bin/spark-shell
18/01/03 11:38:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/01/03 11:38:40 INFO spark.SecurityManager: Changing view acls to: root
18/01/03 11:38:40 INFO spark.SecurityManager: Changing modify acls to: root
18/01/03 11:38:40 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
18/01/03 11:38:40 INFO spark.HttpServer: Starting HTTP Server
18/01/03 11:38:40 INFO server.Server: jetty-8.y.z-SNAPSHOT
18/01/03 11:38:40 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:55442
18/01/03 11:38:40 INFO util.Utils: Successfully started service 'HTTP class server' on port 55442.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.3
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.
......
Spark context available as sc.

The spark-shell REPL has already initialized a SparkContext instance for us, named sc.
-- Read from the local filesystem; make sure the file exists on every node:
scala> val textFile = sc.textFile("file:///usr/spark/README.md")

-- Read from HDFS (hdfs://master:9000/user/root/README.md). First upload the local file to HDFS with: hadoop fs -put /usr/spark/README.md /user/root
scala> val textFile = sc.textFile("README.md")

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
18/01/03 10:43:04 INFO mapred.FileInputFormat: Total input paths to process : 1
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:29
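The flatMap/map/reduceByKey pipeline above is the heart of WordCount: split each line into words, pair every word with a 1, then sum the 1s per word. The same logic can be sketched on a plain Scala collection, with no Spark cluster required, to see what each stage produces. This is an illustrative stand-in (the object and method names below are mine, not part of the Spark session), and groupBy plays the role of reduceByKey's shuffle:

```scala
object LocalWordCount {
  // Mirrors the RDD pipeline on an in-memory Seq of lines.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))  // flatMap: one element per word
      .filter(_.nonEmpty)                // drop empty tokens from repeated spaces
      .map(word => (word, 1))            // map: (word, 1) pairs
      .groupBy(_._1)                     // local stand-in for reduceByKey's grouping
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // For input "a b a" / "b c" this yields a -> 2, b -> 2, c -> 1.
    println(count(Seq("a b a", "b c")))
  }
}
```

On an RDD, reduceByKey additionally pre-aggregates on each node before shuffling, which is why it is preferred over a groupBy-then-sum on real data.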
scala> wordCounts.collect()
18/01/03 10:43:16 INFO spark.SparkContext: Starting job: collect at <console>:32
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Registering RDD 3 (map at <console>:29)
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:32) with 2 output partitions
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at <console>:32)
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at <console>:29), which has no missing parents
18/01/03 10:43:16 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KB, free 517.3 MB)
18/01/03 10:43:16 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 517.3 MB)
18/01/03 10:43:16 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.50.131:33833 (size: 2.3 KB, free: 517.4 MB)
18/01/03 10:43:16 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
18/01/03 10:43:16 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at <console>:29)
18/01/03 10:43:16 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/01/03 10:43:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, slave2, partition 0,NODE_LOCAL, 2129 bytes)
18/01/03 10:43:16 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, slave3, partition 1,NODE_LOCAL, 2129 bytes)
18/01/03 10:43:17 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slave2:52849 (size: 2.3 KB, free: 146.2 MB)
18/01/03 10:43:17 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slave3:33238 (size: 2.3 KB, free: 146.2 MB)
18/01/03 10:43:17 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slave2:52849 (size: 12.8 KB, free: 146.2 MB)
18/01/03 10:43:17 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slave3:33238 (size: 12.8 KB, free: 146.2 MB)
18/01/03 10:43:19 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3138 ms on slave3 (1/2)
18/01/03 10:43:20 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3497 ms on slave2 (2/2)
18/01/03 10:43:20 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (map at <console>:29) finished in 3.507 s
18/01/03 10:43:20 INFO scheduler.DAGScheduler: looking for newly runnable stages
18/01/03 10:43:20 INFO scheduler.DAGScheduler: running: Set()
18/01/03 10:43:20 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
18/01/03 10:43:20 INFO scheduler.DAGScheduler: failed: Set()
18/01/03 10:43:20 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at <console>:29), which has no missing parents
18/01/03 10:43:20 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/01/03 10:43:20 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 517.3 MB)
18/01/03 10:43:20 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1589.0 B, free 517.3 MB)
18/01/03 10:43:20 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.50.131:33833 (size: 1589.0 B, free: 517.4 MB)
18/01/03 10:43:20 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
18/01/03 10:43:20 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at <console>:29)
18/01/03 10:43:20 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
18/01/03 10:43:20 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, slave2, partition 0,NODE_LOCAL, 1894 bytes)
18/01/03 10:43:20 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, slave3, partition 1,NODE_LOCAL, 1894 bytes)
18/01/03 10:43:20 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on slave2:52849 (size: 1589.0 B, free: 146.2 MB)
18/01/03 10:43:20 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to slave2:40754
18/01/03 10:43:20 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 156 bytes
18/01/03 10:43:20 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on slave3:33238 (size: 1589.0 B, free: 146.2 MB)
18/01/03 10:43:20 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to slave3:36989
18/01/03 10:43:20 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 485 ms on slave2 (1/2)
18/01/03 10:43:20 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 539 ms on slave3 (2/2)
18/01/03 10:43:20 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/01/03 10:43:20 INFO scheduler.DAGScheduler: ResultStage 1 (collect at <console>:32) finished in 0.556 s
18/01/03 10:43:20 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:32, took 4.393500 s
res3: Array[(String, Int)] = Array((package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (cluster.,1), (its,1), ([run,1), (general,2), (have,1), (pre-built,1), (YARN,,1), (locally,2), (changed,1), (locally.,1), (sc.parallelize(1,1), (only,1), (several,1), (This,2), (basic,1), (Configuration,1), (learning,,1), (documentation,3), (first,1), (graph,1), (Hive,2), (["Specifying,1), ("yarn",1), (page](http://spark.apache.org/documentation.html),1), ([params]`.,1), ([project,2), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (engine,1), (version,1), (file,1), (documentation,,1), (MASTER,1), (example,3), (are,1), (systems.,1), (params,1), (scala>,1), (DataFrames,,1), (provides,1), (refer,2)...

scala>
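As the res3 array shows, collect() returns the (word, count) pairs in no particular order. To see the most frequent words first, the pairs need to be sorted by descending count before (or after) collecting; in the shell this could be done with something like wordCounts.sortBy(-_._2).collect(). The ranking step itself can be sketched Spark-free on a plain Scala Map (the object and names below are illustrative, not from the session above):

```scala
object TopWords {
  // Sort (word, count) pairs by descending count; ties broken alphabetically.
  def top(counts: Map[String, Int], n: Int): Seq[(String, Int)] =
    counts.toSeq.sortBy { case (word, c) => (-c, word) }.take(n)

  def main(args: Array[String]): Unit = {
    // Hypothetical counts resembling a README word distribution.
    val counts = Map("the" -> 21, "Spark" -> 14, "to" -> 14, "run" -> 7)
    println(top(counts, 3))
  }
}
```

Sorting a full RDD triggers another shuffle, so on large data it is cheaper to use rdd.takeOrdered-style top-N retrieval than to sort everything and collect.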