
SparkR Learning 4: Setting Up an R Language Environment in Eclipse

2016-04-20 13:12
More code is available at: https://github.com/xubo245/SparkLearning

1. Download

For Eclipse, the R integration plugin (StatET, hosted at walware.de) can be downloaded from:

http://download.walware.de/eclipse-4.3/

Page 5 of the book Learning R says to download it from http://www.walware.de/goto/statet, but I could not get that to work, so I am not sure whether that address is still usable.

2. Configuration

Follow reference [2]. Note: after opening "Run Configurations...", the new "R Console" entry is created from the list on the left side of the dialog.
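
Once the console launches, a quick sanity check confirms that it is bound to the intended R installation; this is only a minimal sketch, and any R commands will do:

# Run in the new "R Console" to verify the setup
R.version.string   # the R version this console is bound to
.libPaths()        # library paths visible from Eclipse
getwd()            # working directory used by the run configuration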

3. Running

3.1 hello R

# TODO: Add comment
#
# Author: xubo
###############################################################################

print("hello R by eclipse")


Output:

cat("Synch19482817557563\n");

R version 3.2.1 (2015-06-18) -- "World-Famous Astronaut"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> Synch19482817557563
> Sys.getpid()
[1] 5604
> cat("Synch19483617588263\n");
Synch19483617588263
> getwd()
[1] "D:/all/R"
> # TODO: Add comment
> #
> # Author: xubo
> ###############################################################################
>
>
> print("hello R")
[1] "hello R"
> source("D:/all/eclipse432/RTest/test1.R", echo=FALSE, encoding="UTF-8")
[1] "hello R by eclipse"
>


3.2 SparkR

3.2.1 JSON

# TODO: Add comment
#
# Author: xubo
###############################################################################
Sys.setenv(SPARK_HOME="D:/1win7/java/spark-1.5.2-bin-hadoop2.6")
# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R","lib"), .libPaths()))
library(SparkR)
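# Start a local SparkContext and create a SQLContext from it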
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
print("SparkR")
df <- createDataFrame(sqlContext, faithful)
head(df)
print(df)
people <- read.df(sqlContext, "D:/all/R/examples/src/main/resources/people.json", "json")
head(people)
print(people)
printSchema(people)
print("hello R by eclipse")


Output:

cat("Synch19482817557563\n");

R version 3.2.1 (2015-06-18) -- "World-Famous Astronaut"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R是自由软件,不带任何担保。
在某些条件下你可以将其自由散布。
用'license()'或'licence()'来看散布的详细条件。

R是个合作计划,有许多人为之做出了贡献.
用'contributors()'来看合作者的详细情况
用'citation()'会告诉你如何在出版物中正确地引用R或R程序包。

用'demo()'来看一些示范程序,用'help()'来阅读在线帮助文件,或
用'help.start()'通过HTML浏览器来看帮助文件。
用'q()'退出R.

> Synch19482817557563
> Sys.getpid()
[1] 5604
> cat("Synch19483617588263\n");
Synch19483617588263
> getwd()
[1] "D:/all/R"
> # TODO: Add comment
> #
> # Author: xubo
> ###############################################################################
>
>
> print("hello R")
[1] "hello R"
> source("D:/all/eclipse432/RTest/test1.R", echo=FALSE, encoding="UTF-8")
[1] "hello R by eclipse"
> source("D:/all/eclipse432/RTest/test1.R", echo=FALSE, encoding="UTF-8")

Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

filter, na.omit

The following objects are masked from 'package:base':

intersect, rbind, sample, subset, summary, table, transform

Launching java with spark-submit command D:/1win7/java/spark-1.5.2-bin-hadoop2.6/bin/spark-submit.cmd sparkr-shell C:\Users\xubo\AppData\Local\Temp\RtmpKSEIVI\backend_port15e45dd73fcc
log4j:WARN No appenders could be found for logger (io.netty.util.internal.logging.InternalLoggerFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/04/20 15:20:30 INFO SparkContext: Running Spark version 1.5.2
16/04/20 15:20:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/20 15:20:32 INFO SecurityManager: Changing view acls to: xubo
16/04/20 15:20:32 INFO SecurityManager: Changing modify acls to: xubo
16/04/20 15:20:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(xubo); users with modify permissions: Set(xubo)
16/04/20 15:20:36 INFO Slf4jLogger: Slf4jLogger started
16/04/20 15:20:37 INFO Remoting: Starting remoting
16/04/20 15:20:37 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@202.38.84.125:57831]
16/04/20 15:20:37 INFO Utils: Successfully started service 'sparkDriver' on port 57831.
16/04/20 15:20:37 INFO SparkEnv: Registering MapOutputTracker
16/04/20 15:20:38 INFO SparkEnv: Registering BlockManagerMaster
16/04/20 15:20:38 INFO DiskBlockManager: Created local directory at C:\Users\xubo\AppData\Local\Temp\blockmgr-256ba758-c75b-4f0b-9486-dbf927f96082
16/04/20 15:20:38 INFO MemoryStore: MemoryStore started with capacity 529.9 MB
16/04/20 15:20:38 INFO HttpFileServer: HTTP File server directory is C:\Users\xubo\AppData\Local\Temp\spark-4580ac0f-0627-436b-aa4d-2c5c77a39d81\httpd-d299ff42-579f-40ad-98ca-43f4576187ee
16/04/20 15:20:38 INFO HttpServer: Starting HTTP Server
16/04/20 15:20:38 INFO Utils: Successfully started service 'HTTP file server' on port 57836.
16/04/20 15:20:38 INFO SparkEnv: Registering OutputCommitCoordinator
16/04/20 15:20:39 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/04/20 15:20:39 INFO SparkUI: Started SparkUI at http://202.38.84.125:4040
16/04/20 15:20:40 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/04/20 15:20:40 INFO Executor: Starting executor ID driver on host localhost
16/04/20 15:20:40 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57845.
16/04/20 15:20:40 INFO NettyBlockTransferService: Server created on 57845
16/04/20 15:20:40 INFO BlockManagerMaster: Trying to register BlockManager
16/04/20 15:20:40 INFO BlockManagerMasterEndpoint: Registering block manager localhost:57845 with 529.9 MB RAM, BlockManagerId(driver, localhost, 57845)
16/04/20 15:20:40 INFO BlockManagerMaster: Registered BlockManager
[1] "SparkR"
16/04/20 15:20:43 INFO SparkContext: Starting job: collectPartitions at NativeMethodAccessorImpl.java:-2
16/04/20 15:20:43 INFO DAGScheduler: Got job 0 (collectPartitions at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/04/20 15:20:43 INFO DAGScheduler: Final stage: ResultStage 0(collectPartitions at NativeMethodAccessorImpl.java:-2)
16/04/20 15:20:43 INFO DAGScheduler: Parents of final stage: List()
16/04/20 15:20:43 INFO DAGScheduler: Missing parents: List()
16/04/20 15:20:43 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at RRDD.scala:454), which has no missing parents
16/04/20 15:20:44 INFO MemoryStore: ensureFreeSpace(1280) called with curMem=0, maxMem=555684986
16/04/20 15:20:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1280.0 B, free 529.9 MB)
16/04/20 15:20:44 INFO MemoryStore: ensureFreeSpace(854) called with curMem=1280, maxMem=555684986
16/04/20 15:20:44 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 854.0 B, free 529.9 MB)
16/04/20 15:20:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:57845 (size: 854.0 B, free: 529.9 MB)
16/04/20 15:20:44 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
16/04/20 15:20:44 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at RRDD.scala:454)
16/04/20 15:20:44 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/04/20 15:20:44 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 12983 bytes)
16/04/20 15:20:44 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/04/20 15:20:45 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 11877 bytes result sent to driver
16/04/20 15:20:45 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 320 ms on localhost (1/1)
16/04/20 15:20:45 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/04/20 15:20:45 INFO DAGScheduler: ResultStage 0 (collectPartitions at NativeMethodAccessorImpl.java:-2) finished in 0.372 s
16/04/20 15:20:45 INFO DAGScheduler: Job 0 finished: collectPartitions at NativeMethodAccessorImpl.java:-2, took 2.005717 s
16/04/20 15:20:47 INFO SparkContext: Starting job: dfToCols at NativeMethodAccessorImpl.java:-2
16/04/20 15:20:47 INFO DAGScheduler: Got job 1 (dfToCols at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/04/20 15:20:47 INFO DAGScheduler: Final stage: ResultStage 1(dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:20:47 INFO DAGScheduler: Parents of final stage: List()
16/04/20 15:20:47 INFO DAGScheduler: Missing parents: List()
16/04/20 15:20:47 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[4] at dfToCols at NativeMethodAccessorImpl.java:-2), which has no missing parents
16/04/20 15:20:47 INFO MemoryStore: ensureFreeSpace(8912) called with curMem=2134, maxMem=555684986
16/04/20 15:20:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.7 KB, free 529.9 MB)
16/04/20 15:20:47 INFO MemoryStore: ensureFreeSpace(3612) called with curMem=11046, maxMem=555684986
16/04/20 15:20:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.5 KB, free 529.9 MB)
16/04/20 15:20:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:57845 (size: 3.5 KB, free: 529.9 MB)
16/04/20 15:20:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
16/04/20 15:20:47 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[4] at dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:20:47 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/04/20 15:20:47 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 12983 bytes)
16/04/20 15:20:47 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/04/20 15:20:48 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1461 bytes result sent to driver
16/04/20 15:20:48 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 992 ms on localhost (1/1)
16/04/20 15:20:48 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/04/20 15:20:48 INFO DAGScheduler: ResultStage 1 (dfToCols at NativeMethodAccessorImpl.java:-2) finished in 1.002 s
16/04/20 15:20:48 INFO DAGScheduler: Job 1 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 1.049348 s
DataFrame[eruptions:double, waiting:double]
16/04/20 15:20:50 WARN : Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:482:722f:5976:ce1f%39, but we couldn't find any external IP address!
16/04/20 15:20:50 INFO JSONRelation: Listing file:/D:/all/R/examples/src/main/resources/people.json on driver
16/04/20 15:20:50 INFO MemoryStore: ensureFreeSpace(231232) called with curMem=14658, maxMem=555684986
16/04/20 15:20:50 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 225.8 KB, free 529.7 MB)
16/04/20 15:20:50 INFO MemoryStore: ensureFreeSpace(19908) called with curMem=245890, maxMem=555684986
16/04/20 15:20:50 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.4 KB, free 529.7 MB)
16/04/20 15:20:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:57845 (size: 19.4 KB, free: 529.9 MB)
16/04/20 15:20:50 INFO SparkContext: Created broadcast 2 from loadDF at NativeMethodAccessorImpl.java:-2
16/04/20 15:20:51 INFO FileInputFormat: Total input paths to process : 1
16/04/20 15:20:51 INFO SparkContext: Starting job: loadDF at NativeMethodAccessorImpl.java:-2
16/04/20 15:20:51 INFO DAGScheduler: Got job 2 (loadDF at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/04/20 15:20:51 INFO DAGScheduler: Final stage: ResultStage 2(loadDF at NativeMethodAccessorImpl.java:-2)
16/04/20 15:20:51 INFO DAGScheduler: Parents of final stage: List()
16/04/20 15:20:51 INFO DAGScheduler: Missing parents: List()
16/04/20 15:20:51 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[8] at loadDF at NativeMethodAccessorImpl.java:-2), which has no missing parents
16/04/20 15:20:51 INFO MemoryStore: ensureFreeSpace(4056) called with curMem=265798, maxMem=555684986
16/04/20 15:20:51 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 4.0 KB, free 529.7 MB)
16/04/20 15:20:51 INFO MemoryStore: ensureFreeSpace(2295) called with curMem=269854, maxMem=555684986
16/04/20 15:20:51 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.2 KB, free 529.7 MB)
16/04/20 15:20:51 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:57845 (size: 2.2 KB, free: 529.9 MB)
16/04/20 15:20:51 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:861
16/04/20 15:20:51 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[8] at loadDF at NativeMethodAccessorImpl.java:-2)
16/04/20 15:20:51 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
16/04/20 15:20:51 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 2163 bytes)
16/04/20 15:20:51 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
16/04/20 15:20:51 INFO HadoopRDD: Input split: file:/D:/all/R/examples/src/main/resources/people.json:0+73
16/04/20 15:20:51 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/04/20 15:20:51 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/04/20 15:20:51 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/04/20 15:20:51 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/04/20 15:20:51 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/04/20 15:20:51 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2845 bytes result sent to driver
16/04/20 15:20:51 INFO DAGScheduler: ResultStage 2 (loadDF at NativeMethodAccessorImpl.java:-2) finished in 0.418 s
16/04/20 15:20:51 INFO DAGScheduler: Job 2 finished: loadDF at NativeMethodAccessorImpl.java:-2, took 0.431294 s
16/04/20 15:20:51 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 417 ms on localhost (1/1)
16/04/20 15:20:51 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
16/04/20 15:20:51 INFO MemoryStore: ensureFreeSpace(88512) called with curMem=272149, maxMem=555684986
16/04/20 15:20:51 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 86.4 KB, free 529.6 MB)
16/04/20 15:20:51 INFO MemoryStore: ensureFreeSpace(19788) called with curMem=360661, maxMem=555684986
16/04/20 15:20:51 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 19.3 KB, free 529.6 MB)
16/04/20 15:20:51 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:57845 (size: 19.3 KB, free: 529.9 MB)
16/04/20 15:20:51 INFO SparkContext: Created broadcast 4 from dfToCols at NativeMethodAccessorImpl.java:-2
16/04/20 15:20:51 INFO MemoryStore: ensureFreeSpace(231232) called with curMem=380449, maxMem=555684986
16/04/20 15:20:51 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 225.8 KB, free 529.4 MB)
16/04/20 15:20:51 INFO MemoryStore: ensureFreeSpace(19908) called with curMem=611681, maxMem=555684986
16/04/20 15:20:51 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 19.4 KB, free 529.3 MB)
16/04/20 15:20:51 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:57845 (size: 19.4 KB, free: 529.9 MB)
16/04/20 15:20:51 INFO SparkContext: Created broadcast 5 from dfToCols at NativeMethodAccessorImpl.java:-2
16/04/20 15:20:51 INFO FileInputFormat: Total input paths to process : 1
16/04/20 15:20:51 INFO SparkContext: Starting job: dfToCols at NativeMethodAccessorImpl.java:-2
16/04/20 15:20:51 INFO DAGScheduler: Got job 3 (dfToCols at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/04/20 15:20:51 INFO DAGScheduler: Final stage: ResultStage 3(dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:20:51 INFO DAGScheduler: Parents of final stage: List()
16/04/20 15:20:51 INFO DAGScheduler: Missing parents: List()
16/04/20 15:20:51 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[12] at dfToCols at NativeMethodAccessorImpl.java:-2), which has no missing parents
16/04/20 15:20:51 INFO MemoryStore: ensureFreeSpace(3960) called with curMem=631589, maxMem=555684986
16/04/20 15:20:51 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.9 KB, free 529.3 MB)
16/04/20 15:20:51 INFO MemoryStore: ensureFreeSpace(2301) called with curMem=635549, maxMem=555684986
16/04/20 15:20:51 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 2.2 KB, free 529.3 MB)
16/04/20 15:20:51 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:57845 (size: 2.2 KB, free: 529.9 MB)
16/04/20 15:20:51 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
16/04/20 15:20:51 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[12] at dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:20:51 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
16/04/20 15:20:51 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, PROCESS_LOCAL, 2163 bytes)
16/04/20 15:20:51 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
16/04/20 15:20:52 INFO HadoopRDD: Input split: file:/D:/all/R/examples/src/main/resources/people.json:0+73
16/04/20 15:20:52 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 2508 bytes result sent to driver
16/04/20 15:20:52 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 15 ms on localhost (1/1)
16/04/20 15:20:52 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
16/04/20 15:20:52 INFO DAGScheduler: ResultStage 3 (dfToCols at NativeMethodAccessorImpl.java:-2) finished in 0.016 s
16/04/20 15:20:52 INFO DAGScheduler: Job 3 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 0.029673 s
DataFrame[age:bigint, name:string]
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
[1] "hello R by eclipse"
>
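
Building on the script above, the DataFrame loaded from people.json can also be queried with SQL by registering it as a temporary table. Below is a minimal sketch against the SparkR 1.5 API; the temp-table name "people" and the age filter are illustrative, not part of the original run:

# Register the JSON-backed DataFrame as a temp table, then query it via SQL
registerTempTable(people, "people")
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)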


3.2.2 Hive SQL

hiveContext <- sparkRHive.init(sc)
16/04/20 15:22:30 INFO HiveContext: Initializing execution hive, version 1.2.1
16/04/20 15:22:31 INFO ClientWrapper: Inspected Hadoop version: 2.6.0
16/04/20 15:22:31 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0
16/04/20 15:22:32 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/04/20 15:22:32 INFO ObjectStore: ObjectStore, initialize called
16/04/20 15:22:33 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/04/20 15:22:33 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/04/20 15:22:33 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/20 15:22:34 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/20 15:22:43 INFO BlockManagerInfo: Removed broadcast_6_piece0 on localhost:57845 in memory (size: 2.2 KB, free: 529.9 MB)
16/04/20 15:22:43 INFO ContextCleaner: Cleaned accumulator 4
16/04/20 15:22:43 INFO BlockManagerInfo: Removed broadcast_5_piece0 on localhost:57845 in memory (size: 19.4 KB, free: 529.9 MB)
16/04/20 15:22:43 INFO BlockManagerInfo: Removed broadcast_4_piece0 on localhost:57845 in memory (size: 19.3 KB, free: 529.9 MB)
16/04/20 15:22:43 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:57845 in memory (size: 2.2 KB, free: 529.9 MB)
16/04/20 15:22:43 INFO ContextCleaner: Cleaned accumulator 3
16/04/20 15:22:43 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:57845 in memory (size: 19.4 KB, free: 529.9 MB)
16/04/20 15:22:43 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:57845 in memory (size: 3.5 KB, free: 529.9 MB)
16/04/20 15:22:43 INFO ContextCleaner: Cleaned accumulator 2
16/04/20 15:22:43 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:57845 in memory (size: 854.0 B, free: 529.9 MB)
16/04/20 15:22:43 INFO ContextCleaner: Cleaned accumulator 1
16/04/20 15:22:43 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/04/20 15:22:45 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/04/20 15:22:45 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/04/20 15:22:52 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/04/20 15:22:52 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/04/20 15:22:54 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/04/20 15:22:54 INFO ObjectStore: Initialized ObjectStore
16/04/20 15:22:55 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/20 15:22:55 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/04/20 15:22:55 INFO HiveMetaStore: Added admin role in metastore
16/04/20 15:22:55 INFO HiveMetaStore: Added public role in metastore
16/04/20 15:22:56 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/04/20 15:22:56 INFO HiveMetaStore: 0: get_all_databases
16/04/20 15:22:56 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_all_databases
16/04/20 15:22:56 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/04/20 15:22:56 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_functions: db=default pat=*
16/04/20 15:22:56 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
16/04/20 15:22:58 INFO SessionState: Created HDFS directory: /tmp/hive/xubo
16/04/20 15:22:58 INFO SessionState: Created local directory: C:/Users/xubo/AppData/Local/Temp/02226f74-e30d-4aba-869d-47b1823116a9_resources
16/04/20 15:22:58 INFO SessionState: Created HDFS directory: /tmp/hive/xubo/02226f74-e30d-4aba-869d-47b1823116a9
16/04/20 15:22:58 INFO SessionState: Created local directory: C:/Users/xubo/AppData/Local/Temp/xubo/02226f74-e30d-4aba-869d-47b1823116a9
16/04/20 15:22:58 INFO SessionState: Created HDFS directory: /tmp/hive/xubo/02226f74-e30d-4aba-869d-47b1823116a9/_tmp_space.db
16/04/20 15:22:58 INFO HiveContext: default warehouse location is /user/hive/warehouse
16/04/20 15:22:59 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
16/04/20 15:22:59 INFO ClientWrapper: Inspected Hadoop version: 2.6.0
16/04/20 15:22:59 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0
16/04/20 15:23:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/20 15:23:00 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/04/20 15:23:00 INFO ObjectStore: ObjectStore, initialize called
16/04/20 15:23:00 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/04/20 15:23:00 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/04/20 15:23:00 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/20 15:23:00 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
16/04/20 15:23:19 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/04/20 15:23:20 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/04/20 15:23:20 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/04/20 15:23:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/04/20 15:23:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/04/20 15:23:33 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/04/20 15:23:33 INFO ObjectStore: Initialized ObjectStore
16/04/20 15:23:33 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/20 15:23:33 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/04/20 15:23:35 WARN : Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:482:722f:5976:ce1f%39, but we couldn't find any external IP address!
16/04/20 15:23:36 INFO HiveMetaStore: Added admin role in metastore
16/04/20 15:23:36 INFO HiveMetaStore: Added public role in metastore
16/04/20 15:23:37 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/04/20 15:23:37 INFO HiveMetaStore: 0: get_all_databases
16/04/20 15:23:37 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_all_databases
16/04/20 15:23:37 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/04/20 15:23:37 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_functions: db=default pat=*
16/04/20 15:23:37 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
16/04/20 15:23:38 INFO SessionState: Created local directory: C:/Users/xubo/AppData/Local/Temp/46ffda99-1cc4-43e8-bb0f-407d5c7997fd_resources
16/04/20 15:23:38 INFO SessionState: Created HDFS directory: /tmp/hive/xubo/46ffda99-1cc4-43e8-bb0f-407d5c7997fd
16/04/20 15:23:38 INFO SessionState: Created local directory: C:/Users/xubo/AppData/Local/Temp/xubo/46ffda99-1cc4-43e8-bb0f-407d5c7997fd
16/04/20 15:23:39 INFO SessionState: Created HDFS directory: /tmp/hive/xubo/46ffda99-1cc4-43e8-bb0f-407d5c7997fd/_tmp_space.db
>
16/04/20 15:23:39 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
16/04/20 15:23:40 INFO ParseDriver: Parse Completed
16/04/20 15:23:40 INFO PerfLogger: <PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:40 INFO PerfLogger: <PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:40 INFO PerfLogger: <PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:40 INFO PerfLogger: <PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:40 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
16/04/20 15:23:41 INFO ParseDriver: Parse Completed
16/04/20 15:23:41 INFO PerfLogger: </PERFLOG method=parse start=1461137020801 end=1461137021457 duration=656 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:41 INFO PerfLogger: <PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:41 INFO CalcitePlanner: Starting Semantic Analysis
16/04/20 15:23:41 INFO CalcitePlanner: Creating table default.src position=27
16/04/20 15:23:41 INFO HiveMetaStore: 0: get_table : db=default tbl=src
16/04/20 15:23:41 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_table : db=default tbl=src
16/04/20 15:23:41 INFO HiveMetaStore: 0: get_database: default
16/04/20 15:23:41 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_database: default
16/04/20 15:23:42 INFO Driver: Semantic Analysis Completed
16/04/20 15:23:42 INFO PerfLogger: </PERFLOG method=semanticAnalyze start=1461137021467 end=1461137022279 duration=812 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:42 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
16/04/20 15:23:42 INFO PerfLogger: </PERFLOG method=compile start=1461137020590 end=1461137022293 duration=1703 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:42 INFO Hive: Dumping metastore api call timing information for : compilation phase
16/04/20 15:23:42 INFO Hive: Total time spent in this metastore function was greater than 1000ms : getFunctions_(String, String, )=1434
16/04/20 15:23:42 INFO Driver: Concurrency mode is disabled, not creating a lock manager
16/04/20 15:23:42 INFO PerfLogger: <PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:42 INFO Driver: Starting command(queryId=xubo_20160420152340_15259627-7066-455b-9645-4b8b9835b7f8): CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
16/04/20 15:23:42 INFO PerfLogger: </PERFLOG method=TimeToSubmit start=1461137020590 end=1461137022324 duration=1734 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:42 INFO PerfLogger: <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:42 INFO PerfLogger: <PERFLOG method=task.DDL.Stage-0 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:42 INFO Driver: Starting task [Stage-0:DDL] in serial mode
16/04/20 15:23:42 INFO HiveMetaStore: 0: create_table: Table(tableName:src, dbName:default, owner:xubo, createTime:1461137022, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:int, comment:null), FieldSchema(name:value, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null), temporary:false)
16/04/20 15:23:42 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=create_table: Table(tableName:src, dbName:default, owner:xubo, createTime:1461137022, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:int, comment:null), FieldSchema(name:value, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null), temporary:false)
16/04/20 15:23:42 INFO FileUtils: Creating directory if it doesn't exist: file:/user/hive/warehouse/src
-chgrp: 'NT AUTHORITY\SYSTEM' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
16/04/20 15:23:43 INFO PerfLogger: </PERFLOG method=runTasks start=1461137022324 end=1461137023508 duration=1184 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:43 INFO PerfLogger: </PERFLOG method=Driver.execute start=1461137022294 end=1461137023508 duration=1214 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:43 INFO Driver: OK
16/04/20 15:23:43 INFO PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:43 INFO PerfLogger: </PERFLOG method=releaseLocks start=1461137023536 end=1461137023536 duration=0 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:43 INFO PerfLogger: </PERFLOG method=Driver.run start=1461137020590 end=1461137023536 duration=2946 from=org.apache.hadoop.hive.ql.Driver>
DataFrame[result:string]
> sql(hiveContext, "LOAD DATA LOCAL INPATH 'D:/all/R/examples/src/main/resources/kv1.txt' INTO TABLE src")
16/04/20 15:23:56 INFO ParseDriver: Parsing command: LOAD DATA LOCAL INPATH 'D:/all/R/examples/src/main/resources/kv1.txt' INTO TABLE src
16/04/20 15:23:56 INFO ParseDriver: Parse Completed
16/04/20 15:23:56 INFO PerfLogger: <PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:56 INFO PerfLogger: <PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:56 INFO PerfLogger: <PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:56 INFO PerfLogger: <PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:56 INFO ParseDriver: Parsing command: LOAD DATA LOCAL INPATH 'D:/all/R/examples/src/main/resources/kv1.txt' INTO TABLE src
16/04/20 15:23:56 INFO ParseDriver: Parse Completed
16/04/20 15:23:56 INFO PerfLogger: </PERFLOG method=parse start=1461137036812 end=1461137036814 duration=2 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:56 INFO PerfLogger: <PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:56 INFO HiveMetaStore: 0: get_table : db=default tbl=src
16/04/20 15:23:56 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_table : db=default tbl=src
16/04/20 15:23:57 INFO Driver: Semantic Analysis Completed
16/04/20 15:23:57 INFO PerfLogger: </PERFLOG method=semanticAnalyze start=1461137036814 end=1461137037392 duration=578 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:57 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
16/04/20 15:23:57 INFO PerfLogger: </PERFLOG method=compile start=1461137036811 end=1461137037392 duration=581 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:57 INFO Driver: Concurrency mode is disabled, not creating a lock manager
16/04/20 15:23:57 INFO PerfLogger: <PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:57 INFO Driver: Starting command(queryId=xubo_20160420152356_a21d3870-edc2-4753-b114-c612c025c7f0): LOAD DATA LOCAL INPATH 'D:/all/R/examples/src/main/resources/kv1.txt' INTO TABLE src
16/04/20 15:23:57 INFO PerfLogger: </PERFLOG method=TimeToSubmit start=1461137036811 end=1461137037392 duration=581 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:57 INFO PerfLogger: <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:57 INFO PerfLogger: <PERFLOG method=task.MOVE.Stage-0 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:57 INFO Driver: Starting task [Stage-0:MOVE] in serial mode
16/04/20 15:23:57 INFO Task: Loading data to table default.src from file:/D:/all/R/examples/src/main/resources/kv1.txt
16/04/20 15:23:57 INFO HiveMetaStore: 0: get_table : db=default tbl=src
16/04/20 15:23:57 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_table : db=default tbl=src
16/04/20 15:23:57 INFO HiveMetaStore: 0: get_table : db=default tbl=src
16/04/20 15:23:57 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_table : db=default tbl=src
16/04/20 15:23:57 INFO SessionState: Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.
16/04/20 15:23:57 INFO Hive: Renaming src: file:/D:/all/R/examples/src/main/resources/kv1.txt, dest: file:/user/hive/warehouse/src/kv1.txt, Status:true
-chgrp: 'xubo-PC\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
16/04/20 15:23:57 INFO HiveMetaStore: 0: alter_table: db=default tbl=src newtbl=src
16/04/20 15:23:57 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=alter_table: db=default tbl=src newtbl=src
16/04/20 15:23:57 INFO log: Updating table stats fast for src
16/04/20 15:23:57 INFO log: Updated size of table src to 5812
16/04/20 15:23:58 INFO PerfLogger: <PERFLOG method=task.STATS.Stage-1 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:58 INFO Driver: Starting task [Stage-1:STATS] in serial mode
16/04/20 15:23:58 INFO StatsTask: Executing stats task
16/04/20 15:23:58 INFO HiveMetaStore: 0: get_table : db=default tbl=src
16/04/20 15:23:58 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_table : db=default tbl=src
16/04/20 15:23:58 INFO HiveMetaStore: 0: get_table : db=default tbl=src
16/04/20 15:23:58 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_table : db=default tbl=src
16/04/20 15:23:58 INFO HiveMetaStore: 0: alter_table: db=default tbl=src newtbl=src
16/04/20 15:23:58 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=alter_table: db=default tbl=src newtbl=src
16/04/20 15:23:58 INFO log: Updating table stats fast for src
16/04/20 15:23:58 INFO log: Updated size of table src to 5812
16/04/20 15:23:58 INFO Task: Table default.src stats: [numFiles=1, totalSize=5812]
16/04/20 15:23:58 INFO PerfLogger: </PERFLOG method=runTasks start=1461137037402 end=1461137038127 duration=725 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:58 INFO PerfLogger: </PERFLOG method=Driver.execute start=1461137037392 end=1461137038127 duration=735 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:58 INFO Driver: OK
16/04/20 15:23:58 INFO PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:58 INFO PerfLogger: </PERFLOG method=releaseLocks start=1461137038127 end=1461137038127 duration=0 from=org.apache.hadoop.hive.ql.Driver>
16/04/20 15:23:58 INFO PerfLogger: </PERFLOG method=Driver.run start=1461137036811 end=1461137038127 duration=1316 from=org.apache.hadoop.hive.ql.Driver>
DataFrame[result:string]
> results <- sql(hiveContext, "FROM src SELECT key, value")
16/04/20 15:24:04 INFO ParseDriver: Parsing command: FROM src SELECT key, value
16/04/20 15:24:04 INFO ParseDriver: Parse Completed
16/04/20 15:24:04 INFO HiveMetaStore: 0: get_table : db=default tbl=src
16/04/20 15:24:04 INFO audit: ugi=xubo	ip=unknown-ip-addr	cmd=get_table : db=default tbl=src
> head(results)
16/04/20 15:24:21 INFO MemoryStore: ensureFreeSpace(448800) called with curMem=0, maxMem=555684986
16/04/20 15:24:21 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 438.3 KB, free 529.5 MB)
16/04/20 15:24:21 INFO MemoryStore: ensureFreeSpace(42638) called with curMem=448800, maxMem=555684986
16/04/20 15:24:21 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 41.6 KB, free 529.5 MB)
16/04/20 15:24:21 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:57845 (size: 41.6 KB, free: 529.9 MB)
16/04/20 15:24:21 INFO SparkContext: Created broadcast 7 from dfToCols at NativeMethodAccessorImpl.java:-2
16/04/20 15:24:22 INFO FileInputFormat: Total input paths to process : 1
16/04/20 15:24:22 INFO SparkContext: Starting job: dfToCols at NativeMethodAccessorImpl.java:-2
16/04/20 15:24:22 INFO DAGScheduler: Got job 4 (dfToCols at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/04/20 15:24:22 INFO DAGScheduler: Final stage: ResultStage 4(dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:24:22 INFO DAGScheduler: Parents of final stage: List()
16/04/20 15:24:22 INFO DAGScheduler: Missing parents: List()
16/04/20 15:24:22 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[18] at dfToCols at NativeMethodAccessorImpl.java:-2), which has no missing parents
16/04/20 15:24:22 INFO MemoryStore: ensureFreeSpace(5992) called with curMem=491438, maxMem=555684986
16/04/20 15:24:22 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 5.9 KB, free 529.5 MB)
16/04/20 15:24:22 INFO MemoryStore: ensureFreeSpace(3373) called with curMem=497430, maxMem=555684986
16/04/20 15:24:22 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 3.3 KB, free 529.5 MB)
16/04/20 15:24:22 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:57845 (size: 3.3 KB, free: 529.9 MB)
16/04/20 15:24:22 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:861
16/04/20 15:24:22 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[18] at dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:24:22 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
16/04/20 15:24:22 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, PROCESS_LOCAL, 2146 bytes)
16/04/20 15:24:22 INFO Executor: Running task 0.0 in stage 4.0 (TID 4)
16/04/20 15:24:22 INFO HadoopRDD: Input split: file:/user/hive/warehouse/src/kv1.txt:0+5812
16/04/20 15:24:22 INFO Executor: Finished task 0.0 in stage 4.0 (TID 4). 2652 bytes result sent to driver
16/04/20 15:24:22 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 66 ms on localhost (1/1)
16/04/20 15:24:22 INFO DAGScheduler: ResultStage 4 (dfToCols at NativeMethodAccessorImpl.java:-2) finished in 0.067 s
16/04/20 15:24:22 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
16/04/20 15:24:22 INFO DAGScheduler: Job 4 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 0.080229 s
key   value
1 238 val_238
2  86  val_86
3 311 val_311
4  27  val_27
5 165 val_165
6 409 val_409
>
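
The commands above were entered interactively; collected into one script, the Hive SQL example looks like this (a sketch that assumes the same sc from section 3.2.1 and the same local kv1.txt path):

# Create a HiveContext, load kv1.txt into a Hive table, and query it
hiveContext <- sparkRHive.init(sc)
sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql(hiveContext, "LOAD DATA LOCAL INPATH 'D:/all/R/examples/src/main/resources/kv1.txt' INTO TABLE src")
results <- sql(hiveContext, "FROM src SELECT key, value")
head(results)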


3.2.3 DataFrame

df <- createDataFrame(sqlContext, faithful)
16/04/20 15:25:51 INFO SparkContext: Starting job: collectPartitions at NativeMethodAccessorImpl.java:-2
16/04/20 15:25:51 INFO DAGScheduler: Got job 5 (collectPartitions at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/04/20 15:25:51 INFO DAGScheduler: Final stage: ResultStage 5(collectPartitions at NativeMethodAccessorImpl.java:-2)
16/04/20 15:25:51 INFO DAGScheduler: Parents of final stage: List()
16/04/20 15:25:51 INFO DAGScheduler: Missing parents: List()
16/04/20 15:25:51 INFO DAGScheduler: Submitting ResultStage 5 (ParallelCollectionRDD[19] at parallelize at RRDD.scala:454), which has no missing parents
16/04/20 15:25:51 INFO MemoryStore: ensureFreeSpace(1280) called with curMem=500803, maxMem=555684986
16/04/20 15:25:51 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 1280.0 B, free 529.5 MB)
16/04/20 15:25:51 INFO MemoryStore: ensureFreeSpace(854) called with curMem=502083, maxMem=555684986
16/04/20 15:25:51 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 854.0 B, free 529.5 MB)
16/04/20 15:25:51 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:57845 (size: 854.0 B, free: 529.9 MB)
16/04/20 15:25:51 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:861
16/04/20 15:25:51 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (ParallelCollectionRDD[19] at parallelize at RRDD.scala:454)
16/04/20 15:25:51 INFO TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
16/04/20 15:25:51 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, localhost, PROCESS_LOCAL, 12983 bytes)
16/04/20 15:25:51 INFO Executor: Running task 0.0 in stage 5.0 (TID 5)
16/04/20 15:25:51 INFO Executor: Finished task 0.0 in stage 5.0 (TID 5). 11877 bytes result sent to driver
16/04/20 15:25:51 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 5) in 7 ms on localhost (1/1)
16/04/20 15:25:51 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
16/04/20 15:25:51 INFO DAGScheduler: ResultStage 5 (collectPartitions at NativeMethodAccessorImpl.java:-2) finished in 0.008 s
16/04/20 15:25:51 INFO DAGScheduler: Job 5 finished: collectPartitions at NativeMethodAccessorImpl.java:-2, took 0.016186 s
> df
DataFrame[eruptions:double, waiting:double]
> head(select(df, df$eruptions))
16/04/20 15:26:07 INFO SparkContext: Starting job: dfToCols at NativeMethodAccessorImpl.java:-2
16/04/20 15:26:07 INFO DAGScheduler: Got job 6 (dfToCols at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/04/20 15:26:07 INFO DAGScheduler: Final stage: ResultStage 6(dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:26:07 INFO DAGScheduler: Parents of final stage: List()
16/04/20 15:26:07 INFO DAGScheduler: Missing parents: List()
16/04/20 15:26:07 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[25] at dfToCols at NativeMethodAccessorImpl.java:-2), which has no missing parents
16/04/20 15:26:07 INFO MemoryStore: ensureFreeSpace(11368) called with curMem=502937, maxMem=555684986
16/04/20 15:26:07 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 11.1 KB, free 529.5 MB)
16/04/20 15:26:07 INFO MemoryStore: ensureFreeSpace(4805) called with curMem=514305, maxMem=555684986
16/04/20 15:26:07 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 4.7 KB, free 529.4 MB)
16/04/20 15:26:07 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:57845 (size: 4.7 KB, free: 529.9 MB)
16/04/20 15:26:07 INFO SparkContext: Created broadcast 10 from broadcast at DAGScheduler.scala:861
16/04/20 15:26:07 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 6 (MapPartitionsRDD[25] at dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:26:07 INFO TaskSchedulerImpl: Adding task set 6.0 with 1 tasks
16/04/20 15:26:07 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 6, localhost, PROCESS_LOCAL, 12983 bytes)
16/04/20 15:26:07 INFO Executor: Running task 0.0 in stage 6.0 (TID 6)
16/04/20 15:26:08 INFO GenerateUnsafeProjection: Code generated in 297.30299 ms
16/04/20 15:26:08 INFO GenerateSafeProjection: Code generated in 10.540036 ms
16/04/20 15:26:08 INFO Executor: Finished task 0.0 in stage 6.0 (TID 6). 1522 bytes result sent to driver
16/04/20 15:26:08 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 6) in 1403 ms on localhost (1/1)
16/04/20 15:26:08 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
16/04/20 15:26:08 INFO DAGScheduler: ResultStage 6 (dfToCols at NativeMethodAccessorImpl.java:-2) finished in 1.404 s
16/04/20 15:26:08 INFO DAGScheduler: Job 6 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 1.407687 s
eruptions
1     3.600
2     1.800
3     3.333
4     2.283
5     4.533
6     2.883
>
> # You can also pass in column name as strings
> head(select(df, "eruptions"))
16/04/20 15:26:21 INFO SparkContext: Starting job: dfToCols at NativeMethodAccessorImpl.java:-2
16/04/20 15:26:21 INFO DAGScheduler: Got job 7 (dfToCols at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/04/20 15:26:21 INFO DAGScheduler: Final stage: ResultStage 7(dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:26:21 INFO DAGScheduler: Parents of final stage: List()
16/04/20 15:26:21 INFO DAGScheduler: Missing parents: List()
16/04/20 15:26:21 INFO DAGScheduler: Submitting ResultStage 7 (MapPartitionsRDD[28] at dfToCols at NativeMethodAccessorImpl.java:-2), which has no missing parents
16/04/20 15:26:21 INFO MemoryStore: ensureFreeSpace(11368) called with curMem=519110, maxMem=555684986
16/04/20 15:26:21 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 11.1 KB, free 529.4 MB)
16/04/20 15:26:21 INFO MemoryStore: ensureFreeSpace(4801) called with curMem=530478, maxMem=555684986
16/04/20 15:26:21 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 4.7 KB, free 529.4 MB)
16/04/20 15:26:21 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on localhost:57845 (size: 4.7 KB, free: 529.9 MB)
16/04/20 15:26:21 INFO SparkContext: Created broadcast 11 from broadcast at DAGScheduler.scala:861
16/04/20 15:26:21 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 7 (MapPartitionsRDD[28] at dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:26:21 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
16/04/20 15:26:21 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 7, localhost, PROCESS_LOCAL, 12983 bytes)
16/04/20 15:26:21 INFO Executor: Running task 0.0 in stage 7.0 (TID 7)
16/04/20 15:26:22 INFO Executor: Finished task 0.0 in stage 7.0 (TID 7). 1522 bytes result sent to driver
16/04/20 15:26:22 INFO DAGScheduler: ResultStage 7 (dfToCols at NativeMethodAccessorImpl.java:-2) finished in 0.874 s
16/04/20 15:26:22 INFO DAGScheduler: Job 7 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 0.895538 s
16/04/20 15:26:22 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 7) in 872 ms on localhost (1/1)
16/04/20 15:26:22 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool
eruptions
1     3.600
2     1.800
3     3.333
4     2.283
5     4.533
6     2.883
> # Filter the DataFrame to only retain rows with wait times shorter than 50 mins
> head(filter(df, df$waiting < 50))
16/04/20 15:26:41 INFO SparkContext: Starting job: dfToCols at NativeMethodAccessorImpl.java:-2
16/04/20 15:26:41 INFO DAGScheduler: Got job 8 (dfToCols at NativeMethodAccessorImpl.java:-2) with 1 output partitions
16/04/20 15:26:41 INFO DAGScheduler: Final stage: ResultStage 8(dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:26:41 INFO DAGScheduler: Parents of final stage: List()
16/04/20 15:26:41 INFO DAGScheduler: Missing parents: List()
16/04/20 15:26:41 INFO DAGScheduler: Submitting ResultStage 8 (MapPartitionsRDD[30] at dfToCols at NativeMethodAccessorImpl.java:-2), which has no missing parents
16/04/20 15:26:41 INFO MemoryStore: ensureFreeSpace(11368) called with curMem=535279, maxMem=555684986
16/04/20 15:26:41 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 11.1 KB, free 529.4 MB)
16/04/20 15:26:41 INFO MemoryStore: ensureFreeSpace(4835) called with curMem=546647, maxMem=555684986
16/04/20 15:26:41 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 4.7 KB, free 529.4 MB)
16/04/20 15:26:41 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on localhost:57845 (size: 4.7 KB, free: 529.9 MB)
16/04/20 15:26:41 INFO SparkContext: Created broadcast 12 from broadcast at DAGScheduler.scala:861
16/04/20 15:26:41 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 8 (MapPartitionsRDD[30] at dfToCols at NativeMethodAccessorImpl.java:-2)
16/04/20 15:26:41 INFO TaskSchedulerImpl: Adding task set 8.0 with 1 tasks
16/04/20 15:26:41 INFO BlockManagerInfo: Removed broadcast_11_piece0 on localhost:57845 in memory (size: 4.7 KB, free: 529.9 MB)
16/04/20 15:26:41 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 8, localhost, PROCESS_LOCAL, 12983 bytes)
16/04/20 15:26:41 INFO Executor: Running task 0.0 in stage 8.0 (TID 8)
16/04/20 15:26:41 INFO ContextCleaner: Cleaned accumulator 11
16/04/20 15:26:41 INFO BlockManagerInfo: Removed broadcast_10_piece0 on localhost:57845 in memory (size: 4.7 KB, free: 529.9 MB)
16/04/20 15:26:41 INFO ContextCleaner: Cleaned accumulator 9
16/04/20 15:26:41 INFO BlockManagerInfo: Removed broadcast_9_piece0 on localhost:57845 in memory (size: 854.0 B, free: 529.9 MB)
16/04/20 15:26:41 INFO ContextCleaner: Cleaned accumulator 6
16/04/20 15:26:41 INFO BlockManagerInfo: Removed broadcast_8_piece0 on localhost:57845 in memory (size: 3.3 KB, free: 529.9 MB)
16/04/20 15:26:41 INFO ContextCleaner: Cleaned accumulator 5
16/04/20 15:26:41 INFO BlockManagerInfo: Removed broadcast_7_piece0 on localhost:57845 in memory (size: 41.6 KB, free: 529.9 MB)
16/04/20 15:26:42 INFO GeneratePredicate: Code generated in 48.660598 ms
16/04/20 15:26:42 INFO Executor: Finished task 0.0 in stage 8.0 (TID 8). 1675 bytes result sent to driver
16/04/20 15:26:42 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 8) in 1000 ms on localhost (1/1)
16/04/20 15:26:42 INFO DAGScheduler: ResultStage 8 (dfToCols at NativeMethodAccessorImpl.java:-2) finished in 1.000 s
16/04/20 15:26:42 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
16/04/20 15:26:42 INFO DAGScheduler: Job 8 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 1.108020 s
eruptions waiting
1     1.750      47
2     1.750      47
3     1.867      48
4     1.750      48
5     2.167      48
6     2.100      49
>
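
Beyond select and filter, SparkR 1.5 DataFrames also support grouping and aggregation. A minimal sketch on the same faithful data (the column names match the output above):

# Count the rows for each waiting time, then sort by count, descending
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))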


References:

[1] Learning R
[2] http://f.dataguru.cn/forum.php?mod=viewthread&tid=103480