Spark experience notes: client and cluster deployment
2017-12-12 16:03
Spark run modes: cluster and client
I. How the deploy modes work
When you run SparkSubmit --class [mainClass], SparkSubmit calls a childMainClass, which is one of the following:
1. client mode: childMainClass = mainClass
2. standalone cluster mode: childMainClass = org.apache.spark.deploy.Client
3. yarn cluster mode: childMainClass = org.apache.spark.deploy.yarn.Client
In cluster mode the childMainClass is a wrapper around mainClass. SparkSubmit calls the childMainClass, which then talks to the cluster and launches a process on one worker to run the mainClass. In client mode there is no wrapper, and SparkSubmit runs the mainClass directly.
P.S. Use "spark-submit -v" to print debug information, including the childMainClass chosen for each mode.
Yarn client: spark-submit -v --class "org.apache.spark.examples.JavaWordCount" --master yarn JavaWordCount.jar
childMainClass: org.apache.spark.examples.JavaWordCount
Yarn cluster: spark-submit -v --class "org.apache.spark.examples.JavaWordCount" --master yarn-cluster JavaWordCount.jar
childMainClass: org.apache.spark.deploy.yarn.Client
Standalone client: spark-submit -v --class "org.apache.spark.examples.JavaWordCount" --master spark://aa01:7077 JavaWordCount.jar
childMainClass: org.apache.spark.examples.JavaWordCount
Standalone cluster: spark-submit -v --class "org.apache.spark.examples.JavaWordCount" --master spark://aa01:7077 --deploy-mode cluster JavaWordCount.jar
childMainClass: org.apache.spark.deploy.rest.RestSubmissionClient (when the REST submission path is used; otherwise org.apache.spark.deploy.Client)
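The mapping above can be summarized as a small lookup. The following is only a sketch of the rules as listed in this post, not a copy of the SparkSubmit source; the function name child_main_class and the use_rest flag are made up for illustration.

```python
def child_main_class(main_class, master, deploy_mode="client", use_rest=True):
    """Return the class SparkSubmit would invoke, following the list above.

    Hypothetical helper: it mirrors the four cases printed by
    "spark-submit -v" in this post and nothing more.
    """
    # The legacy master values "yarn-cluster" / "yarn-client" carry the deploy mode.
    if master == "yarn-cluster":
        master, deploy_mode = "yarn", "cluster"
    elif master == "yarn-client":
        master, deploy_mode = "yarn", "client"

    if deploy_mode == "client":
        # Client mode: no wrapper, the user's main class runs directly.
        return main_class
    if master == "yarn":
        return "org.apache.spark.deploy.yarn.Client"
    if master.startswith("spark://"):
        if use_rest:
            return "org.apache.spark.deploy.rest.RestSubmissionClient"
        return "org.apache.spark.deploy.Client"
    raise ValueError(f"no cluster-mode mapping listed for master {master!r}")


if __name__ == "__main__":
    word_count = "org.apache.spark.examples.JavaWordCount"
    for master, mode in [("yarn", "client"), ("yarn-cluster", "client"),
                         ("spark://aa01:7077", "client"),
                         ("spark://aa01:7077", "cluster")]:
        print(master, mode, "->", child_main_class(word_count, master, mode))
```

Running it prints one line per submission variant, which can be compared against the childMainClass values reported by "spark-submit -v".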
Taking standalone Spark as an example: in the client mode workflow, the main class runs in the driver application, which may reside outside the cluster.
In cluster mode, SparkSubmit registers the driver with the cluster, and a driver process is launched on one worker to run the main class.
There are also two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
Cluster deploy mode is not applicable to Spark shells.
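To make the difference concrete, here is a rough sketch of where the driver ends up in each combination described above. The function name driver_location and the returned labels are invented for this illustration; only the rules themselves come from the text.

```python
def driver_location(master, deploy_mode="client", is_shell=False):
    """Describe where the driver runs, per the rules stated above.

    Hypothetical helper; master is expected to be "yarn" or a
    "spark://host:port" standalone URL.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    if is_shell and deploy_mode == "cluster":
        # Interactive shells need the driver attached to the local console.
        raise ValueError("cluster deploy mode is not applicable to Spark shells")
    if deploy_mode == "client":
        return "client process"
    if master == "yarn":
        return "YARN application master"
    if master.startswith("spark://"):
        return "worker node"
    raise ValueError(f"unsupported master: {master!r}")


if __name__ == "__main__":
    print(driver_location("yarn", "cluster"))
    print(driver_location("spark://aa01:7077", "cluster"))
    print(driver_location("yarn", "client", is_shell=True))
```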
II. Deployment notes
Bundling Your Application’s Dependencies
If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar, you can call the bin/spark-submit script while passing your jar.
For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files, we recommend packaging them into a .zip or .egg.
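For the Python case, the .zip can be built with the standard library alone. The sketch below assumes the helper modules live under a single source directory; bundle_py_files and the file names in the demo are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_py_files(source_dir, zip_path):
    """Pack every .py file under source_dir into zip_path.

    Paths inside the archive are relative to source_dir, so packages
    keep their directory layout when the zip is put on the Python path.
    """
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for py_file in sorted(source_dir.rglob("*.py")):
            archive.write(py_file, py_file.relative_to(source_dir).as_posix())
    return zip_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Build a throwaway package just to demonstrate the helper.
        src = Path(tmp) / "src"
        (src / "helpers").mkdir(parents=True)
        (src / "helpers" / "__init__.py").write_text("")
        (src / "helpers" / "text.py").write_text("def upper(s):\n    return s.upper()\n")
        bundle = bundle_py_files(src, Path(tmp) / "deps.zip")
        with zipfile.ZipFile(bundle) as archive:
            print(archive.namelist())
```

The resulting archive is what would be passed to spark-submit through --py-files.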
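Notes:
1. The project must be packaged as an assembly jar.
2. The Maven assembly plugin can be used for packaging.
3. If you develop in IDEA, you can also package the project with IDEA's build feature.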
Launching Applications with spark-submit
Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support the different cluster managers and deploy modes that Spark supports:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Some of the commonly used options are:
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)*
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any
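When submissions are scripted, the options above can be assembled programmatically. This is a minimal sketch, assuming the command is built as an argument list; build_spark_submit is a hypothetical helper, and the input file name in the demo is a placeholder.

```python
import shlex


def build_spark_submit(app_jar, main_class, master, deploy_mode="client",
                       conf=None, app_args=()):
    """Assemble a spark-submit argument list from the options listed above."""
    argv = ["./bin/spark-submit",
            "--class", main_class,
            "--master", master,
            "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        # Each property becomes one "key=value" argument, spaces included.
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_jar)
    argv.extend(str(arg) for arg in app_args)
    return argv


def to_command_line(argv):
    """Render the argument list as a shell command, quoting where needed."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    argv = build_spark_submit(
        "JavaWordCount.jar",
        "org.apache.spark.examples.JavaWordCount",
        "spark://aa01:7077",
        deploy_mode="cluster",
        conf={"spark.executor.extraJavaOptions":
              "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"},
        app_args=["input.txt"],  # placeholder application argument
    )
    print(to_command_line(argv))
```

Keeping the command as a list until the last moment means a --conf value containing spaces stays a single argument; quoting only matters when the command is printed or pasted into a shell.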
*A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the client spark-submit process, with the input and output of the application attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that, according to the version of the Spark documentation quoted here, cluster mode is not supported for standalone clusters, Mesos clusters, or Python applications. Later releases relax this restriction, as the standalone cluster submission shown earlier illustrates.
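The guidance in the two paragraphs above boils down to a simple decision rule. The sketch below encodes only that rule of thumb; recommend_deploy_mode and its parameter names are invented here, and real deployments also have to account for what the cluster manager supports.

```python
def recommend_deploy_mode(interactive, near_workers):
    """Pick a deploy mode following the rule of thumb quoted above.

    interactive:  the application needs a console or REPL (e.g. Spark shell).
    near_workers: the submitting machine is co-located with the workers.
    """
    if interactive:
        # The driver must stay attached to the local console.
        return "client"
    # A driver far from the executors pays network latency on every exchange.
    return "client" if near_workers else "cluster"


if __name__ == "__main__":
    print(recommend_deploy_mode(interactive=False, near_workers=False))
    print(recommend_deploy_mode(interactive=True, near_workers=False))
```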
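Notes:
1. Cluster mode supports automatic restart of the application.
2. Important enough to say three times: check whether cluster mode is supported for your setup. The documentation quoted above excludes standalone clusters, Mesos clusters, and Python applications.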
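III. About checkpoints
1. Spark Streaming checkpoints have a pitfall: if the application is upgraded and the code structure changes, the checkpoint directory must be deleted before redeploying, otherwise the job fails with an error. Deleting the checkpoint directory, however, loses the RDD state held by the application.
2. The upgrade path for Spark Streaming checkpoints is to use Structured Streaming checkpoints, which support application upgrades.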
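In Structured Streaming the checkpoint directory is set per query through the checkpointLocation option. The following is a minimal sketch, assuming stream_df is a streaming pyspark.sql.DataFrame created elsewhere; the function name start_with_checkpoint and the console sink are choices made for this illustration.

```python
def start_with_checkpoint(stream_df, checkpoint_dir, output_mode="append"):
    """Start a streaming query that records its progress in checkpoint_dir.

    stream_df is assumed to be a streaming DataFrame, e.g. one obtained
    from spark.readStream. The console sink is used only for demonstration.
    """
    return (
        stream_df.writeStream
        .outputMode(output_mode)
        .format("console")
        # Restarting with the same directory resumes from the recorded progress.
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Which code changes remain compatible with an existing checkpoint depends on the Spark version and the kind of query, so it is worth testing an upgrade against a copy of the checkpoint directory first.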