
Optimizing slow job submission with Spark on YARN


When running compute jobs in Spark on YARN mode, job submission turned out to be slow.

According to the logs, the slow submission comes mainly from the following stages:

1. Uploading resource files is too slow

17/05/09 10:13:28 INFO yarn.Client: Uploading resource file:/opt/cloudera/parcels/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar -> hdfs://nameservice1/user/root/.sparkStaging/application_1493349445616_12544/spark-assembly-1.6.3-hadoop2.6.0.jar

17/05/09 10:13:36 INFO yarn.Client: Uploading resource file:/home/wis2_work/wis-spark-stream-1.0.0-all.jar -> hdfs://nameservice1/user/root/.sparkStaging/application_1493349445616_12544/wis-spark-stream-1.0.0-all.jar

After these log lines, the jars the application depends on are uploaded as well, which takes roughly 30 seconds and slows down submission. The fix from the official docs: to make the Spark runtime jars accessible on the YARN side (the YARN nodes), specify spark.yarn.archive or spark.yarn.jars. If neither parameter is set, Spark uploads every jar under $SPARK_HOME/jars/ to the distributed cache, which is why job submission was so slow before.

The fix is as follows:

Step 1: Upload the relevant dependency jars under $SPARK_HOME/ to HDFS

hadoop fs -mkdir /wis/tmp

hadoop fs -put /opt/cloudera/parcels/spark-1.6.3-bin-hadoop2.6/lib/spark-*.jar /wis/tmp/

Step 2: Add the following to spark-defaults.conf:

spark.yarn.jar                 hdfs://nameservice1/wis/tmp/*.jar

The following variants also work:

#spark.yarn.jar                  hdfs://nameservice1/wis/tmp/*

## Configuring multiple jars directly, separated by commas, also works.

Note: in Spark 1.6.3 the property is spark.yarn.jar, see http://spark.apache.org/docs/1.6.3/running-on-yarn.html#configuration

In Spark 2.1.1 it is spark.yarn.jars, see http://spark.apache.org/docs/latest/running-on-yarn.html#configuration
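For Spark 2.x, the docs also allow bundling the runtime jars into one archive referenced by spark.yarn.archive (mentioned above). A rough sketch, with the archive name spark-libs.jar chosen arbitrarily and reusing the /wis/tmp layout from above, not part of the original setup:

jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .

hadoop fs -put spark-libs.jar /wis/tmp/

# spark-defaults.conf (Spark 2.x only):
spark.yarn.archive             hdfs://nameservice1/wis/tmp/spark-libs.jar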

Step 3: Change the jar path in the job submission script to the HDFS path; otherwise the application jar is still uploaded from the local filesystem to HDFS on each submission, which hurts efficiency. A sketch of such a submit command is shown below.
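A minimal sketch of what the submit command could look like once the application jar lives on HDFS; the main class, deploy mode, and HDFS location of the application jar are placeholders, not taken from the original script:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.StreamMain \
  hdfs://nameservice1/wis/tmp/wis-spark-stream-1.0.0-all.jar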

2. Spark cannot detect HADOOP_HOME

17/05/16 17:10:49 DEBUG Shell: Failed to detect a valid hadoop home directory

java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.

 at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:302)

 at org.apache.hadoop.util.Shell.<clinit>(Shell.java:327)

 at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)

 at org.apache.hadoop.yarn.conf.YarnConfiguration.<clinit>(YarnConfiguration.java:590)

 at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.newConfiguration(YarnSparkHadoopUtil.scala:66)

 at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:52)

 at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.<init>(YarnSparkHadoopUtil.scala:51)

 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

Fix: add the following to spark-env.sh:

export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop

export SPARK_HOME=/opt/cloudera/parcels/spark-1.6.3-bin-hadoop2.6

export HADOOP_CONF_DIR=/etc/hadoop/conf

if [ -n "$HADOOP_HOME" ]; then

  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native

fi

3. Spark fails to load the native Hadoop library

17/05/16 17:11:18 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...

17/05/16 17:11:18 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path

17/05/16 17:11:18 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

17/05/16 17:11:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

The cause: libhadoop.so and libsnappy.so are missing from the JRE directory. Specifically, spark-shell depends on Scala, and Scala runs on the JDK under JAVA_HOME, so libhadoop.so and libsnappy.so should be placed under $JAVA_HOME/jre/lib/amd64. These two .so files are normally found under the Hadoop native lib directory. First, add the following to spark-defaults.conf:

spark.executor.extraJavaOptions -XX:MetaspaceSize=300M -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.rootCategory=INFO -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native

Then copy the two files into the JRE directory, which resolves the issue; a sketch of the copy step follows.
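A sketch of that copy step on each node, assuming $JAVA_HOME points to the JDK Spark actually runs on and the native libs sit under the CDH path used above:

cp /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/libhadoop.so $JAVA_HOME/jre/lib/amd64/

cp /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/libsnappy.so $JAVA_HOME/jre/lib/amd64/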

GC policy and GC logging options are left unconfigured for now, since they can cause container initialization failures.

4. Slow to leave the YARN ACCEPTED state

Adjust the Hadoop role layout and optimize resource allocation on the server hosting the ResourceManager (RM) role; some generic diagnostic commands are sketched below.
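The article gives no concrete settings here. As a generic starting point (standard YARN CLI, not from the original post), these commands show which applications are stuck in ACCEPTED and what capacity the NodeManagers report, which helps decide where to rebalance roles and resources:

yarn application -list -appStates ACCEPTED

yarn node -list -all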