
Spark Development Issue Notes

2017-05-21 17:21
Environment: spark-2.1.0-bin-hadoop2.7

I. Spark on YARN client mode

1. JavaSparkContext not serializable

Solution:

JavaSparkContext is not serializable, and it is not meant to be: it must not be captured by functions that are shipped to remote workers. Declare the JavaSparkContext as a static field, since serialization skips static variables (their state is not saved); fields marked transient are likewise excluded from serialization.
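A minimal sketch of this pattern (class and variable names below are made up for illustration): keep the context in a static field so closures shipped to executors never capture it.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WordCounter {
  // static field: Java serialization ignores it, so tasks never try to ship the context
  private static JavaSparkContext jsc;

  public static void main(String[] args) {
    jsc = new JavaSparkContext(new SparkConf().setAppName("word-counter"));
    JavaRDD<String> lines = jsc.textFile(args[0]);
    // the lambda only captures serializable values, never jsc itself
    JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
    System.out.println(words.count());
    jsc.stop();
  }
}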

2. Calling the ES API fails with: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [127.0.0.1:9200] returned Bad Request(400) - failed to parse; Bailing out..

This error usually means the ES API call does not match the DStream's element type, so the conversion fails; it can also be a type-conversion problem in the data itself. Match the call to the data (a sketch follows the list below):

saveToEs — the DStream element type needs to be a Map (either a Scala or a Java one), a JavaBean, or a Scala case class.

saveJsonToEs — for DStreams of JSON strings.

saveToEsWithMeta — for pair DStreams, where the key carries the document metadata.
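A rough illustration in Java (JavaEsSparkStreaming ships with elasticsearch-hadoop 5.x; the index/type "logs/event" is a placeholder, and the exact method shapes should be checked against the connector version in use):

import java.util.Map;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.elasticsearch.spark.streaming.api.java.JavaEsSparkStreaming;

public class EsSinkExamples {
  // elements are Maps (or JavaBeans / case classes) -> saveToEs
  static void writeDocs(JavaDStream<Map<String, Object>> docs) {
    JavaEsSparkStreaming.saveToEs(docs, "logs/event");
  }

  // elements are already-serialized JSON strings -> saveJsonToEs
  static void writeJson(JavaDStream<String> jsonDocs) {
    JavaEsSparkStreaming.saveJsonToEs(jsonDocs, "logs/event");
  }
}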

3. java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

Change the dependency scope from "provided" to "compile".
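For example, the spark-core dependency would then look roughly like this (assuming a Scala 2.11 build of Spark 2.1.0; adjust artifactId and version to the actual build):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.1.0</version>
  <scope>compile</scope>
</dependency>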

4. After adding elasticsearch-hadoop-5.2.2, its log4j jar conflicts with Spark's log4j jar. The error is:

Caused by: java.lang.IllegalStateException: Detected both log4j-over-slf4j.jar AND slf4j-log4j12.jar on the class path, preempting StackOverflowError


Solution: add exclusions to the dependency:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>5.2.2</version>
  <exclusions>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>log4j-over-slf4j</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
  </exclusions>
</dependency>


II. Standalone mode

1. Error:

Exception while invoking mkdirs of class ClientNamenodeProtocolTranslatorPB over 10.107.99.217/10.107.99.217:9000 after 10 fail over attempts. Trying to fail over after sleeping for 13156ms.
java.net.ConnectException: Call From VM-100-181-centos/127.0.0.1 to 10.107.99.217:9000 failed on


Cause: the code sets a checkpoint, and in cluster mode the checkpoint directory lives on HDFS by default; because the HDFS service was unreachable, creating the directory failed.
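For reference, a minimal sketch of where the checkpoint directory is set on a streaming context (the HDFS path is a placeholder; the NameNode address is the one from the error above and must actually be reachable):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointSetup {
  public static void main(String[] args) {
    JavaStreamingContext jssc = new JavaStreamingContext(
        new SparkConf().setAppName("checkpoint-demo"), Durations.seconds(10));
    // in cluster mode this must be a reachable shared filesystem (normally HDFS);
    // if the NameNode is down, the mkdirs call fails exactly as in the log above
    jssc.checkpoint("hdfs://10.107.99.217:9000/spark/checkpoint");
    // ... define DStreams, then jssc.start() / jssc.awaitTermination() ...
  }
}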

2. When running in Spark standalone mode, growing data volume sometimes triggers one of the following two errors:

java.lang.OutOfMemoryError: Java heap space

java.lang.OutOfMemoryError: GC overhead limit exceeded


The cause is insufficient driver memory; when no driver memory is specified, the default allocation is 512 MB. Specify --driver-memory 2g when submitting with spark-submit, for example:
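An illustrative submission command (master URL, class name, and jar path are placeholders):

spark-submit \
  --master spark://10.107.99.217:7077 \
  --class com.example.StreamingApp \
  --driver-memory 2g \
  target/streaming-app.jar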

3. Spark writes to ES fail because only one ES node was configured and communication with it failed:

Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:102)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
... 3 more
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[10.255.10.xxx:9200] returned [503|Service Unavailable:]


Solution: configure multiple ES nodes, e.g. es.nodes = 10.255.10.xxx,10.255.10.xxx,10.107.99.xxx,10.107.103.xxx
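A sketch of where these connector settings live, on the SparkConf (host addresses below are placeholders; es.nodes.wan.only only matters when reaching the cluster through a WAN/cloud gateway, as the error message hints):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class EsNodesConfig {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("es-writer")
        // comma-separated list of nodes so a single unreachable node is not fatal
        .set("es.nodes", "10.255.10.1:9200,10.107.99.2:9200,10.107.103.3:9200")
        .set("es.nodes.wan.only", "false");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    // ... build the RDD/DStream and write with saveToEs as in section I.2 ...
    jsc.stop();
  }
}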