您的位置：首页 > 其它

Spark Streaming On Yarn/ On StandAlone模式下的checkpointing容错

2017-09-08 15:04 567 查看

Spark On Yarn：

在Spark On Yarn模式下部署Spark Streaming 时候，我们需要使用StreamingContext.getOrCreate方法创建StreamingContext实例，指定我们自己的checkpoint目录，用作存储checkpoint数据。

容错1：

当我们使用spark-submit成功提交一个程序之后，我们可以使用jps能够查看到CoarseGrainedExecutorBackend进程，当我们直接kill该线程时候，你会发现立马会再次启动一个全新的CoarseGrainedExecutorBackend进行，这个就是Yarn提供的特性支持自动重启容错，重启的CoarseGrainedExecutorBackend会去读取checkpoint中的数据然后继续计算。

容错2：

如果我们使用yarn application -kill jobid 直接杀死作业的话，再次启动你会发现程序启动不起来了。直接抛异常：

Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)

异常信息也很清晰：Yarn application has already ended! It might have been killed or unable to launch application master. 为了确定该问题只好翻阅源码看问题具体原因，查看YarnClientSchedulerBackend的waitForApplication方法：

/**
* Report the state of the application until it is running.
* If the application has finished, failed or been killed in the process, throw an exception.
* This assumes both `client` and `appId` have already been set.
*/
private def waitForApplication(): Unit = {
assert(client != null && appId.isDefined, "Application has not been submitted yet!")
val (state, _) = client.monitorApplication(appId.get, returnOnRunning = true) // blocking
if (state == YarnApplicationState.FINISHED ||  //判断如果是完成或者失败或者杀死的话 直接抛异常
state == YarnApplicationState.FAILED ||
state == YarnApplicationState.KILLED) {
throw new SparkException("Yarn application has already ended! " +
"It might have been killed or unable to launch application master.")
}
if (state == YarnApplicationState.RUNNING) {
logInfo(s"Application ${appId.get} has started running.")
}
}

通过代码猜测，Spark 设计者的想法是：如果程序异常被杀死某一个进行的话，则又Yarn负责自动重启容错，就如上面容错问题1描述。如果是人为kill，或者执行完毕，或者失败的话，则认为程序本身要么是执行完毕不需要再次运行，要么认为程序有错误不需要再次执行。

所以on yarn的容错是由yarn负责的。

拓展：由于yarn不允许我们重启提交作业从checkpoint中恢复的话，那么如果我们stareaming消费kafka的话，就需要手动完成kafka的消费offset的自行保存，一遍加载启动时候能够继续消费。

Spark On StandAlone：

如果是spark standlone模式提交的话，如果直接在webui上面kill该job的话，会从新加载上次的checkpoint目录完成容错。

spark standlone模式下，也支持kill CoarseGrainedExecutorBackend之后自动重启CoarseGrainedExecutorBackend的。

结论：不同的部署模式容错有差异，还需要视具体情况而定。

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航