[Storm] java.io.FileNotFoundException: File '../stormconf.ser' does not exist
2015-11-26 12:05
This bug will kill supervisors
Affects Version/s: 0.9.2-incubating, 0.9.3, 0.9.4
Fix Version/s: 0.10.0, 0.9.5
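Background
Recently I noticed that on a freshly built Storm cluster, more than half of the Supervisors had quietly died after a short while. The logs of the dead Supervisors showed the exception java.io.FileNotFoundException: File '../stormconf.ser' does not exist. Most answers online suggest emptying the { storm.local.dir } directory and restarting.
But this treats the symptom rather than the cause: a restart gets things running again, yet it still does not explain why the problem occurs.
Later I found that STORM-130 addresses this problem. The issue describes the following reproduction scenario: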
1) Run a Storm cluster with at least 2 supervisors with 4 slots each
2) Deploy a topology that uses 4 workers; the topology will be distributed so that each supervisor has two workers
3) Kill one of the supervisors, let's say supervisor1
4) Wait until the topology re-balances to occupy 4 workers on supervisor2
5) Now bring up supervisor1; it goes through the cycle of cleaning up old topology code
6) Nimbus re-balances the topology, which triggers the supervisor's sync-processes method
7) sync-processes tries to launch a worker for the topology whose code data was deleted when the supervisor started, causing it to throw the FileNotFoundException above
Root cause
The sync-processes function mentioned in the scenario above is run by the Supervisor. The Supervisor runs these two functions in the background:
synchronize-supervisor: This is called whenever assignments in Zookeeper change and also every 10 seconds.
Downloads code from Nimbus for topologies assigned to this machine for which it doesn't have the code yet.
Writes into local filesystem what this node is supposed to be running. It writes a map from port -> LocalAssignment. LocalAssignment contains a topology id as well as the list of task ids for that worker.
sync-processes: Reads from the LFS what synchronize-supervisor wrote and compares that to what's actually running on the machine. It then starts/stops worker processes as necessary to synchronize.
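The hand-off between the two functions can be sketched in Python. This is only a minimal model under stated assumptions, not Storm's actual Clojure code: the function names mirror the description above, but the parameters (`zk_assignments`, `downloaded_code`, `local_state`, `running`) and the use of plain dicts and sets in place of ZooKeeper, Nimbus and the local filesystem are hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalAssignment:
    topology_id: str
    task_ids: tuple


def synchronize_supervisor(zk_assignments, downloaded_code, local_state):
    """Mirror this node's assignments into local state (hypothetical model)."""
    # "Download" code for assigned topologies that are not present locally yet.
    for assignment in zk_assignments.values():
        downloaded_code.add(assignment.topology_id)
    # Record what this node is supposed to run: port -> LocalAssignment.
    local_state.clear()
    local_state.update(zk_assignments)


def sync_processes(local_state, running):
    """Start/stop workers so that `running` matches `local_state`.

    Returns the ports that were started and the ports that were stopped.
    """
    to_stop = sorted(p for p, a in running.items() if local_state.get(p) != a)
    for port in to_stop:
        del running[port]
    to_start = sorted(p for p in local_state if p not in running)
    for port in to_start:
        running[port] = local_state[port]
    return to_start, to_stop


# Two workers assigned to this node, then one of them is moved elsewhere.
zk = {
    6700: LocalAssignment("topo-1", (1, 2)),
    6701: LocalAssignment("topo-1", (3, 4)),
}
code, state, running = set(), {}, {}
synchronize_supervisor(zk, code, state)
first = sync_processes(state, running)   # ([6700, 6701], [])
del zk[6701]
synchronize_supervisor(zk, code, state)
second = sync_processes(state, running)  # ([], [6701])
```

The point of the sketch is that the two functions never call each other: the only thing they share is the local state that one writes and the other reads.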
As the description shows, synchronize-supervisor and sync-processes synchronize with each other through the LFS (local filesystem). The key reason for the bug is that the "synchronize-supervisor" thread, which is responsible for downloading and removing files, and the "sync-processes" thread, which is responsible for starting worker processes, run asynchronously.
In synchronize-supervisor, the supervisor reads assignment information from ZooKeeper, downloads the necessary files from Nimbus, and writes local state. In another thread, the sync-processes function reads the local state to launch worker processes. If, before a worker process has started, synchronize-supervisor is called again because the topology's assignment has changed (caused by a rebalance, a worker timeout, etc.) and the worker assigned to this supervisor has moved to another supervisor, synchronize-supervisor removes the now-unnecessary files (jar file, ser file, etc.). When sync-processes then launches the worker, the ser file no longer exists, and this issue occurs.
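The interleaving described above can be replayed deterministically in a few lines of Python. This is an illustrative sketch, not Storm code: the helpers `download_code`, `remove_code` and `launch_worker` are hypothetical, the `stormdist/<topology-id>/stormconf.ser` layout under a temporary directory is a simplified assumption, and the two threads are replaced by three sequential steps in the order that triggers the failure.

```python
import os
import shutil
import tempfile


def download_code(local_dir, topology_id):
    """Stand-in for synchronize-supervisor fetching topology code from Nimbus."""
    code_dir = os.path.join(local_dir, "stormdist", topology_id)
    os.makedirs(code_dir, exist_ok=True)
    with open(os.path.join(code_dir, "stormconf.ser"), "wb") as f:
        f.write(b"serialized-conf")


def remove_code(local_dir, topology_id):
    """Stand-in for synchronize-supervisor deleting code it no longer needs."""
    shutil.rmtree(os.path.join(local_dir, "stormdist", topology_id),
                  ignore_errors=True)


def launch_worker(local_dir, topology_id):
    """Stand-in for sync-processes: a worker launch needs to read stormconf.ser."""
    conf = os.path.join(local_dir, "stormdist", topology_id, "stormconf.ser")
    with open(conf, "rb") as f:
        return f.read()


def replay_race():
    local_dir = tempfile.mkdtemp()
    try:
        # 1. synchronize-supervisor sees the assignment and downloads the code.
        download_code(local_dir, "topo-1")
        # sync-processes has read local state and decided to launch this worker.
        pending_launch = "topo-1"
        # 2. Before the worker starts, the assignment moves to another supervisor;
        #    synchronize-supervisor runs again and removes the unneeded files.
        remove_code(local_dir, "topo-1")
        # 3. sync-processes now launches the worker from its stale decision.
        try:
            launch_worker(local_dir, pending_launch)
            return "launched"
        except FileNotFoundError as exc:
            return f"failed: {os.path.basename(exc.filename)} does not exist"
    finally:
        shutil.rmtree(local_dir, ignore_errors=True)


result = replay_race()
print(result)  # failed: stormconf.ser does not exist
```

Swapping steps 2 and 3 makes the launch succeed, which is why the failure only shows up when the assignment changes within the short window between the launch decision and the actual worker start.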
Possible workarounds
Switch to a different Storm version, or adjust parameters:
Change the "synchronize-supervisor" thread loop time to something longer than the default 10 seconds, such as 30 seconds.
supervisor.worker.timeout.secs: 30 -> 5
References:
https://issues.apache.org/jira/browse/STORM-130
http://storm.apache.org/documentation/Lifecycle-of-a-topology.html