
Troubleshooting a Hadoop 1.0 TaskTracker memory leak caused by the distributed cache

2014-11-29 18:36
       Last Friday a colleague came in and said the on-call engineer had phoned him in the small hours: some jobs were stuck and had been running for a long time without finishing. Since it was the middle of the night he did not have time to examine the logs in detail; he simply restarted the problematic TaskTrackers, and only one node's TaskTracker process stayed down. He asked me to find out what had actually gone wrong. The investigation went as follows:

1. Log in to the machine whose TT process had stopped

(1) Check disk space:



The disk was not short of space.

(2) Check load and memory usage with top:



The top output showed that neither memory usage nor load was high, and there were no zombie processes.

2. Examine the process logs

(1) The log4j log:

2014-11-28 06:47:43,813 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201406171104_5451817_m_001292_0 which needs 1 slots
2014-11-28 06:47:43,813 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 3 and trying to launch attempt_201406171104_5451817_m_001292_0 which needs 1 slots
2014-11-28 06:47:43,814 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201406171104_5451797_m_-1554592361
2014-11-28 06:47:43,814 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201406171104_5451797_m_-1554592361 spawned.
2014-11-28 06:47:43,821 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /data2/hadoop/local/ttprivate/taskTracker/optimus/jobcache/job_201406171104_5451797/attempt_201406171104_5451797_m_000521_0/taskjvm.sh
2014-11-28 06:47:43,992 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.39.4.178:50060, dest: 10.39.5.241:47487, bytes: 18, op: MAPRED_SHUFFLE, cliID: attempt_201406171104_5450184_m_001640_0, duration: 235398
2014-11-28 06:47:44,317 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.39.4.178:50060, dest: 10.39.0.140:60631, bytes: 18, op: MAPRED_SHUFFLE, cliID: attempt_201406171104_5450184_m_001640_0, duration: 163982
2014-11-28 06:47:44,580 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201406171104_5451797_m_-1554592361 given task: attempt_201406171104_5451797_m_000521_0
2014-11-28 06:47:44,895 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.39.4.178:50060, dest: 10.39.4.164:50407, bytes: 18, op: MAPRED_SHUFFLE, cliID: attempt_201406171104_5450184_m_001640_0, duration: 168406
2014-11-28 06:47:45,057 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:

No exceptions appeared in the log.

(2) Since the process had already exited, its stack and memory mappings could no longer be inspected. Fortunately, GC logging had been enabled in the JVM options when the Hadoop daemons were set up:

vim /etc/bashrc

export JAVA_HOME=/usr/local/jdk1.6.0_21
export JRE_HOME=/usr/local/jdk1.6.0_21/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export HADOOP_HOME=/usr/local/hadoop-0.20.203.0
export HADOOP_LOG_DIR=/data0/hadoop/log
export HADOOP_PID_DIR=/data0/hadoop/pid
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$HADOOP_HOME/bin:$PATH
ulimit -u 65535
export HADOOP_DATANODE_OPTS=" -Xmx4096m -verbose:gc -Xloggc:/data0/hadoop/gclog/datanode.gc.log -XX:ErrorFile=/data0/hadoop/gclog/hs_err_pid.log -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError "
export HADOOP_TASKTRACKER_OPTS=" -Xmx4096m -verbose:gc -Xloggc:/data0/hadoop/gclog/tasktracker.gc.log -XX:ErrorFile=/data0/hadoop/gclog/hs_err_pid.log -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError "
The GC log showed the following:

2014-11-28T06:47:08.025+0800: 26433049.397: [Full GC [PSYoungGen: 1340224K->1286605K(1368576K)] [PSOldGen: 2796223K->2796223K(2796224K)] 4136447K->4082829K(4164800K) [PSPermGen: 21554K->21553K(21568K)] GC time would exceed GCTimeLimit of 98%
, 5.5652750 secs] [Times: user=0.00 sys=5.56, real=5.57 secs]
2014-11-28T06:47:18.126+0800: 26433059.499: [Full GC [PSYoungGen: 1340224K->1340224K(1368576K)] [PSOldGen: 2796223K->2796223K(2796224K)] 4136447K->4136447K(4164800K) [PSPermGen: 21554K->21554K(21568K)], 6.6131270 secs] [Times: user=0.00 sys=6.61, real=6.61 secs]
2014-11-28T06:47:24.740+0800: 26433066.112: [Full GC [PSYoungGen: 1340224K->1286378K(1368576K)] [PSOldGen: 2796223K->2796223K(2796224K)] 4136447K->4082602K(4164800K) [PSPermGen: 21554K->21553K(21568K)], 5.5440730 secs] [Times: user=0.00 sys=5.55, real=5.55 secs]
2014-11-28T06:47:30.994+0800: 26433072.367: [Full GC [PSYoungGen: 1340224K->1340224K(1368576K)] [PSOldGen: 2796223K->2796223K(2796224K)] 4136447K->4136447K(4164800K) [PSPermGen: 21554K->21554K(21568K)], 6.7307300 secs] [Times: user=0.00 sys=6.73, real=6.73 secs]
2014-11-28T06:47:37.725+0800: 26433079.098: [Full GC [PSYoungGen: 1340224K->1287718K(1368576K)] [PSOldGen: 2796223K->2796223K(2796224K)] 4136447K->4083942K(4164800K) [PSPermGen: 21554K->21553K(21568K)], 5.7407480 secs] [Times: user=0.00 sys=5.75, real=5.74 secs]
Heap
PSYoungGen total 1368576K, used 1330631K [0x00007f66b3ab0000, 0x00007f6709000000, 0x00007f6709000000)
eden space 1340224K, 99% used [0x00007f66b3ab0000,0x00007f6704e21d20,0x00007f6705780000)
from space 28352K, 0% used [0x00007f6705780000,0x00007f6705780000,0x00007f6707330000)
to space 29504K, 0% used [0x00007f6707330000,0x00007f6707330000,0x00007f6709000000)
PSOldGen total 2796224K, used 2796223K [0x00007f6609000000, 0x00007f66b3ab0000, 0x00007f66b3ab0000)
object space 2796224K, 99% used [0x00007f6609000000,0x00007f66b3aaffe8,0x00007f66b3ab0000)
PSPermGen total 21760K, used 21574K [0x00007f6603c00000, 0x00007f6605140000, 0x00007f6609000000)
object space 21760K, 99% used [0x00007f6603c00000,0x00007f6605111a98,0x00007f6605140000)
Right before the process died, the old generation was 99% full and the JVM was running Full GC after Full GC; the inline warning shows that GC time had exceeded the 98% GCTimeLimit.
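For reference, the old-generation figures in those Full GC lines can be checked mechanically. The sketch below is only an illustration for reading the log above, not a Hadoop utility: it parses a `used->used(capacity)` segment and computes the occupancy after the collection.

```java
// Sketch: compute occupancy from a GC-log segment like "2796223K->2796223K(2796224K)".
// Illustration only; class and method names are invented for this example.
public class GcLogCheck {
    static double occupancyAfterGc(String seg) {
        // the value after "->" is the live size once the collection has finished
        long used = Long.parseLong(
                seg.substring(seg.indexOf("->") + 2, seg.indexOf('(')).replace("K", ""));
        long capacity = Long.parseLong(
                seg.substring(seg.indexOf('(') + 1, seg.indexOf(')')).replace("K", ""));
        return 100.0 * used / capacity;
    }

    public static void main(String[] args) {
        // PSOldGen segment from the 06:47:18 Full GC line above
        System.out.printf("old gen: %.4f%% full%n",
                occupancyAfterGc("2796223K->2796223K(2796224K)"));
    }
}
```

A Full GC that ends with the old generation still effectively 100% full has reclaimed nothing, which is the classic signature of a leak rather than a transient allocation spike.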
This revealed the real problem: the TaskTracker's old generation was full, so the JVM was stuck in back-to-back Full GCs. A Full GC is stop-the-world, meaning the process serves no requests while it runs, and that is the root cause of the stuck jobs. The next question was what had filled the heap. The TaskTracker was configured with a 4 GB heap, which should have been more than enough, so a memory leak was the prime suspect. To confirm it, the process memory had to be examined. I logged in to another node whose TaskTracker process was still alive and ran jstat -gcutil ${pid}:

[mapred@yz4175 ~]$ jstat -gcutil 27617
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 0.00 100.00 100.00 99.85 621679 20840.395 47500 251938.770 272779.165
0.00 0.00 78.03 100.00 99.76 621679 20840.395 47500 251945.371 272785.766
0.00 0.00 78.41 100.00 99.76 621679 20840.395 47500 251945.371 272785.766
0.00 0.00 78.77 100.00 99.76 621679 20840.395 47500 251945.371 272785.766
Surprisingly, this seemingly healthy node had the same problem: the old generation was at 100% and it, too, was running Full GCs continuously, so it also appeared to be leaking memory.
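The jstat counters above also quantify the damage: FGC is the cumulative Full GC count and FGCT the cumulative Full GC time in seconds, so dividing one by the other gives the average stop-the-world pause. A quick illustrative calculation (not from the original post):

```java
// Illustration: average Full GC pause from the cumulative jstat -gcutil counters.
public class FgcPause {
    // FGC (count) and FGCT (total seconds) columns from the jstat output above
    static double avgPauseSecs(long fgcCount, double fgcTimeSecs) {
        return fgcTimeSecs / fgcCount;
    }

    public static void main(String[] args) {
        // 47500 Full GCs totalling 251938.770 s: roughly 5.3 s per pause
        System.out.printf("avg Full GC pause: %.2f s%n",
                avgPauseSecs(47500, 251938.770));
    }
}
```

That average of about 5.3 s matches the 5-7 s pauses in the GC log, and with the heap permanently full these pauses run back to back, so from the outside the TaskTracker simply stops responding.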
Then I ran:

jmap -histo:live 27617  > jmap.log




The jmap histogram showed over a million TrackerDistributedCacheManager$CacheStatus and TaskDistributedCacheManager$CacheFile instances, and about 860,000 TaskDistributedCacheManager instances. All of these classes belong to the TaskTracker's distributed cache, so the distributed cache looked like the culprit.

Searching Hadoop's JIRA for "cachestatus" turned it up immediately:

https://issues.apache.org/jira/browse/MAPREDUCE-3343?jql=text%20~%20%22cachestatus%22

According to the JIRA, when a job is initialized on the TT, its TaskDistributedCacheManager is put into the jobArchives map keyed by job ID, but the TT never removes the job's distributed-cache entries from jobArchives after the job completes. The fix is simple: purge the job's distributed-cache state as soon as the job finishes, and the JIRA includes a corresponding patch.
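The shape of the leak and its fix can be sketched as follows. The names `jobArchives` and `TaskDistributedCacheManager` follow the JIRA description, but the methods here are simplified stand-ins, not the actual MAPREDUCE-3343 patch:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the leak and its fix. TaskDistributedCacheManager here is
// an empty stand-in for the real class; see MAPREDUCE-3343 for the real patch.
public class JobCacheCleanup {
    static class TaskDistributedCacheManager { /* holds CacheFile/CacheStatus refs */ }

    // jobId -> per-job distributed-cache manager, as described in the JIRA
    static final Map<String, TaskDistributedCacheManager> jobArchives = new HashMap<>();

    static void initializeJob(String jobId) {
        jobArchives.put(jobId, new TaskDistributedCacheManager());
    }

    // The bug: before the patch, nothing removed the entry when a job finished,
    // so managers accumulated until the old generation filled up.
    static void onJobComplete(String jobId) {
        jobArchives.remove(jobId); // the essence of the fix
    }

    public static void main(String[] args) {
        initializeJob("job_201406171104_5451797");
        onJobComplete("job_201406171104_5451797");
        System.out.println("entries left: " + jobArchives.size());
    }
}
```

On a long-lived TaskTracker that runs hundreds of thousands of jobs between restarts, even a few kilobytes retained per job is enough to fill a 4 GB heap, which is consistent with the roughly 860,000 TaskDistributedCacheManager instances in the jmap histogram.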