How to Resolve DataNodes Failing to Connect to the NameNode in a Hadoop Cluster
2017-09-22 23:02
Introduction
This article summarizes a problem in a Hadoop cluster where the DataNodes could not connect to the NameNode:

2017-02-13 05:43:01,540 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: hadoop-master-vm/10.220.33.37:9000

The focus is on the approach and methods used to track the problem down. The environment was Ubuntu 16.04 LTS with Hadoop 2.7.3.

Problem Description
After building the cluster following the post "Ubuntu环境下Hadoop集群/分布式环境配置", running hdfs dfsadmin -report to inspect the distributed file system showed no data:

hadoop@hadoop-master-vm:~$ hdfs dfsadmin -report
Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: NaN%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------

After logging in to http://hadoop-master-vm:50070, the DataNode list was empty.
Problem Analysis and Troubleshooting
Here is a summary of the troubleshooting approach.

1. When this kind of problem occurs, first confirm that the relevant service processes have actually started; the jps command shows them. On a healthy cluster, the master node should show the NameNode, ResourceManager, SecondaryNameNode, and JobHistoryServer processes, and each slave node should show the DataNode and NodeManager processes. In our case, jps confirmed that all of these were running.
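The jps check above can be scripted so it fails loudly when a daemon is missing. A minimal sketch, assuming a POSIX shell; the check_daemons helper and the sample output are illustrative, not part of the original setup:

```shell
#!/bin/sh
# Verify that every expected Hadoop daemon name appears in jps output.
# check_daemons is a hypothetical helper; the daemon list below is the
# one expected on the master node.
check_daemons() {
    jps_output="$1"; shift
    for daemon in "$@"; do
        if ! printf '%s\n' "$jps_output" | grep -qw "$daemon"; then
            echo "MISSING: $daemon"
            return 1
        fi
    done
    echo "all daemons present"
}

# On a live master node you would pass real output:
#   check_daemons "$(jps)" NameNode ResourceManager SecondaryNameNode JobHistoryServer
sample='1234 NameNode
2345 ResourceManager
3456 SecondaryNameNode
4567 JobHistoryServer'
check_daemons "$sample" NameNode ResourceManager SecondaryNameNode JobHistoryServer
# prints: all daemons present
```

The same function works on a slave node by passing DataNode and NodeManager as the expected names.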
2. The next step is to check the logs, which live in the logs subdirectory of the Hadoop installation. The log file of one DataNode, $INSTALL_HADOOP/logs/hadoop-hadoop-datanode-hadoop-slave01-vm.log, contained the following exception:
2017-02-13 05:24:52,166 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: hadoop-master-vm/10.220.33.37:9000
2017-02-13 05:24:58,168 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-master-vm/10.220.33.37:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2017-02-13 05:24:59,169 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-master-vm/10.220.33.37:9000. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
The log shows that the DataNode cannot connect to the NameNode, so the next step is to investigate the NameNode side.
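Before digging into the NameNode itself, raw TCP reachability of the RPC port can be probed from a slave node. A sketch using bash's /dev/tcp pseudo-device (the can_connect helper is illustrative; the host and port are the ones from the log above, and GNU timeout is assumed to be available):

```shell
# Try to open a TCP connection to $1:$2, giving up after 3 seconds.
can_connect() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Run this on a DataNode, against the NameNode RPC endpoint from the log:
if can_connect hadoop-master-vm 9000; then
    echo "NameNode RPC port reachable"
else
    echo "cannot reach hadoop-master-vm:9000"
fi
```

If this fails while the NameNode process is running, the problem is in the network path or in where the NameNode is actually listening, which is exactly what the next step examines.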
3. On the NameNode, we used netstat -l to inspect the listening ports; service port 9000 appeared to be working normally:
hadoop@hadoop-master-vm:~$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address  State
tcp        0      0 127.0.1.1:10020         *:*              LISTEN
tcp        0      0 hadoop-master-vm:9000   *:*              LISTEN
tcp        0      0 hadoop-master-vm:50090  *:*              LISTEN
tcp        0      0 *:netbios-ssn           *:*              LISTEN
tcp        0      0 127.0.1.1:19888         *:*              LISTEN
tcp        0      0 *:x11-1                 *:*              LISTEN

At this point we suspected the system firewall was filtering packets on port 9000 and tried sudo ufw allow 9000 to let them through, without success. Re-examining the netstat -l output, the listening addresses for ports 10020 and 19888 looked odd: 127.0.1.1. Yet in mapred-site.xml these services are configured by hostname:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop-master-vm:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop-master-vm:19888</value>
  </property>
</configuration>

The netstat -l output conflicts with the mapred-site.xml configuration: these services should be listening on hadoop-master-vm's address, not on 127.0.1.1. Checking /etc/hosts revealed an entry mapping that hostname to 127.0.1.1:
hadoop@hadoop-master-vm:~/hadoop-2.7.3/logs$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 hadoop-master-vm
10.220.33.37 hadoop-master-vm
10.220.33.36 hadoop-slave01-vm
10.220.33.35 hadoop-slave02-vm
10.220.33.34 hadoop-slave03-vm

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

We commented out the 127.0.1.1 entry, restarted the services, and checked the listening ports again:
hadoop@hadoop-master-vm:~$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address  State
tcp        0      0 hadoop-master-vm:10020  *:*              LISTEN
tcp        0      0 hadoop-master-vm:9000   *:*              LISTEN
tcp        0      0 hadoop-master-vm:50090  *:*              LISTEN
tcp        0      0 *:netbios-ssn           *:*              LISTEN
tcp        0      0 hadoop-master-vm:19888  *:*              LISTEN
tcp        0      0 *:10033                 *:*              LISTEN

The listening addresses now match the actual configuration, and hdfs dfsadmin -report now shows the DataNode information:
hadoop@hadoop-master-vm:~$ hdfs dfsadmin -report
Configured Capacity: 82035068928 (76.40 GB)
Present Capacity: 53878968320 (50.18 GB)
DFS Remaining: 53878870016 (50.18 GB)
DFS Used: 98304 (96 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (3):

Name: 10.220.33.35:50010 (hadoop-slave02-vm)
Hostname: hadoop-slave02-vm
Decommission Status : Normal
Configured Capacity: 27345022976 (25.47 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 9382637568 (8.74 GB)
DFS Remaining: 17962352640 (16.73 GB)
DFS Used%: 0.00%
DFS Remaining%: 65.69%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 13 06:50:53 EST 2017

Name: 10.220.33.34:50010 (hadoop-slave03-vm)
Hostname: hadoop-slave03-vm
Decommission Status : Normal
Configured Capacity: 27345022976 (25.47 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 9390850048 (8.75 GB)
DFS Remaining: 17954140160 (16.72 GB)
DFS Used%: 0.00%
DFS Remaining%: 65.66%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 13 06:50:53 EST 2017

Name: 10.220.33.36:50010 (hadoop-slave01-vm)
Hostname: hadoop-slave01-vm
Decommission Status : Normal
Configured Capacity: 27345022976 (25.47 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 9382612992 (8.74 GB)
DFS Remaining: 17962377216 (16.73 GB)
DFS Used%: 0.00%
DFS Remaining%: 65.69%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 13 06:50:53 EST 2017
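The /etc/hosts edit that fixed the cluster can also be scripted. A sketch applied to a scratch copy rather than the live file (the path and file contents here are illustrative):

```shell
# Reproduce the problematic hosts entries in a scratch file.
cat > /tmp/hosts.demo <<'EOF'
127.0.0.1 localhost
127.0.1.1 hadoop-master-vm
10.220.33.37 hadoop-master-vm
EOF

# Comment out the 127.0.1.1 line. On a real node you would back up
# /etc/hosts first and run the same sed (with sudo) against it.
sed -i 's/^127\.0\.1\.1/# 127.0.1.1/' /tmp/hosts.demo

grep '127.0.1.1' /tmp/hosts.demo
# prints: # 127.0.1.1 hadoop-master-vm
```

Note that sed -i is GNU sed syntax; on BSD systems the in-place flag requires a backup suffix argument.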
Problem Summary
The misconfigured /etc/hosts file caused the services to listen on the wrong address, which broke communication between the NameNode and DataNodes. Similar problems can be diagnosed by following the troubleshooting approach described in this article.
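Why did the 127.0.1.1 line win over 10.220.33.37? When two /etc/hosts entries map the same name, the resolver generally returns the first matching line, so the Hadoop daemons bound to that address. The lookup can be mimicked with awk over a sample file (an illustration of the first-match behavior, not the live resolver):

```shell
# Sample hosts entries, not the live /etc/hosts.
cat > /tmp/hosts.sample <<'EOF'
127.0.0.1 localhost
127.0.1.1 hadoop-master-vm
10.220.33.37 hadoop-master-vm
EOF

# Like the resolver, take the first line whose name field matches:
awk '$2 == "hadoop-master-vm" { print $1; exit }' /tmp/hosts.sample
# prints: 127.0.1.1
```

On a live system, getent hosts hadoop-master-vm shows what the resolver actually returns for the name.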