
How to Fix DataNodes Failing to Connect to the NameNode in a Hadoop Cluster

2017-09-22 23:02

Introduction

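This article summarizes how to troubleshoot a DataNode that cannot connect to the NameNode in a Hadoop cluster, as reported by the log message: 2017-02-13 05:43:01,540 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: hadoop-master-vm/10.220.33.37:9000. The focus is on the reasoning and method used to isolate the cause. The problem occurred on Ubuntu 16.04 LTS with Hadoop 2.7.3.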

Problem Description

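After setting up a Hadoop cluster by following the blog post "Hadoop Cluster/Distributed Environment Configuration on Ubuntu", we ran hdfs dfsadmin -report to view the distributed file system information and found that it showed no data: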
hadoop@hadoop-master-vm:~$ hdfs dfsadmin -report
Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: NaN%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
On the NameNode web UI at http://hadoop-master-vm:50070, the DataNode list was empty as well.

Analysis and Troubleshooting

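The troubleshooting approach for this problem is summarized below.
1. For this kind of problem, first confirm that the relevant service processes are actually running, which can be checked with the jps command. When startup succeeds, the Master node shows the NameNode, ResourceManager, SecondaryNameNode, and JobHistoryServer services, and each Slave node shows the DataNode and NodeManager services. In our case, jps showed that all of these services were present.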
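The jps check in step 1 can be scripted so that a missing daemon stands out. The following is a minimal sketch, assuming jps prints one "<pid> <ClassName>" pair per line; the function name missing_daemons is hypothetical and not part of Hadoop.

```shell
#!/bin/sh
# Hypothetical helper: report which expected Hadoop daemons are absent
# from jps output. Reads jps output ("<pid> <ClassName>" per line) on
# stdin and prints each expected name that does not appear, one per line.
missing_daemons() {
    jps_output=$(cat)
    for daemon in "$@"; do
        # Compare against the whole second field so that "NameNode"
        # does not match "SecondaryNameNode".
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Example usage on the master node (prints nothing if all are running):
#   jps | missing_daemons NameNode SecondaryNameNode ResourceManager JobHistoryServer
# Example usage on a slave node:
#   jps | missing_daemons DataNode NodeManager
```

Empty output means every listed daemon is running. As this article shows, that alone does not prove the daemons can reach each other.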
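2. Next, check the system logs, which are located in the logs subdirectory of the Hadoop installation directory. Looking at the log file of one DataNode, $INSTALL_HADOOP/logs/hadoop-hadoop-datanode-hadoop-slave01-vm.log, we found the following exception messages: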

2017-02-13 05:24:52,166 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: hadoop-master-vm/10.220.33.37:9000
2017-02-13 05:24:58,168 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-master-vm/10.220.33.37:9000. 
Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2017-02-13 05:24:59,169 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoop-master-vm/10.220.33.37:9000.
Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

The log shows that the DataNode cannot connect to the NameNode, so the next step is to investigate the NameNode side.
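When a DataNode log is long, the endpoint it is failing to reach can be extracted directly. This is a sketch only, assuming the warning text matches the "Problem connecting to server:" lines shown above; the function name unreachable_namenodes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct "host/ip:port" target named in
# "Problem connecting to server:" warnings of a DataNode log file.
# Usage: unreachable_namenodes /path/to/datanode.log
unreachable_namenodes() {
    grep 'Problem connecting to server:' "$1" |
        sed 's/.*Problem connecting to server: *//' |
        sort -u
}
```

The printed endpoint is the address and port to examine on the NameNode host.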
3. On the NameNode, use netstat -l to view the listening ports. Service port 9000 appeared to be listening normally:
hadoop@hadoop-master-vm:~$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.1.1:10020         *:*                     LISTEN
tcp        0      0 hadoop-master-vm:9000   *:*                     LISTEN
tcp        0      0 hadoop-master-vm:50090  *:*                     LISTEN
tcp        0      0 *:netbios-ssn           *:*                     LISTEN
tcp        0      0 127.0.1.1:19888         *:*                     LISTEN
tcp        0      0 *:x11-1                 *:*                     LISTEN
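At this point we suspected that the system firewall was filtering packets on port 9000, so we ran sudo ufw allow 9000 to allow traffic on that port through the firewall, but this did not help. Re-examining the netstat -l output, we noticed that ports 10020 and 19888 were listening on an odd address, 127.0.1.1, even though mapred-site.xml configures them by hostname: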
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop-master-vm:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop-master-vm:19888</value>
</property>
</configuration>
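Listeners bound to a loopback address can be picked out of netstat output mechanically. The sketch below assumes the column layout shown earlier (local address in the fourth column, LISTEN in the last); the function name loopback_listeners is hypothetical. Feeding it netstat -tln rather than netstat -l prints numeric addresses, so name resolution cannot hide the address a socket is actually bound to.

```shell
#!/bin/sh
# Hypothetical helper: print the local address of every TCP listener that
# is bound to a loopback address (127.x.x.x).
# Reads netstat-style output on stdin, e.g.:  netstat -tln | loopback_listeners
loopback_listeners() {
    awk '$1 ~ /^tcp/ && $NF == "LISTEN" && $4 ~ /^127\./ { print $4 }'
}
```

Any Hadoop service port in the output is reachable only from the local machine, not from other cluster nodes.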
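The information shown by netstat -l conflicts with the mapred-site.xml configuration: these services should be listening on the address of hadoop-master-vm, not on 127.0.1.1. Checking /etc/hosts shows that 127.0.1.1 is also mapped to the hostname hadoop-master-vm: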
hadoop@hadoop-master-vm:~/hadoop-2.7.3/logs$ cat /etc/hosts
127.0.0.1       localhost
127.0.1.1      hadoop-master-vm

10.220.33.37    hadoop-master-vm
10.220.33.36    hadoop-slave01-vm
10.220.33.35    hadoop-slave02-vm
10.220.33.34    hadoop-slave03-vm

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
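A hosts file can be checked for this kind of loopback mapping before starting the cluster. The following is a minimal sketch; the function name loopback_mappings is hypothetical, and the file defaults to /etc/hosts when no second argument is given.

```shell
#!/bin/sh
# Hypothetical helper: print every loopback address (127.x.x.x) that a
# hosts file maps to the given hostname.
# Usage: loopback_mappings <hostname> [hosts-file]
loopback_mappings() {
    awk -v host="$1" '
        { sub(/#.*/, "") }          # strip comments; fields are re-split
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++)
                if ($i == host) { print $1; break }
        }
    ' "${2:-/etc/hosts}"
}

# Example usage on the master node:
#   loopback_mappings hadoop-master-vm
```

For the hosts file above, this prints 127.0.1.1 for hadoop-master-vm. Empty output means the hostname is not mapped to any loopback address.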
We commented out the 127.0.1.1 entry, restarted the relevant services, and checked the listening ports again:
hadoop@hadoop-master-vm:~$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 hadoop-master-vm:10020  *:*                     LISTEN
tcp        0      0 hadoop-master-vm:9000   *:*                     LISTEN
tcp        0      0 hadoop-master-vm:50090  *:*                     LISTEN
tcp        0      0 *:netbios-ssn           *:*                     LISTEN
tcp        0      0 hadoop-master-vm:19888  *:*                     LISTEN
tcp        0      0 *:10033                 *:*                     LISTEN
The listening addresses now match the configuration, and hdfs dfsadmin -report shows the DataNode information:
hadoop@hadoop-master-vm:~$ hdfs dfsadmin -report
Configured Capacity: 82035068928 (76.40 GB)
Present Capacity: 53878968320 (50.18 GB)
DFS Remaining: 53878870016 (50.18 GB)
DFS Used: 98304 (96 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (3):

Name: 10.220.33.35:50010 (hadoop-slave02-vm)
Hostname: hadoop-slave02-vm
Decommission Status : Normal
Configured Capacity: 27345022976 (25.47 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 9382637568 (8.74 GB)
DFS Remaining: 17962352640 (16.73 GB)
DFS Used%: 0.00%
DFS Remaining%: 65.69%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 13 06:50:53 EST 2017

Name: 10.220.33.34:50010 (hadoop-slave03-vm)
Hostname: hadoop-slave03-vm
Decommission Status : Normal
Configured Capacity: 27345022976 (25.47 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 9390850048 (8.75 GB)
DFS Remaining: 17954140160 (16.72 GB)
DFS Used%: 0.00%
DFS Remaining%: 65.66%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 13 06:50:53 EST 2017

Name: 10.220.33.36:50010 (hadoop-slave01-vm)
Hostname: hadoop-slave01-vm
Decommission Status : Normal
Configured Capacity: 27345022976 (25.47 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 9382612992 (8.74 GB)
DFS Remaining: 17962377216 (16.73 GB)
DFS Used%: 0.00%
DFS Remaining%: 65.69%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 13 06:50:53 EST 2017
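The difference between the failing and the working report can be reduced to a single number for use in a health check. This sketch assumes the report contains a "Live datanodes (N):" line as in the output above and treats its absence as zero live nodes; the function name live_datanodes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the number of live DataNodes found in
# `hdfs dfsadmin -report` output read from stdin. Prints 0 when the
# report has no "Live datanodes (N):" line.
live_datanodes() {
    awk '
        /^Live datanodes \([0-9]+\):/ {
            n = $3                  # third field looks like "(3):"
            gsub(/[^0-9]/, "", n)   # keep only the digits
            count = n
        }
        END { print count + 0 }
    '
}

# Example usage:
#   hdfs dfsadmin -report | live_datanodes
```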

Summary

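A misconfigured /etc/hosts file caused services to listen on the wrong address, which broke communication between the NameNode and the DataNodes: with hadoop-master-vm mapped to 127.0.1.1, services configured by hostname bound to a loopback address that other nodes cannot reach. Similar problems can be diagnosed with the same approach.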
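The fix used in this article, commenting out the loopback entry for the master hostname, can be previewed before editing the real file. This is a sketch under stated assumptions: the function name comment_loopback_entry is hypothetical, it writes the result to stdout without modifying the input, and it comments out the whole line, so a line that also carries other names (such as localhost) would need manual editing instead.

```shell
#!/bin/sh
# Hypothetical helper: print a hosts file with every loopback (127.x.x.x)
# entry for <hostname> commented out. The input file is left unchanged.
# Usage: comment_loopback_entry <hostname> [hosts-file]
comment_loopback_entry() {
    awk -v host="$1" '
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break    # rest of the line is a comment
                if ($i == host) { print "#" $0; next }
            }
        }
        { print }
    ' "${2:-/etc/hosts}"
}

# Example usage: review the result before replacing /etc/hosts by hand.
#   comment_loopback_entry hadoop-master-vm /etc/hosts
```

After the hosts file is changed, the Hadoop services need to be restarted for the new listening addresses to take effect, as described above.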