Why the Master can control only 1 RegionServer and not the others: root-cause analysis
2014-05-10 10:23
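While testing HBase recently I ran into a very strange problem. The cluster has 7 machines: 1 Master and 6 RegionServers. The Master could control only 1 of the RegionServers and not the other 5.
Opening the master's log file showed the following error: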
2011-04-22 16:37:21,242 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of -ROOT-,,0.70236052 to serverName=hp2,60020,1303461559353, load=(requests=0, regions=0, usedHeap=28, maxHeap=3979), trying to assign elsewhere instead; retry=0
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface to /10.131.18.3:60020 after attempts=1
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:355)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:965)
at org.apache.hadoop.hbase.master.ServerManager.getServerConnection(ServerManager.java:606)
at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:541)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:920)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:730)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:710)
at org.apache.hadoop.hbase.master.AssignmentManager.assignRoot(AssignmentManager.java:1189)
at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:432)
at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:389)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
at $Proxy7.getProtocolVersion(Unknown Source)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:419)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:393)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:444)
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:349)
... 10 more
This log shows that the master cannot communicate with the regionserver at IP address 10.131.18.3.
Next, I went to the 10.131.18.3 machine and looked at the startup messages in its regionserver log:
2011-04-14 18:32:05,122 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 10 on 60020: starting
2011-04-14 18:32:05,122 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 11 on 60020: starting
2011-04-14 18:32:05,122 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 12 on 60020: starting
2011-04-14 18:32:05,122 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 13 on 60020: starting
2011-04-14 18:32:05,122 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 14 on 60020: starting
2011-04-14 18:32:05,122 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 15 on 60020: starting
2011-04-14 18:32:05,122 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 16 on 60020: starting
2011-04-14 18:32:05,122 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 17 on 60020: starting
2011-04-14 18:32:05,123 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 18 on 60020: starting
2011-04-14 18:32:05,123 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 19 on 60020: starting
2011-04-14 18:32:05,123 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: starting
2011-04-14 18:32:05,123 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 1 on 60020: starting
2011-04-14 18:32:05,123 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 2 on 60020: starting
2011-04-14 18:32:05,123 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: starting
2011-04-14 18:32:05,123 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: starting
2011-04-14 18:32:05,123 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: starting
2011-04-14 18:32:05,123 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 6 on 60020: starting
2011-04-14 18:32:05,124 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: starting
2011-04-14 18:32:05,124 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: starting
2011-04-14 18:32:05,124 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: starting
2011-04-14 18:32:05,124 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as dell4,60020,1302777124101, RPC listening on /127.0.0.1:60020, sessionid=0x12f535856620004
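As the log shows, the regionserver on this machine started successfully, but its RPC listener is bound to the loopback address (127.0.0.1). The master therefore cannot communicate with this regionserver; the listen address should have been 10.131.18.3.
Looking at the source, the code that returns the RPC bind address is: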
/** @return Bind address */
public String getBindAddress() {
  final InetAddress addr = address.getAddress();
  if (addr != null) {
    return addr.getHostAddress();
  } else {
    LogFactory.getLog(HServerAddress.class).error("Could not resolve the"
        + " DNS name of " + stringValue);
    return null;
  }
}
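The code is correct, so some configuration on the machine must be causing Java to resolve the local IP address incorrectly. Finally, I checked this machine's hosts file: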
[hadoop@hp2 logs]$ vi /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 hp2 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.131.18.8 dell1
10.131.18.5 dell2
10.131.18.6 dell3
10.131.18.7 dell4
10.131.18.2 hp1
10.131.18.3 hp2
10.131.18.4 hp3
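That is the problem: the first line of the hosts file maps the hostname hp2 to 127.0.0.1, ahead of the 10.131.18.3 entry, so the hostname resolves to the loopback address. I changed the hosts file to the following: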
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.131.18.8 dell1
10.131.18.5 dell2
10.131.18.6 dell3
10.131.18.7 dell4
10.131.18.2 hp1
10.131.18.3 hp2
10.131.18.4 hp3
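The same mistake is easy to repeat on other nodes, so it can help to scan a hosts file for it mechanically. The following is a minimal sketch, not part of HBase: the class name HostsLoopbackCheck and the method namesOnLoopback are hypothetical, and the rule "any name on a 127.x.x.x or ::1 line that does not start with localhost is suspicious" is an assumption chosen for this cluster's naming scheme.

```java
import java.util.ArrayList;
import java.util.List;

public class HostsLoopbackCheck {

    /**
     * Returns every name that the given hosts-file text maps to a loopback
     * address (127.x.x.x or ::1), ignoring names that start with "localhost".
     */
    static List<String> namesOnLoopback(String hostsText) {
        List<String> flagged = new ArrayList<>();
        for (String raw : hostsText.split("\n")) {
            // Drop trailing comments and surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            String ip = fields[0];
            boolean loopback = ip.startsWith("127.") || ip.equals("::1");
            if (!loopback) {
                continue;
            }
            // Fields after the address are the names bound to it.
            for (int i = 1; i < fields.length; i++) {
                if (!fields[i].startsWith("localhost")) {
                    flagged.add(fields[i]);
                }
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // The relevant lines of the broken hosts file shown earlier.
        String hosts = "127.0.0.1 hp2 localhost.localdomain localhost\n"
                + "::1 localhost6.localdomain6 localhost6\n"
                + "10.131.18.3 hp2\n";
        System.out.println(namesOnLoopback(hosts)); // prints [hp2]
    }
}
```

Run against the corrected file, the method returns an empty list, because the loopback lines then carry only localhost names.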
After restarting the whole cluster, the problem was solved.
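To confirm on each node how the JVM resolves the local hostname, a small standalone check can be run before starting the regionserver. This is only a sketch using the JDK's InetAddress.getLocalHost(); it is not the code path HBase itself takes, and the class name LocalAddressCheck and the method describe are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LocalAddressCheck {

    /** Formats an address and says whether it is a loopback address. */
    static String describe(InetAddress addr) {
        String kind = addr.isLoopbackAddress()
                ? " (loopback: remote nodes cannot connect)"
                : " (non-loopback)";
        return addr.getHostAddress() + kind;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Resolves this machine's hostname the way the JDK does,
        // which on Linux typically consults /etc/hosts first.
        InetAddress local = InetAddress.getLocalHost();
        System.out.println(local.getHostName() + " -> " + describe(local));
    }
}
```

On a node with the broken hosts file this would be expected to report a loopback address for the hostname; after the fix it should report the node's 10.131.18.x address.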