Data center power outage causes Hadoop HDFS corrupt blocks
2016-06-06 15:39
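An emergency power cut at the company over the weekend took down the Hadoop test cluster in our machine room. When we came back on Monday to restart the cluster, Cloudera Manager reported HDFS block corruption errors. This CDH test cluster hosts an HBase cluster and a Solr cluster, both of which store their data on HDFS.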
1) From the shell, run:
hdfs fsck /
This reports a large number of corrupt blocks.
The output looks like this:
FSCK started by root (auth:SIMPLE) from /172.16.8.165 for path / at Mon Jun 06 19:15:02 CST 2016
..........
/hbase/data/default/C_PICRECORD_IDX_COLLISION/147d35ece51f9930b851faaee205067c/info/a028095ab9474f3bb8edb9a98d3a0e53: Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324588_4583931. Target Replicas is 3 but found 1 replica(s).
..........
/hbase/data/default/C_PICRECORD_IDX_COLLISION/5181381e418e083a907116b3c4c76551/info/9c5aaf4c46e34422a00d041fd6d25b19: Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324587_4583930. Target Replicas is 3 but found 1 replica(s).
..........
/hbase/data/default/C_PICRECORD_IDX_COLLISION/58b932aad968d950415d099e39b3dc5a/info/2e691116caba4872be64d64c115b1ca7: Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324585_4583928. Target Replicas is 3 but found 2 replica(s).
..........
/hbase/data/default/C_PICRECORD_IDX_COLLISION/7867301490e544790671057d7e18ee28/info/39f7d025b4504c38b514e80c43b721f7: CORRUPT blockpool BP-1471870221-192.168.27.165-1461916959086 block blk_1078324602
/hbase/data/default/C_PICRECORD_IDX_COLLISION/7867301490e544790671057d7e18ee28/info/39f7d025b4504c38b514e80c43b721f7: MISSING 1 blocks of total size 127604197 B
..........
/hbase/data/default/C_PICRECORD_IDX_COLLISION/eb89de3a0550e56e3625c0bd87f592c3/info/32027e5b58274abbb1733b9e9d62a4b3: Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324586_4583929. Target Replicas is 3 but found 1 replica(s).
..........
/solr/C_PICRECORD/core_node1/data/index/_3rq7.fdt: CORRUPT blockpool BP-1471870221-192.168.27.165-1461916959086 block blk_1078324798
/solr/C_PICRECORD/core_node1/data/index/_3rq7.fdt: MISSING 1 blocks of total size 778472 B
/solr/C_PICRECORD/core_node1/data/index/_3rq7.fdx: CORRUPT blockpool BP-1471870221-192.168.27.165-1461916959086 block blk_1078324784
...
..........
Status: CORRUPT
 Total size: 111683665394 B (Total open files size: 326518049 B)
 Total dirs: 1301
 Total files: 3634
 Total symlinks: 0 (Files currently being written: 29)
 Total blocks (validated): 3925 (avg. block size 28454437 B) (Total open file blocks (not validated): 29)
 ********************************
 CORRUPT FILES: 560
 MISSING BLOCKS: 560
 MISSING SIZE: 209925836 B
 CORRUPT BLOCKS: 560
 ********************************
 Minimally replicated blocks: 3365 (85.73248 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 4 (0.10191083 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 2.5640764
 Corrupt blocks: 560
 Missing replicas: 7 (0.0595694 %)
 Number of data-nodes: 3
 Number of racks: 1
FSCK ended at Mon Jun 06 19:15:02 CST 2016 in 207 milliseconds

The filesystem under path '/' is CORRUPT
Note:
Corrupt blocks: 560
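With hundreds of corrupt blocks, it helps to see which service owns them before restarting anything. The following is a minimal Python sketch, not part of the original troubleshooting, that counts the CORRUPT lines in saved fsck output per top-level HDFS directory. The function name corrupt_blocks_by_top_dir is hypothetical, and the regular expression assumes the "path: CORRUPT blockpool ... block blk_..." line format shown in the output above.

```python
import re
from collections import Counter

# Assumed line format, taken from the fsck output above:
#   /solr/.../_3rq7.fdt: CORRUPT blockpool BP-... block blk_1078324798
# Leading progress dots are tolerated in case they share a line with the path.
CORRUPT_LINE = re.compile(r"^\.*\s*(/\S+): CORRUPT blockpool \S+ block (blk_\d+)")


def corrupt_blocks_by_top_dir(fsck_output: str) -> Counter:
    """Count CORRUPT block reports per top-level HDFS directory."""
    counts = Counter()
    for line in fsck_output.splitlines():
        match = CORRUPT_LINE.match(line.strip())
        if match:
            path = match.group(1)
            # "/hbase/data/..." -> "/hbase"
            top_dir = "/" + path.split("/")[1]
            counts[top_dir] += 1
    return counts
```

One way to use it is to save the report with "hdfs fsck / > fsck.txt" and pass the file contents to the function. Running it again after each service restart shows whether the /solr or /hbase count has changed.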
2) Try starting the Solr cluster. After the restart, run
hdfs fsck /
again. Only the HBase-related corrupt blocks remain.
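3) Try restarting the HBase cluster. After the restart, run
hdfs fsck /
again. The HBase-related corrupt blocks are still there.
The HBase Master web UI shows that every region of the C_PICRECORD table is offline, and that one region of C_PICRECORD_IDX_COLLISION is offline.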
From the shell, run
hbase hbck
which gives the following result:
HBaseFsck command line options:
Version: 1.0.0-cdh5.4.2
Number of live region servers: 3
Number of dead region servers: 0
Master: master,60000,1465219825898
Number of backup masters: 0
Average load: 96.66666666666667
Number of requests: 14
Number of regions: 290
Number of regions in transition: 16
Number of empty REGIONINFO_QUALIFIER rows in hbase:meta: 0
Number of Tables: 8
ERROR: Region { meta => C_PICRECORD,\x0B,1464994589166.002457068da67399d7b3e199f1c36cc1., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/002457068da67399d7b3e199f1c36cc1, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0A,1464994589166.0bac6481608a98607d30f510317e5a3f., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/0bac6481608a98607d30f510317e5a3f, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0C,1464994589166.0cbbb1a413145d9289c096f0fb3f0d5e., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/0cbbb1a413145d9289c096f0fb3f0d5e, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x08,1464994589166.16e1bd458911637b07e16e35e2a5d700., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/16e1bd458911637b07e16e35e2a5d700, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x04,1464994589166.300652a5c355b5be8ecf0f7f29fe6e23., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/300652a5c355b5be8ecf0f7f29fe6e23, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x01,1464994589166.5461981c1498b8c031a723e41f92899d., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/5461981c1498b8c031a723e41f92899d, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x05,1464994589166.6530d18ae3d09d0570bb98f56a25254c., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/6530d18ae3d09d0570bb98f56a25254c, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,,1464994589166.751a488c823dc5e0384273fc6cf9435c., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/751a488c823dc5e0384273fc6cf9435c, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD_IDX_COLLISION,\x06\x00\x00\x00\x00\x00,1464994612530.7867301490e544790671057d7e18ee28., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD_IDX_COLLISION/7867301490e544790671057d7e18ee28, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x02,1464994589166.8da6d9faebabeae4cc4f7591a0bddf1a., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/8da6d9faebabeae4cc4f7591a0bddf1a, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x06,1464994589166.a0016ea8564da9b848012860921bd612., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/a0016ea8564da9b848012860921bd612, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x09,1464994589166.a574bd7a2e6d49d25cd6fa51510155e1., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/a574bd7a2e6d49d25cd6fa51510155e1, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x03,1464994589166.b2e9309a7d0dca1218d3dcbef5511f1a., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/b2e9309a7d0dca1218d3dcbef5511f1a, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0E,1464994589166.b9a199b401d9e21383551e8ef4f0a090., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/b9a199b401d9e21383551e8ef4f0a090, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x07,1464994589166.bd0d12dbbdbe4a3c01cb6c818172b083., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/bd0d12dbbdbe4a3c01cb6c818172b083, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0D,1464994589166.ce84407fd5dfaad56552c04620a4745c., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/ce84407fd5dfaad56552c04620a4745c, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: There is a hole in the region chain between \x06\x00\x00\x00\x00\x00 and \x07\x00\x00\x00\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table C_PICRECORD_IDX_COLLISION
ERROR: There is a hole in the region chain between and . You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table C_PICRECORD
Summary:
  hbase:meta is okay.
    Number of regions: 1
    Deployed on: slave2,60020,1465219825115
  C_PICRECORD_IDX_COLLISION is okay.
    Number of regions: 14
    Deployed on: slave1,60020,1465219825663 slave2,60020,1465219825115 slave3,60020,1465219824655
  SYSTEM.CATALOG is okay.
    Number of regions: 1
    Deployed on: slave3,60020,1465219824655
  C_PICRECORD is okay.
    Number of regions: 0
    Deployed on:
  hbase:namespace is okay.
    Number of regions: 1
    Deployed on: slave3,60020,1465219824655
  SYSTEM.SEQUENCE is okay.
    Number of regions: 256
    Deployed on: slave1,60020,1465219825663 slave2,60020,1465219825115 slave3,60020,1465219824655
  SYSTEM.FUNCTION is okay.
    Number of regions: 1
    Deployed on: slave3,60020,1465219824655
  C_PICRECORD_IDX is okay.
    Number of regions: 15
    Deployed on: slave1,60020,1465219825663 slave2,60020,1465219825115 slave3,60020,1465219824655
  SYSTEM.STATS is okay.
    Number of regions: 1
    Deployed on: slave3,60020,1465219824655
18 inconsistencies detected.
Status: INCONSISTENT
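The hbck report is long, so a short summary of which tables have undeployed regions, and where those regions live on HDFS, makes it easier to cross-check against the fsck corrupt-file list. This is a minimal Python sketch under assumptions: the function name undeployed_regions is hypothetical, and the pattern only covers the "not deployed on any region server" line format that appears in the output above.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the hbck output above:
#   ERROR: Region { meta => TABLE,STARTKEY,TS.ENCODED., hdfs => hdfs://...,
#   deployed => , replicaId => 0 } not deployed on any region server.
UNDEPLOYED = re.compile(
    r"ERROR: Region \{ meta => (?P<table>[^,]+),.*?"
    r"hdfs => (?P<hdfs>\S+), deployed => .*?\} "
    r"not deployed on any region server\."
)


def undeployed_regions(hbck_output: str) -> dict:
    """Map each table name to the HDFS directories of its undeployed regions."""
    regions = defaultdict(list)
    for match in UNDEPLOYED.finditer(hbck_output):
        regions[match.group("table")].append(match.group("hdfs"))
    return dict(regions)
```

For the report above, this would group 15 region directories under C_PICRECORD and one under C_PICRECORD_IDX_COLLISION, matching what the HBase Master web UI showed.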
We tried
hbase hbck -fix
and
hbase hbck -repair
to fix this, but both failed.
4) Use
hdfs fsck / -delete
to remove the corrupt HBase blocks outright, then restart the HBase cluster. All regions come online and the problem is resolved.
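[Note]
Using
hdfs fsck / -delete
to remove corrupt HDFS blocks causes data loss, because it deletes the files that contain them. We have not yet found a perfect way to repair the damaged blocks and look forward to a better solution!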
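Since the cleanup loses data, it may be worth narrowing its scope and reviewing the commands before running them. The sketch below is an illustration under assumptions, not the procedure used above: it takes a list of corrupt file paths (for example, collected from fsck output) and builds one fsck command per affected top-level directory, defaulting to -move, the fsck option that moves corrupt files to /lost+found instead of deleting them. The function name plan_cleanup is hypothetical, and the commands are only returned, never executed.

```python
from typing import Iterable, List


def plan_cleanup(corrupt_paths: Iterable[str], action: str = "-move") -> List[List[str]]:
    """Build one `hdfs fsck <dir> <action>` command per affected top-level directory.

    action is "-move" (move corrupt files to /lost+found) or "-delete".
    Nothing is executed here; the caller reviews and runs the commands.
    """
    if action not in ("-move", "-delete"):
        raise ValueError("action must be '-move' or '-delete'")
    top_dirs = set()
    for path in corrupt_paths:
        if not path.startswith("/"):
            raise ValueError(f"expected an absolute HDFS path, got {path!r}")
        # "/hbase/data/..." -> "/hbase"
        top_dirs.add("/" + path.split("/")[1])
    return [["hdfs", "fsck", top_dir, action] for top_dir in sorted(top_dirs)]
```

Scoping the command to /hbase rather than / keeps it from touching corrupt files that belong to other services, and keeping a copy of the path list records exactly which files were affected.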