
HDFS corrupt blocks on a Hadoop cluster after a machine-room power outage

2016-06-06 15:39
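Over the weekend, an emergency power cut at the office took down the Hadoop test cluster in our machine room. When we came back on Monday and went to restart the cluster, Cloudera Manager reported HDFS corrupt-block errors. This CDH test cluster hosts an HBase cluster and a Solr cluster, both of which keep their data on HDFS.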

1) From a shell, run:

hdfs fsck /


This revealed a large number of corrupt blocks.

The output looked like this:

FSCK started by root (auth:SIMPLE) from /172.16.8.165 for path / at Mon Jun 06 19:15:02 CST 2016
....................................................................................................
....................................................................................................
....................................................................................................
..........
/hbase/data/default/C_PICRECORD_IDX_COLLISION/147d35ece51f9930b851faaee205067c/info/a028095ab9474f3bb8edb9a98d3a0e53:  Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324588_4583931. Target Replicas is 3 but found 1 replica(s).
.....................
/hbase/data/default/C_PICRECORD_IDX_COLLISION/5181381e418e083a907116b3c4c76551/info/9c5aaf4c46e34422a00d041fd6d25b19:  Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324587_4583930. Target Replicas is 3 but found 1 replica(s).
.....
/hbase/data/default/C_PICRECORD_IDX_COLLISION/58b932aad968d950415d099e39b3dc5a/info/2e691116caba4872be64d64c115b1ca7:  Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324585_4583928. Target Replicas is 3 but found 2 replica(s).
.................
/hbase/data/default/C_PICRECORD_IDX_COLLISION/7867301490e544790671057d7e18ee28/info/39f7d025b4504c38b514e80c43b721f7: CORRUPT blockpool BP-1471870221-192.168.27.165-1461916959086 block blk_1078324602

/hbase/data/default/C_PICRECORD_IDX_COLLISION/7867301490e544790671057d7e18ee28/info/39f7d025b4504c38b514e80c43b721f7: MISSING 1 blocks of total size 127604197 B...............................
/hbase/data/default/C_PICRECORD_IDX_COLLISION/eb89de3a0550e56e3625c0bd87f592c3/info/32027e5b58274abbb1733b9e9d62a4b3:  Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324586_4583929. Target Replicas is 3 but found 1 replica(s).

....

.................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
.....................................................................
/solr/C_PICRECORD/core_node1/data/index/_3rq7.fdt: CORRUPT blockpool BP-1471870221-192.168.27.165-1461916959086 block blk_1078324798

/solr/C_PICRECORD/core_node1/data/index/_3rq7.fdt: MISSING 1 blocks of total size 778472 B..
/solr/C_PICRECORD/core_node1/data/index/_3rq7.fdx: CORRUPT blockpool BP-1471870221-192.168.27.165-1461916959086 block blk_1078324784

...

....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
..................................Status: CORRUPT
Total size:    111683665394 B (Total open files size: 326518049 B)
Total dirs:    1301
Total files:   3634
Total symlinks:        0 (Files currently being written: 29)
Total blocks (validated):  3925 (avg. block size 28454437 B) (Total open file blocks (not validated): 29)
********************************
CORRUPT FILES:    560
MISSING BLOCKS:   560
MISSING SIZE:     209925836 B
CORRUPT BLOCKS:   560
********************************
Minimally replicated blocks:   3365 (85.73248 %)
Over-replicated blocks:    0 (0.0 %)
Under-replicated blocks:   4 (0.10191083 %)
Mis-replicated blocks:     0 (0.0 %)
Default replication factor:    3
Average block replication: 2.5640764
Corrupt blocks:        560
Missing replicas:      7 (0.0595694 %)
Number of data-nodes:      3
Number of racks:       1
FSCK ended at Mon Jun 06 19:15:02 CST 2016 in 207 milliseconds

The filesystem under path '/' is CORRUPT


Note this line in particular:

Corrupt blocks: 560
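
When the same check is repeated after every restart, it helps to pull the counters out of the report instead of reading them by eye. The following is a minimal sketch, assuming the report has the layout shown above; the function name parse_fsck_summary and the choice of counters are my own, not part of any Hadoop tooling.

```python
import re

# Counters worth watching after an unclean shutdown, as printed by fsck
# (lower-cased here so "CORRUPT BLOCKS" and "Corrupt blocks" map to one key).
COUNTERS = ("corrupt files", "missing blocks", "missing size", "corrupt blocks",
            "under-replicated blocks", "missing replicas")


def parse_fsck_summary(report):
    """Return the overall status and selected counters from fsck report text."""
    summary = {}
    for line in report.splitlines():
        # The status is glued to the end of the progress dots: "....Status: CORRUPT"
        status = re.search(r"Status:\s*(\w+)", line)
        if status:
            summary["status"] = status.group(1)
            continue
        # Summary lines look like "  Name of counter:   <integer> ..."
        counter = re.match(r"\s*([A-Za-z][A-Za-z \-]*?):\s+(\d+)\b", line)
        if counter and counter.group(1).lower() in COUNTERS:
            summary[counter.group(1).lower()] = int(counter.group(2))
    return summary
```

The report text could come from something like subprocess.run(["hdfs", "fsck", "/"], capture_output=True, text=True).stdout, assuming the hdfs client is on the PATH; comparing the dictionaries from two runs shows at a glance whether a restart changed anything.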

2) Tried starting the Solr cluster, then ran the check again:
hdfs fsck /


Only the HBase-related corrupt blocks remained.
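
To see which service the remaining corrupt blocks belong to, the per-file CORRUPT lines can be grouped by their top-level HDFS directory (/hbase, /solr, and so on). This is a sketch under the assumption that the lines have the "path: CORRUPT blockpool ... block blk_..." shape shown in the output above; the helper name is hypothetical, and paths containing spaces are not handled.

```python
import re
from collections import Counter

# Matches lines such as:
# /solr/.../_3rq7.fdt: CORRUPT blockpool BP-... block blk_1078324798
CORRUPT_LINE = re.compile(r"^(/\S+): CORRUPT blockpool \S+ block (blk_\d+)")


def corrupt_blocks_by_top_dir(report):
    """Count CORRUPT block lines per top-level HDFS directory."""
    counts = Counter()
    for line in report.splitlines():
        match = CORRUPT_LINE.match(line.strip())
        if match:
            # "/hbase/data/..." splits into ["", "hbase", "data", ...]
            top_dir = "/" + match.group(1).split("/")[1]
            counts[top_dir] += 1
    return counts
```

A result with only a /hbase entry would correspond to the situation described in this step.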

3) Tried restarting the HBase cluster, then ran the check again:
hdfs fsck /
The HBase-related corrupt blocks were still there.

The HBase Master web UI showed every region of the C_PICRECORD table as offline, and one region of C_PICRECORD_IDX_COLLISION as offline.

From a shell, run:

hbase hbck


which produced the following output:

HBaseFsck command line options:
Version: 1.0.0-cdh5.4.2
Number of live region servers: 3
Number of dead region servers: 0
Master: master,60000,1465219825898
Number of backup masters: 0
Average load: 96.66666666666667
Number of requests: 14
Number of regions: 290
Number of regions in transition: 16

Number of empty REGIONINFO_QUALIFIER rows in hbase:meta: 0
Number of Tables: 8
ERROR: Region { meta => C_PICRECORD,\x0B,1464994589166.002457068da67399d7b3e199f1c36cc1., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/002457068da67399d7b3e199f1c36cc1, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0A,1464994589166.0bac6481608a98607d30f510317e5a3f., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/0bac6481608a98607d30f510317e5a3f, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0C,1464994589166.0cbbb1a413145d9289c096f0fb3f0d5e., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/0cbbb1a413145d9289c096f0fb3f0d5e, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x08,1464994589166.16e1bd458911637b07e16e35e2a5d700., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/16e1bd458911637b07e16e35e2a5d700, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x04,1464994589166.300652a5c355b5be8ecf0f7f29fe6e23., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/300652a5c355b5be8ecf0f7f29fe6e23, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x01,1464994589166.5461981c1498b8c031a723e41f92899d., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/5461981c1498b8c031a723e41f92899d, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x05,1464994589166.6530d18ae3d09d0570bb98f56a25254c., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/6530d18ae3d09d0570bb98f56a25254c, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,,1464994589166.751a488c823dc5e0384273fc6cf9435c., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/751a488c823dc5e0384273fc6cf9435c, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD_IDX_COLLISION,\x06\x00\x00\x00\x00\x00,1464994612530.7867301490e544790671057d7e18ee28., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD_IDX_COLLISION/7867301490e544790671057d7e18ee28, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x02,1464994589166.8da6d9faebabeae4cc4f7591a0bddf1a., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/8da6d9faebabeae4cc4f7591a0bddf1a, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x06,1464994589166.a0016ea8564da9b848012860921bd612., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/a0016ea8564da9b848012860921bd612, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x09,1464994589166.a574bd7a2e6d49d25cd6fa51510155e1., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/a574bd7a2e6d49d25cd6fa51510155e1, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x03,1464994589166.b2e9309a7d0dca1218d3dcbef5511f1a., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/b2e9309a7d0dca1218d3dcbef5511f1a, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0E,1464994589166.b9a199b401d9e21383551e8ef4f0a090., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/b9a199b401d9e21383551e8ef4f0a090, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x07,1464994589166.bd0d12dbbdbe4a3c01cb6c818172b083., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/bd0d12dbbdbe4a3c01cb6c818172b083, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0D,1464994589166.ce84407fd5dfaad56552c04620a4745c., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/ce84407fd5dfaad56552c04620a4745c, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: There is a hole in the region chain between \x06\x00\x00\x00\x00\x00 and \x07\x00\x00\x00\x00\x00.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table C_PICRECORD_IDX_COLLISION
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table C_PICRECORD
Summary:
hbase:meta is okay.
Number of regions: 1
Deployed on:  slave2,60020,1465219825115
C_PICRECORD_IDX_COLLISION is okay.
Number of regions: 14
Deployed on:  slave1,60020,1465219825663 slave2,60020,1465219825115 slave3,60020,1465219824655
SYSTEM.CATALOG is okay.
Number of regions: 1
Deployed on:  slave3,60020,1465219824655
C_PICRECORD is okay.
Number of regions: 0
Deployed on:
hbase:namespace is okay.
Number of regions: 1
Deployed on:  slave3,60020,1465219824655
SYSTEM.SEQUENCE is okay.
Number of regions: 256
Deployed on:  slave1,60020,1465219825663 slave2,60020,1465219825115 slave3,60020,1465219824655
SYSTEM.FUNCTION is okay.
Number of regions: 1
Deployed on:  slave3,60020,1465219824655
C_PICRECORD_IDX is okay.
Number of regions: 15
Deployed on:  slave1,60020,1465219825663 slave2,60020,1465219825115 slave3,60020,1465219824655
SYSTEM.STATS is okay.
Number of regions: 1
Deployed on:  slave3,60020,1465219824655
18 inconsistencies detected.
Status: INCONSISTENT
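
The "not deployed on any region server" errors above can be tallied per table to cross-check the web UI. In the output shown, 15 of the 16 such errors belong to C_PICRECORD and one to C_PICRECORD_IDX_COLLISION, and that one region (7867301490e544790671057d7e18ee28) is the same directory that fsck flagged as CORRUPT. A minimal sketch of the tally, assuming the error-line format shown above (the function name is my own):

```python
import re
from collections import Counter

# The table name is whatever follows "meta => " up to the first comma.
UNDEPLOYED = re.compile(
    r"^ERROR: Region \{ meta => ([^,]+),.*not deployed on any region server\."
)


def undeployed_regions_by_table(hbck_output):
    """Count regions that hbck reports as not deployed, grouped by table."""
    counts = Counter()
    for line in hbck_output.splitlines():
        match = UNDEPLOYED.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts
```

Together with the two "hole in the region chain" errors, these 16 regions account for the 18 inconsistencies in the summary.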


Tried repairing with
hbase hbck -fix
and
hbase hbck -repair
but both failed.

4) Ran
hdfs fsck / -delete
to delete the HBase files with corrupt blocks outright, then restarted the HBase cluster. All regions came online and the problem was resolved.

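[Note]

Deleting the damaged HDFS blocks with
hdfs fsck / -delete
causes data loss. I have not yet found a clean way to repair the corrupt blocks themselves, and would welcome a better approach.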
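
Before accepting that loss, it is worth recording exactly which files will be affected. fsck's -delete option removes the corrupted files themselves, not only the bad blocks, so the list of paths matters at least as much as the byte count. The sketch below collects the MISSING lines from a saved fsck report; it assumes the line format shown earlier in this post, and the function name is hypothetical.

```python
import re

# Matches lines such as:
# /solr/.../_3rq7.fdt: MISSING 1 blocks of total size 778472 B..
MISSING_LINE = re.compile(r"^(/\S+): MISSING (\d+) blocks of total size (\d+) B")


def missing_bytes_by_file(report):
    """Map each file with missing blocks to the number of missing bytes."""
    losses = {}
    for line in report.splitlines():
        match = MISSING_LINE.match(line.strip())
        if match:
            losses[match.group(1)] = int(match.group(3))
    return losses
```

Saving this mapping next to the fsck report gives a record of what was lost. If the files might still be partly salvageable, fsck also has a -move option that moves corrupted files to /lost+found instead of deleting them; I did not try it here, so treat that as something to evaluate rather than a tested fix.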
Tags: hadoop, hdfs