ceph (luminous) data disk failure test
2017-11-24 18:31
Purpose
- Simulate a data disk failure on ceph (luminous)
- Repair the resulting failure
Environment
Refer to the manual ceph deployment guide (luminous) for how this environment was set up. Current ceph environment:
ceph -s
  cluster:
    id:     c45b752d-5d4d-4d3a-a3b2-04e73eff4ccd
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum hh-ceph-128040,hh-ceph-128214,hh-ceph-128215
    mgr: openstack(active)
    osd: 36 osds: 36 up, 36 in

  data:
    pools:   1 pools, 2048 pgs
    objects: 28024 objects, 109 GB
    usage:   331 GB used, 196 TB / 196 TB avail
    pgs:     2048 active+clean
osd tree (excerpt)
[root@hh-ceph-128214 ceph]# ceph osd tree
ID  CLASS WEIGHT    TYPE NAME                   STATUS REWEIGHT PRI-AFF
 -1       216.00000 root default
-10        72.00000     rack racka07
 -3        72.00000         host hh-ceph-128214
 12   hdd   6.00000             osd.12              up  1.00000 1.00000
 13   hdd   6.00000             osd.13              up  1.00000 1.00000
 14   hdd   6.00000             osd.14              up  1.00000 1.00000
 15   hdd   6.00000             osd.15              up  1.00000 1.00000
 16   hdd   6.00000             osd.16              up  1.00000 1.00000
 17   hdd   6.00000             osd.17              up  1.00000 1.00000
 18   hdd   6.00000             osd.18              up  1.00000 1.00000
 19   hdd   6.00000             osd.19              up  1.00000 1.00000
 20   hdd   6.00000             osd.20              up  1.00000 1.00000
 21   hdd   6.00000             osd.21              up  1.00000 1.00000
 22   hdd   6.00000             osd.22              up  1.00000 1.00000
 23   hdd   6.00000             osd.23              up  1.00000 1.00000
 -9        72.00000     rack racka12
 -2        72.00000         host hh-ceph-128040
  0   hdd   6.00000             osd.0               up  1.00000 0.50000
  1   hdd   6.00000             osd.1               up  1.00000 1.00000
  2   hdd   6.00000             osd.2               up  1.00000 1.00000
  3   hdd   6.00000             osd.3               up  1.00000 1.00000
Failure simulation
Wipe the contents of the osd.14 data directory to simulate a dead data disk:
[root@hh-ceph-128214 ceph]# df -h | grep ceph-14
/dev/sdc1       5.5T  8.8G  5.5T   1% /var/lib/ceph/osd/ceph-14
/dev/sdn3       4.7G  2.1G  2.7G  44% /var/lib/ceph/journal/ceph-14
[root@hh-ceph-128214 ceph]# rm -rf /var/lib/ceph/osd/ceph-14/*
[root@hh-ceph-128214 ceph]# ls /var/lib/ceph/osd/ceph-14/
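Before wiping anything on a real host, it is worth double-checking which device and mount actually back the OSD. A minimal pre-check sketch (osd.14 and the filestore layout above are assumed):

ceph osd metadata 14                  # reports osd_data, journal path and backing device
findmnt /var/lib/ceph/osd/ceph-14     # confirm which partition is mounted there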
Query the current status:
  cluster:
    id:     c45b752d-5d4d-4d3a-a3b2-04e73eff4ccd
    health: HEALTH_WARN
            1 osds down
            Degraded data redundancy: 3246/121608 objects degraded (2.669%), 124 pgs unclean, 155 pgs degraded

  services:
    mon: 3 daemons, quorum hh-ceph-128040,hh-ceph-128214,hh-ceph-128215
    mgr: openstack(active)
    osd: 36 osds: 35 up, 36 in

  data:
    pools:   1 pools, 2048 pgs
    objects: 40536 objects, 157 GB
    usage:   493 GB used, 195 TB / 196 TB avail
    pgs:     3246/121608 objects degraded (2.669%)
             1893 active+clean
             155  active+undersized+degraded

  io:
    client: 132 kB/s rd, 177 MB/s wr, 165 op/s rd, 175 op/s wr
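While the cluster is degraded, the state can be tracked continuously instead of re-running ceph -s by hand. A small monitoring sketch (no cluster-specific assumptions):

watch -n 2 'ceph -s'      # refresh the status view every 2 seconds
ceph health detail        # expand HEALTH_WARN into the individual degraded PGs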
See the osd tree; osd.14 is now marked down:
[root@hh-ceph-128214 ceph]# ceph osd tree
ID  CLASS WEIGHT    TYPE NAME                   STATUS REWEIGHT PRI-AFF
 -1       216.00000 root default
-10        72.00000     rack racka07
 -3        72.00000         host hh-ceph-128214
 12   hdd   6.00000             osd.12              up  1.00000 1.00000
 13   hdd   6.00000             osd.13              up  1.00000 1.00000
 14   hdd   6.00000             osd.14            down  1.00000 1.00000
 15   hdd   6.00000             osd.15              up  1.00000 1.00000
 16   hdd   6.00000             osd.16              up  1.00000 1.00000
 17   hdd   6.00000             osd.17              up  1.00000 1.00000
 18   hdd   6.00000             osd.18              up  1.00000 1.00000
 19   hdd   6.00000             osd.19              up  1.00000 1.00000
 20   hdd   6.00000             osd.20              up  1.00000 1.00000
 21   hdd   6.00000             osd.21              up  1.00000 1.00000
 22   hdd   6.00000             osd.22              up  1.00000 1.00000
 23   hdd   6.00000             osd.23              up  1.00000 1.00000
 -9        72.00000     rack racka12
 -2        72.00000         host hh-ceph-128040
  0   hdd   6.00000             osd.0               up  1.00000 0.50000
  1   hdd   6.00000             osd.1               up  1.00000 1.00000
Relevant error log from the mon leader (the capture starts mid-line):
orting failure:1
2017-11-24 16:09:24.767761 7fdd215c1700  0 log_channel(cluster) log [DBG] : osd.14 10.199.128.214:6804/11943 reported immediately failed by osd.10 10.199.128.40:6820/12617
2017-11-24 16:09:24.996514 7fdd215c1700  1 mon.hh-ceph-128040@0(leader).osd e328 prepare_failure osd.14 10.199.128.214:6804/11943 from osd.6 10.199.128.40:6812/12317 is reporting failure:1
2017-11-24 16:09:24.996545 7fdd215c1700  0 log_channel(cluster) log [DBG] : osd.14 10.199.128.214:6804/11943 reported immediately failed by osd.6 10.199.128.40:6812/12317
2017-11-24 16:09:25.083523 7fdd23dc6700  0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)
2017-11-24 16:09:25.087241 7fdd1cdb8700  1 mon.hh-ceph-128040@0(leader).log v17642 check_sub sending message to client.94503 10.199.128.40:0/161437639 with 1 entries (version 17642)
2017-11-24 16:09:25.093344 7fdd1cdb8700  1 mon.hh-ceph-128040@0(leader).osd e329 e329: 36 total, 35 up, 36 in
2017-11-24 16:09:25.093857 7fdd1cdb8700  0 log_channel(cluster) log [DBG] : osdmap e329: 36 total, 35 up, 36 in
2017-11-24 16:09:25.094151 7fdd215c1700  0 mon.hh-ceph-128040@0(leader) e1 handle_command mon_command({"prefix": "osd metadata", "id": 30} v 0) v1
2017-11-24 16:09:25.094192 7fdd215c1700  0 log_channel(audit) log [DBG] : from='client.94503 10.199.128.40:0/161437639' entity='mgr.openstack' cmd=[{"prefix": "osd metadata", "id": 30}]: dispatch
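These failure reports come from the mon leader's log. Assuming the default log location and a mon id equal to the short hostname (both assumptions; adjust for your setup), the same events can be pulled out with:

grep 'reported immediately failed' /var/log/ceph/ceph-mon.$(hostname -s).log | tail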
Recovery procedure
Remove the osd.14 auth credentials:

[root@hh-ceph-128040 tmp]# ceph auth del osd.14
updated
Remove osd.14 from the crush map:
[root@hh-ceph-128214 ~]# ceph osd crush remove osd.14
removed item id 14 name 'osd.14' from crush map
Remove osd.14 from the osd map:
[root@hh-ceph-128214 ~]# ceph osd rm osd.14
removed osd.14
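The three removal steps above can be collected into one small script. This is just the commands run here with the osd id parameterized (OSD=14 is this test's value):

OSD=14
ceph auth del osd.${OSD}            # drop the cephx key
ceph osd crush remove osd.${OSD}    # remove it from the crush map
ceph osd rm osd.${OSD}              # remove it from the osd map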
The osd tree after removal (osd.14 is gone and the host, rack and root weights have dropped accordingly):
Every 2.0s: ceph osd tree                               Sat Nov 25 15:27:41 2017

ID  CLASS WEIGHT    TYPE NAME                   STATUS REWEIGHT PRI-AFF
 -1       210.00000 root default
-10        66.00000     rack racka07
 -3        66.00000         host hh-ceph-128214
 12   hdd   6.00000             osd.12              up  1.00000 1.00000
 13   hdd   6.00000             osd.13              up  1.00000 1.00000
 15   hdd   6.00000             osd.15              up  1.00000 1.00000
 16   hdd   6.00000             osd.16              up  1.00000 1.00000
 17   hdd   6.00000             osd.17              up  1.00000 1.00000
 18   hdd   6.00000             osd.18              up  1.00000 1.00000
 19   hdd   6.00000             osd.19              up  1.00000 1.00000
 20   hdd   6.00000             osd.20              up  1.00000 1.00000
 21   hdd   6.00000             osd.21              up  1.00000 1.00000
 22   hdd   6.00000             osd.22              up  1.00000 1.00000
 23   hdd   6.00000             osd.23              up  1.00000 1.00000
Delete the journal file and rebuild the journal filesystem:
[root@hh-ceph-128214 ceph]# rm -rf /var/lib/ceph/journal/ceph-14/journal
[root@hh-ceph-128214 /]# umount /dev/sdn3
[root@hh-ceph-128214 /]# mkfs -t xfs -f /dev/sdn3
meta-data=/dev/sdn3              isize=256    agcount=4, agsize=305152 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=1220608, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@hh-ceph-128214 ~]# mount /dev/sdn3 /var/lib/ceph/journal/ceph-14/
Rebuild the data partition:
[root@hh-ceph-128214 tmp]# umount /dev/sdc1
[root@hh-ceph-128214 /]# dd if=/dev/zero of=/dev/sdc bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.59539 s, 176 MB/s
[root@hh-ceph-128214 tmp]# parted -s /dev/sdc mklabel gpt
[root@hh-ceph-128214 tmp]# parted /dev/sdc mkpart primary xfs 1 100%
Information: You may need to update /etc/fstab.
[root@hh-ceph-128214 tmp]# mkfs.xfs -f -i size=1024 /dev/sdc1
meta-data=/dev/sdc1              isize=1024   agcount=6, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=1465130240, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@hh-ceph-128214 tmp]# mount /dev/sdc1 /var/lib/ceph/osd/ceph-14/
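If the mount is expected to survive a reboot, /etc/fstab (or a systemd mount unit) still has to be updated, as parted warns above. A hedged follow-up sketch; the filesystem UUID must be read from blkid, not guessed:

blkid /dev/sdc1                                   # note the new filesystem UUID
grep -q ceph-14 /etc/fstab || echo 'add a UUID-based entry for /var/lib/ceph/osd/ceph-14'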
Initialize the ceph osd (this recreates the journal file automatically):
[root@hh-ceph-128214 /]# ceph-osd -i 14 --mkfs --mkkey
2017-11-24 18:21:42.297329 7fc7dc79bd00 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2017-11-24 18:21:42.473203 7fc7dc79bd00 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2017-11-24 18:21:42.473725 7fc7dc79bd00 -1 read_settings error reading settings: (2) No such file or directory
2017-11-24 18:21:42.782000 7fc7dc79bd00 -1 created object store /var/lib/ceph/osd/ceph-14 for osd.14 fsid c45b752d-5d4d-4d3a-a3b2-04e73eff4ccd
2017-11-24 18:21:42.782044 7fc7dc79bd00 -1 auth: error reading file: /var/lib/ceph/osd/ceph-14/keyring: can't open /var/lib/ceph/osd/ceph-14/keyring: (2) No such file or directory
2017-11-24 18:21:42.782202 7fc7dc79bd00 -1 created new key in keyring /var/lib/ceph/osd/ceph-14/keyring
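A quick sanity check before re-registering the OSD: the fsid recorded in the new object store must match the cluster fsid. A one-line sketch (bash process substitution assumed):

diff <(cat /var/lib/ceph/osd/ceph-14/ceph_fsid) <(ceph fsid) && echo 'fsid OK'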
Create the osd:
[root@hh-ceph-128214 ~]# ceph osd create 14
Restore the auth credentials:
[root@hh-ceph-128214 tmp]# ceph auth add osd.14 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-14/keyring
added key for osd.14
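To confirm the capabilities were applied as intended, a quick check (sketch):

ceph auth get osd.14      # should print the key plus the osd/mon caps granted above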
Restore file ownership:
[root@hh-ceph-128214 /]# ls -l /var/lib/ceph/journal/ceph-14/ /var/lib/ceph/osd/ceph-14/
/var/lib/ceph/journal/ceph-14/:
total 2097152
-rw-r--r-- 1 root root 2147483648 Nov 24 18:21 journal

/var/lib/ceph/osd/ceph-14/:
total 36
-rw-r--r-- 1 root root 37 Nov 24 18:21 ceph_fsid
drwxr-xr-x 4 root root 61 Nov 24 18:21 current
-rw-r--r-- 1 root root 37 Nov 24 18:21 fsid
-rw------- 1 root root 57 Nov 24 18:21 keyring
-rw-r--r-- 1 root root 21 Nov 24 18:21 magic
-rw-r--r-- 1 root root  6 Nov 24 18:21 ready
-rw-r--r-- 1 root root  4 Nov 24 18:21 store_version
-rw-r--r-- 1 root root 53 Nov 24 18:21 superblock
-rw-r--r-- 1 root root 10 Nov 24 18:21 type
-rw-r--r-- 1 root root  3 Nov 24 18:21 whoami
[root@hh-ceph-128214 /]# chown ceph:ceph -R /var/lib/ceph/journal/ceph-14/ /var/lib/ceph/osd/ceph-14/
Start the ceph osd. The first attempts hit systemd's start-rate limit (the unit had failed repeatedly before the repair), so reset-failed is needed first:
[root@hh-ceph-128214 tmp]# systemctl status ceph-osd@14
● ceph-osd@14.service - Ceph object storage daemon osd.14
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2017-11-24 17:35:00 CST; 1min 51s ago
  Process: 106773 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 106767 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 106773 (code=exited, status=1/FAILURE)

Nov 24 17:34:40 hh-ceph-128214.vclound.com systemd[1]: Unit ceph-osd@14.service entered failed state.
Nov 24 17:34:40 hh-ceph-128214.vclound.com systemd[1]: ceph-osd@14.service failed.
Nov 24 17:35:00 hh-ceph-128214.vclound.com systemd[1]: ceph-osd@14.service holdoff time over, scheduling restart.
Nov 24 17:35:00 hh-ceph-128214.vclound.com systemd[1]: start request repeated too quickly for ceph-osd@14.service
Nov 24 17:35:00 hh-ceph-128214.vclound.com systemd[1]: Failed to start Ceph object storage daemon osd.14.
Nov 24 17:35:00 hh-ceph-128214.vclound.com systemd[1]: Unit ceph-osd@14.service entered failed state.
Nov 24 17:35:00 hh-ceph-128214.vclound.com systemd[1]: ceph-osd@14.service failed.
[root@hh-ceph-128214 tmp]# systemctl start ceph-osd@14
Job for ceph-osd@14.service failed because start of the service was attempted too often. See "systemctl status ceph-osd@14.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-osd@14.service" followed by "systemctl start ceph-osd@14.service" again.
[root@hh-ceph-128214 tmp]# systemctl reset-failed ceph-osd@14
[root@hh-ceph-128214 tmp]# systemctl start ceph-osd@14
[root@hh-ceph-128214 tmp]# systemctl status ceph-osd@14
● ceph-osd@14.service - Ceph object storage daemon osd.14
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2017-11-24 17:37:17 CST; 3s ago
  Process: 106871 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 106877 (ceph-osd)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@14.service
           └─106877 /usr/bin/ceph-osd -f --cluster ceph --id 14 --setuser ceph --setgroup ceph

Nov 24 17:37:17 hh-ceph-128214.vclound.com systemd[1]: Starting Ceph object storage daemon osd.14...
Nov 24 17:37:17 hh-ceph-128214.vclound.com systemd[1]: Started Ceph object storage daemon osd.14.
Nov 24 17:37:17 hh-ceph-128214.vclound.com ceph-osd[106877]: starting osd.14 at - osd_data /var/lib/ceph/osd/ceph-14 /var/lib/ceph/journal/ceph-14/journal
Nov 24 17:37:18 hh-ceph-128214.vclound.com ceph-osd[106877]: 2017-11-24 17:37:18.035052 7fbaaf369d00 -1 journal FileJournal::_open: disabling aio for non-block ...o anyway
Nov 24 17:37:18 hh-ceph-128214.vclound.com ceph-osd[106877]: 2017-11-24 17:37:18.047920 7fbaaf369d00 -1 osd.14 0 log_to_monitors {default=true}
Nov 24 17:37:18 hh-ceph-128214.vclound.com ceph-osd[106877]: 2017-11-24 17:37:18.054256 7fba96117700 -1 osd.14 0 waiting for initial osdmap
Hint: Some lines were ellipsized, use -l to show in full.
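Note the unit is still marked disabled in the status output above. Assuming OSDs on this host are meant to come back automatically after a reboot, it can be enabled as well (a sketch, not a step performed in this test):

systemctl enable ceph-osd@14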
Verification
Current ceph status:

  cluster:
    id:     c45b752d-5d4d-4d3a-a3b2-04e73eff4ccd
    health: HEALTH_WARN
            Degraded data redundancy: 8965/137559 objects degraded (6.517%), 60 pgs unclean, 206 pgs degraded

  services:
    mon: 3 daemons, quorum hh-ceph-128040,hh-ceph-128214,hh-ceph-128215
    mgr: openstack(active)
    osd: 36 osds: 36 up, 36 in          <- note: all 36 osds are up and in again

  data:
    pools:   1 pools, 2048 pgs
    objects: 45853 objects, 178 GB
    usage:   540 GB used, 195 TB / 196 TB avail
    pgs:     8965/137559 objects degraded (6.517%)
             1842 active+clean
             201  active+recovery_wait+degraded
             5    active+recovering+degraded

  io:
    recovery: 168 MB/s, 42 objects/s
See the osd tree; osd.14 is back up, and its crush weight now reflects the actual disk size rather than the manually set 6.00000:
[root@hh-ceph-128214 ceph]# ceph osd tree
ID  CLASS WEIGHT    TYPE NAME                   STATUS REWEIGHT PRI-AFF
 -1       215.45609 root default
-10        71.45609     rack racka07
 -3        71.45609         host hh-ceph-128214
 12   hdd   6.00000             osd.12              up  1.00000 1.00000
 13   hdd   6.00000             osd.13              up  1.00000 1.00000
 14   hdd   5.45609             osd.14              up  1.00000 1.00000
 15   hdd   6.00000             osd.15              up  1.00000 1.00000
 16   hdd   6.00000             osd.16              up  1.00000 1.00000
 17   hdd   6.00000             osd.17              up  1.00000 1.00000
 18   hdd   6.00000             osd.18              up  1.00000 1.00000
 19   hdd   6.00000             osd.19              up  1.00000 1.00000
 20   hdd   6.00000             osd.20              up  1.00000 1.00000
 21   hdd   6.00000             osd.21              up  1.00000 1.00000
 22   hdd   6.00000             osd.22              up  1.00000 1.00000
 23   hdd   6.00000             osd.23              up  1.00000 1.00000
 -9        72.00000     rack racka12
 -2        72.00000         host hh-ceph-128040
  0   hdd   6.00000             osd.0               up  1.00000 0.50000
  1   hdd   6.00000             osd.1               up  1.00000 1.00000
  2   hdd   6.00000             osd.2               up  1.00000 1.00000
  3   hdd   6.00000             osd.3               up  1.00000 1.00000
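Recovery then proceeds on its own. A small wait-loop sketch for scripted checks, polling every 30 seconds until the cluster returns to HEALTH_OK:

until ceph health | grep -q HEALTH_OK; do
    ceph -s | grep degraded       # show remaining degraded objects, if any
    sleep 30
done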
Summary
When recovering a data disk, the failed osd must first be removed from the cluster (ceph osd rm osd.14); the earlier ceph 0.87 release did not require this step during recovery.
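For reference, the whole recovery path condensed into one hedged script. Every value (osd id, /dev/sdc, /dev/sdn3, the mount points) is taken from this test and must be adapted per host; this is a sketch of the steps above, not a general-purpose tool:

#!/bin/bash
set -e
OSD=14; DATA_DEV=/dev/sdc; DATA_PART=/dev/sdc1; JOURNAL_PART=/dev/sdn3

# 1. remove the failed osd from the cluster
ceph auth del osd.${OSD}
ceph osd crush remove osd.${OSD}
ceph osd rm osd.${OSD}

# 2. rebuild the journal and data filesystems
umount ${JOURNAL_PART} || true
mkfs -t xfs -f ${JOURNAL_PART}
mount ${JOURNAL_PART} /var/lib/ceph/journal/ceph-${OSD}/
umount ${DATA_PART} || true
dd if=/dev/zero of=${DATA_DEV} bs=1M count=100     # clear the old partition table
parted -s ${DATA_DEV} mklabel gpt
parted -s ${DATA_DEV} mkpart primary xfs 1 100%
mkfs.xfs -f -i size=1024 ${DATA_PART}
mount ${DATA_PART} /var/lib/ceph/osd/ceph-${OSD}/

# 3. recreate and re-register the osd, then start it
ceph-osd -i ${OSD} --mkfs --mkkey
ceph osd create ${OSD}
ceph auth add osd.${OSD} osd 'allow *' mon 'allow profile osd' \
    -i /var/lib/ceph/osd/ceph-${OSD}/keyring
chown -R ceph:ceph /var/lib/ceph/journal/ceph-${OSD}/ /var/lib/ceph/osd/ceph-${OSD}/
systemctl reset-failed ceph-osd@${OSD} || true
systemctl start ceph-osd@${OSD}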