
Handling MongoDB Replica Set Members Stuck in the RECOVERING State

2018-07-17 22:19
Over the past couple of days I have hit several MongoDB replica set failures. One recurring case is a member that stays in the RECOVERING state for a long time and cannot catch up with the primary on its own, so manual intervention is needed.
First, run rs.status() to check the members; some of them turn out to be in the RECOVERING state.

The log on the affected node shows the following errors:

2018-07-17T19:04:27.343+0800 I REPL     [ReplicationExecutor] syncing from: 10.204.11.48:9303
2018-07-17T19:04:27.347+0800 W REPL     [rsBackgroundSync] we are too stale to use 10.204.11.48:9303 as a sync source
2018-07-17T19:04:27.347+0800 I REPL     [ReplicationExecutor] could not find member to sync from
2018-07-17T19:04:27.347+0800 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode
2018-07-17T19:04:27.347+0800 I REPL     [rsBackgroundSync] our last optime : (term: -1, timestamp: Jul 16 11:55:17:103bb)
2018-07-17T19:04:27.347+0800 I REPL     [rsBackgroundSync] oldest available is (term: -1, timestamp: Jul 17 12:49:36:9ffb)
2018-07-17T19:04:27.347+0800 I REPL     [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2018-07-17T19:04:27.347+0800 I REPL     [ReplicationExecutor] going into maintenance mode with 1856 other maintenance mode tasks in progress


Clearly the node has been out of the replica set for too long: its last applied optime (Jul 16 11:55:17) is older than the oldest entry still available in the sync source's oplog (Jul 17 12:49:36), so it is "too stale" to resume incremental replication from any other member. In this situation there are two ways to bring the member back into the replica set.
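Before resyncing, you can confirm the diagnosis by checking how far back the sync source's oplog actually reaches. A minimal check, run in the mongo shell on the primary:

PRIMARY> rs.printReplicationInfo()   // prints the configured oplog size plus the oplog's first and last event times

If the stale member's last optime is older than the "oplog first event time" reported here, the member cannot catch up on its own and must be resynced.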
Method 1: Automatically Sync a Member
This approach is the simplest: shut the node down, empty its data directory, and restart it; the member then performs a full initial sync automatically (a command-line sketch follows the steps below).
Specifically:
a. Shut down the node: db.shutdownServer()
b. Empty the data directory: mv data data_old ; mkdir data
c. Start the node: mongod -f /etc/mongodb9303.cnf
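Put together, the procedure looks roughly like this. The port (9303) and config file (/etc/mongodb9303.cnf) are taken from this deployment; the dbPath /data/mongodb9303 is only an assumed example, so substitute your own, and add -u/-p if authentication is enabled.

# a. shut the node down cleanly (connect locally to the member being repaired)
mongo --port 9303 admin --eval 'db.shutdownServer()'
# b. move the old data aside and recreate an empty data directory
mv /data/mongodb9303/data /data/mongodb9303/data_old
mkdir /data/mongodb9303/data
# c. restart the member; it rejoins in STARTUP2 and starts an initial sync on its own
mongod -f /etc/mongodb9303.cnf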
The following log shows the member starting its initial sync after the restart.

2018-07-17T21:38:01.131+0800 I REPL     [ReplicationExecutor] This node is 10.204.11.50:9303 in the config
2018-07-17T21:38:01.131+0800 I REPL     [ReplicationExecutor] transition to STARTUP2
2018-07-17T21:38:01.131+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 10.204.11.48:9303
2018-07-17T21:38:01.132+0800 I REPL     [ReplicationExecutor] Member 10.204.11.70:9303 is now in state ARBITER
2018-07-17T21:38:01.134+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to 10.204.11.48:9303, took 3ms (1 connections now open to 10.204.11.48:9303)
2018-07-17T21:38:01.134+0800 I REPL     [ReplicationExecutor] Member 10.204.11.48:9303 is now in state PRIMARY
2018-07-17T21:38:01.572+0800 I NETWORK  [initandlisten] connection accepted from 10.204.11.48:59332 #2 (2 connections now open)
2018-07-17T21:38:01.587+0800 I ACCESS   [conn2] Successfully authenticated as principal __system on local
2018-07-17T21:38:02.131+0800 I REPL     [rsSync] ******
2018-07-17T21:38:02.131+0800 I REPL     [rsSync] creating replication oplog of size: 20480MB...
2018-07-17T21:38:02.133+0800 I STORAGE  [rsSync] Starting WiredTigerRecordStoreThread local.oplog.rs
2018-07-17T21:38:02.133+0800 I STORAGE  [rsSync] The size storer reports that the oplog contains 0 records totaling to 0 bytes
2018-07-17T21:38:02.133+0800 I STORAGE  [rsSync] Scanning the oplog to determine where to place markers for truncation
2018-07-17T21:38:02.137+0800 I REPL     [rsSync] ******
2018-07-17T21:38:02.137+0800 I REPL     [rsSync] initial sync pending
2018-07-17T21:38:02.140+0800 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync
2018-07-17T21:38:02.968+0800 I NETWORK  [initandlisten] connection accepted from 10.204.11.50:54890 #3 (3 connections now open)
2018-07-17T21:38:03.140+0800 I REPL     [rsSync] initial sync pending
2018-07-17T21:38:03.140+0800 I REPL     [ReplicationExecutor] syncing from: 10.204.11.48:9303
2018-07-17T21:38:03.144+0800 I REPL     [rsSync] initial sync drop all databases
2018-07-17T21:38:03.144+0800 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 1
2018-07-17T21:38:03.144+0800 I REPL     [rsSync] initial sync clone all databases
2018-07-17T21:38:03.304+0800 I REPL     [rsSync] fetching and creating collections for admin
2018-07-17T21:38:03.306+0800 I REPL     [rsSync] fetching and creating collections for dmp_edata_leju_com
2018-07-17T21:38:05.698+0800 I REPL     [rsSync] fetching and creating collections for test
2018-07-17T21:38:05.699+0800 I REPL     [rsSync] initial sync cloning db: admin
2018-07-17T21:38:05.707+0800 I INDEX    [rsSync] build index on: admin.system.users properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "admin.system.users" }


rs.status() now shows the member in the STARTUP2 state, and its data directory keeps growing.

{
"_id" : 3,
"name" : "10.204.11.50:9303",
"health" : 1,
"state" : 5,
"stateStr" : "STARTUP2",
"uptime" : 1586,
"optime" : Timestamp(0, 0),
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"syncingTo" : "10.204.11.48:9303",
"configVersion" : 88886,
"self" : true
},
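To watch just this member flip from STARTUP2 to SECONDARY without reading the whole rs.status() document each time, a small shell snippet like the following can help (an illustrative sketch; the host:port is the member from the output above):

rs.status().members.filter(function (m) {
    return m.name === "10.204.11.50:9303";          // the member being resynced
}).forEach(function (m) {
    print(m.name + " -> " + m.stateStr + " (syncingTo: " + m.syncingTo + ")");
});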


You can also check the sync progress with db.printSlaveReplicationInfo().

PRIMARY> db.printSlaveReplicationInfo( )
source: 10.204.11.50:9303
syncedTo: Thu Jan 01 1970 08:00:00 GMT+0800 (CST)
1531815275 secs (425504.24 hrs) behind the primary
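The syncedTo time is still the Unix epoch because the member has not applied any oplog entries yet, which is why the reported lag is the entire time since 1970. To get a per-member lag figure computed directly from rs.status(), a rough sketch like this (run on the primary; arbiters are skipped because they carry no data) also works:

var s = rs.status();
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
    if (m.stateStr === "ARBITER") return;
    var lagSecs = (primary.optimeDate - m.optimeDate) / 1000;   // Date arithmetic in milliseconds
    print(m.name + " (" + m.stateStr + "): " + lagSecs + " s behind the primary");
});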


Method 2: Sync by Copying Data Files from Another Member
This method restores the member by copying data files from another member. The prerequisite is that the replica set has a spare secondary you can afford to take offline: shut that secondary down, copy its data directory to the member being repaired, then start the repaired member and let it replay the remaining oplog. Compared with Method 1 this is usually much faster, because copying raw data files avoids the collection-by-collection clone and index rebuild of an initial sync (a rough command sketch follows).
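A sketch of the procedure, assuming a healthy donor secondary at 10.204.11.49:9303 and the same dbPath layout as in Method 1 (the donor host and paths are illustrative, not from the original post):

# 1. cleanly stop the donor secondary so its data files are consistent on disk
mongo --host 10.204.11.49 --port 9303 admin --eval 'db.shutdownServer()'
# 2. copy its entire dbPath (including the local database and journal) to the broken member
rsync -av /data/mongodb9303/data/ 10.204.11.50:/data/mongodb9303/data/
# 3. restart both members; the copied node then catches up from the primary's oplog
mongod -f /etc/mongodb9303.cnf        # run on each host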
For full details on resyncing a replica set member, see the official documentation:
https://docs.mongodb.com/manual/tutorial/resync-replica-set-member/?spm=a2c4e.11153940.blogcont426357.5.6c78424fIPyht1
Tags: MongoDB