Type: New Feature
Resolution: Fixed
Priority: Minor - P4
Affects Version/s: 3.4.13
Component/s: Networking, Replication
Minor Change
v4.4, v4.2, v4.0
Repl 2020-08-24, Repl 2020-09-07, Repl 2020-09-21, Repl 2020-10-05
We have some dev/staging environments hosted locally in our office building. They are entirely for internal use, so their uptime isn't critical. We have recently been experiencing power outages that take all three members of the replica set down; when power is restored, they all come back up at the same time. This has happened about five times now, and each time the replica set comes back up, both the primary and the secondary end up in the REMOVED state and never recover unless we manually restart one of the mongod processes.
mongo-dev1 rs.status()
{ "state" : 10, "stateStr" : "REMOVED", "uptime" : 199841, "optime" : { "ts" : Timestamp(1529137449, 1), "t" : NumberLong(590) }, "optimeDate" : ISODate("2018-06-16T08:24:09Z"), "ok" : 0, "errmsg" : "Our replica set config is invalid or we are not a member of it", "code" : 93, "codeName" : "InvalidReplicaSetConfig" }
mongo-dev2 rs.status()
{ "state" : 10, "stateStr" : "REMOVED", "uptime" : 199879, "optime" : { "ts" : Timestamp(1529137449, 1), "t" : NumberLong(590) }, "optimeDate" : ISODate("2018-06-16T08:24:09Z"), "ok" : 0, "errmsg" : "Our replica set config is invalid or we are not a member of it", "code" : 93, "codeName" : "InvalidReplicaSetConfig" }
mongo-dev1 show log rs
2018-06-16T09:10:25.236+0000 I REPL [replExecDBWorker-0] New replica set config in use: { _id: "dev_cluster1", version: 140719, protocolVersion: 1, members: [ { _id: 0, host: "mongo-dev1.220office.local:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 3, host: "utility-dev1.220office.local:27017", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 4, host: "mongo-dev2.2
2018-06-16T09:10:25.236+0000 I REPL [replExecDBWorker-0] transition to REMOVED
2018-06-17T21:28:09.139+0000 I REPL [ReplicationExecutor] Member utility-dev1.220office.local:27017 is now in state ARBITER
mongo-dev1 rs.conf()
{ "_id" : "dev_cluster1", "version" : 140719, "protocolVersion" : NumberLong(1), "members" : [ { "_id" : 0, "host" : "mongo-dev1.220office.local:27017", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 2, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 }, { "_id" : 3, "host" : "utility-dev1.220office.local:27017", "arbiterOnly" : true, "buildIndexes" : true, "hidden" : false, "priority" : 1, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 }, { "_id" : 4, "host" : "mongo-dev2.220office.local:27017", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 1, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 } ], "settings" : { "chainingAllowed" : true, "heartbeatIntervalMillis" : 2000, "heartbeatTimeoutSecs" : 10, "electionTimeoutMillis" : 10000, "catchUpTimeoutMillis" : 60000, "getLastErrorModes" : { }, "getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 } } }
As I read the documentation, I can't find much information about the REMOVED state. In our setup, since mongo-dev1 has a priority of 2 and mongo-dev2 has a priority of 1, I would expect mongo-dev1 to be elected primary after the reboot.
Is this a bug, or are we doing something wrong? If it's a bug, what is the proper procedure for recovering when all members of the replica set come online at the same time?
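As a manual workaround until a restart-free fix lands, one common approach is to force a reconfig from a stuck data-bearing member. This is a sketch, not an official procedure: the hostnames assume the rs.conf() output above, and the config is read from the node's own local database. Verify against the documentation for your version before using it on anything important.

```javascript
// Connect directly to the REMOVED member first, e.g.:
//   mongo --host mongo-dev1.220office.local:27017

// 1. Read the node's cached replica set config from the local database.
cfg = db.getSiblingDB("local").system.replset.findOne();

// 2. Bump the version and force the node to re-adopt the config.
//    { force: true } is required because a REMOVED node is not PRIMARY.
cfg.version += 1;
rs.reconfig(cfg, { force: true });

// 3. Confirm the node leaves REMOVED (state 10); after an election it
//    should report 1 (PRIMARY) or 2 (SECONDARY).
rs.status().myState;
```

A forced reconfig can cause rollbacks on a healthy set, so it should only be run when the members genuinely cannot rejoin on their own, as in the power-outage scenario described above.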
is related to:
- SERVER-62699 Replica set fails to restart after shutdown of all Nodes in a Dynamic DNS/network environment (Backlog)
- SERVER-48178 Finding self in reconfig may be interrupted by closing connections due to rollback (Closed)
- SERVER-51163 Mark nodes returning InvalidReplicaSetConfig in heartbeats as down (Closed)
- SERVER-40159 Add retry logic for name resolution failure in isSelf (Closed)
related to:
- SERVER-41031 After an unreachable node is added and removed from the replica set, the other replica set members continue to send heartbeat to this removed node (Open)
- SERVER-54121 Single replica set node removed due to isSelf cannot re-attempt to find itself (Open)
- SERVER-48480 Abort initial sync upon transition to REMOVED state (Closed)