Type: Bug
Resolution: Duplicate
Priority: Critical - P2
None
Affects Version/s: None
Component/s: Replication
Fully Compatible
ALL
General note: I know the title is too general, but this is the third bug I have opened this week. We have another one coming for 3.2.1, related to sharding, which we will publish soon. We are thinking of moving away from MongoDB; the reliability of 3.2 is horrible!
There are 2 bugs in this ticket:
1. We removed a member using rs.remove(). After that, the removed member (whose log is attached) started a versioning mess and killed itself.
Attached filename: crash.
2. The second time, we saw the following behavior: a member elects itself although it should not, and causes a rollback on the other member.
Our setup: primary, secondary and arbiter.
We run rs.stepDown() on the primary for maintenance.
The secondary takes over.
When the original primary is back, it starts syncing. As you can see from the logs, during this time it receives 2 "no" votes because it is still stale, but then it receives only 1 "yes" vote (for some reason, the arbiter is quiet) and is elected before its time. This causes a rollback on the other node.
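For reference, the vote arithmetic for a three-member set can be sketched as below. This is only an illustration of a strict-majority rule, assuming the candidate counts its own vote in addition to the votes it receives from other members. The function names are hypothetical and this is not the server's election code.

```python
def votes_needed(voting_members: int) -> int:
    """Smallest vote count that is a strict majority of the voting members."""
    return voting_members // 2 + 1


def wins_election(voting_members: int, yes_votes_from_others: int) -> bool:
    """Assumption: the candidate adds its own vote to those granted by others."""
    return 1 + yes_votes_from_others >= votes_needed(voting_members)


# Primary, secondary and arbiter: three voting members.
print(votes_needed(3))      # majority threshold for this setup
print(wins_election(3, 1))  # own vote plus one "yes"
print(wins_election(3, 0))  # own vote only
```

Under these assumptions, a single "yes" from another member is enough to reach a majority of three, so what needs explaining is why that "yes" was granted while the candidate was still stale.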
Logs from all 3 nodes are attached (primary, secondary, arbiter). Please note the following lines:
2016-02-07T09:38:14.612+0000 I REPL [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2016-02-07T09:38:14.612+0000 I REPL [ReplicationExecutor] VoteRequester: Got no vote from in.db2arb.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
and 11 seconds later, suddenly:
2016-02-07T09:38:25.613+0000 I REPL [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2016-02-07T09:38:25.613+0000 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2016-02-07T09:38:25.614+0000 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 4
2016-02-07T09:38:25.614+0000 I REPL [ReplicationExecutor] transition to PRIMARY
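To make the timeline easier to check, the quoted lines can be parsed with a small script. This is a rough sketch that assumes the log format is exactly as shown in the lines above. The regular expression and the helper name `parse_no_votes` are made up for illustration and are not part of any MongoDB tooling.

```python
import re
from datetime import datetime

# Lines copied verbatim from the candidate's log quoted above.
LOG_TEXT = """\
2016-02-07T09:38:14.612+0000 I REPL [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2016-02-07T09:38:14.612+0000 I REPL [ReplicationExecutor] VoteRequester: Got no vote from in.db2arb.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2016-02-07T09:38:25.613+0000 I REPL [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2016-02-07T09:38:25.613+0000 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2016-02-07T09:38:25.614+0000 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 4
2016-02-07T09:38:25.614+0000 I REPL [ReplicationExecutor] transition to PRIMARY
"""

# Hypothetical pattern for the "Got no vote" lines as they appear above.
VOTE_RE = re.compile(
    r"^(\S+) I REPL\s+\[ReplicationExecutor\] VoteRequester: "
    r"Got no vote from (\S+) because: (.*?), resp:"
)


def parse_no_votes(text):
    """Return (timestamp, host, reason) for every refused vote in the text."""
    votes = []
    for line in text.splitlines():
        match = VOTE_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
            votes.append((ts, match.group(2), match.group(3)))
    return votes


votes = parse_no_votes(LOG_TEXT)
# Seconds between the first and the last refused vote.
gap = (votes[-1][0] - votes[0][0]).total_seconds()
for ts, host, reason in votes:
    print(ts.isoformat(), host, reason)
print("seconds between first and last refusal:", gap)
```

Only the "Got no vote" lines are matched, so in this excerpt the final round shows a refusal from in.db2m2 and no response at all from the arbiter before the election succeeds.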
All members run protocol version 1. They were on version 0 but were upgraded about a week ago, following your docs.
duplicates
- SERVER-23663 New primary syncs from chosen node to catch up with timeout (Closed)
- SERVER-18453 Avoiding Rollbacks in new Raft based election protocol (Closed)
is related to
- SERVER-22504 Do not blindly add self to heartbeat member data array in the TopologyCoordinator (Closed)
- SERVER-11086 Election handoff to new primary, during stepdown (Closed)