Type: Bug
Resolution: Works as Designed
Priority: Major - P3
Affects Version/s: 4.4.9, 7.0.11
Component/s: None
Replication
ALL
Repl 2024-10-14, Repl 2024-10-28
We have observed two failovers on our MongoDB setup running v4.4.9 in which a majority of the secondaries entered the rollback state. Chaining is disabled on our setup. We then attempted to reproduce this scenario on v7.0 using JS tests and believe the bug still exists.
Below is a rough sequence of events that can lead to rollback; the associated JS test is attached as a patch file. Note that we added sleeps to the source code to better simulate what we saw on our setup.
1. The old primary is frozen: its threads are not making progress.
2. Meanwhile, write requests are issued to the old primary, and these get stuck too.
3. An election is triggered because the secondaries do not see a progressing primary, and a new primary wins the election.
4. During the catch-up phase on the new primary, the writes from (2) unfreeze on the old primary and make their way into its oplog.
5. All secondaries sync these writes to their oplogs.
6. The new primary exits the catch-up phase and declares itself ready to accept writes.
7. The secondaries switch their sync source to the new primary, realize that their oplogs have diverged, and enter the rollback state for several minutes.
8. During (7), the cluster is unavailable for reads and writes, which effectively renders it down.
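
The sequence above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not MongoDB's actual rollback algorithm and not the attached JS test: each oplog is modeled as a list of hypothetical `(term, index)` pairs, and the helper names `common_point` and `entries_to_roll_back` are invented for illustration. It shows why the secondaries must discard the late writes once they compare their oplogs with the new primary's.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified oplog entry: (election term, position in the oplog).
OplogEntry = Tuple[int, int]


def common_point(local: List[OplogEntry], source: List[OplogEntry]) -> int:
    """Return how many leading entries the two oplogs have in common."""
    shared = 0
    for mine, theirs in zip(local, source):
        if mine != theirs:
            break
        shared += 1
    return shared


def entries_to_roll_back(local: List[OplogEntry],
                         source: List[OplogEntry]) -> List[OplogEntry]:
    """Entries in the local oplog that the sync source does not have."""
    return local[common_point(local, source):]


# Before the freeze, every node agrees on the same history from term 1.
history: List[OplogEntry] = [(1, 1), (1, 2)]
old_primary = list(history)
new_primary = list(history)
secondaries: Dict[str, List[OplogEntry]] = {"s1": list(history), "s2": list(history)}

# The old primary freezes, writes get stuck, and a new primary is elected in term 2.
# While the new primary is catching up, the stuck writes finally land on the old
# primary, still stamped with term 1.
late_writes: List[OplogEntry] = [(1, 3), (1, 4)]
old_primary.extend(late_writes)

# The secondaries are still syncing from the old primary and copy those writes.
for oplog in secondaries.values():
    oplog.extend(late_writes)

# The new primary leaves catch-up and accepts its first write in term 2.
new_primary.append((2, 3))

# The secondaries switch sync source and discover the divergence.
rolled_back = {name: entries_to_roll_back(oplog, new_primary)
               for name, oplog in secondaries.items()}

for name, entries in rolled_back.items():
    print(name, "must roll back", entries)
```

In this model every secondary shares only the first two entries with the new primary, so both late writes have to be rolled back on each of them, which matches the majority rollback described in the report.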
- is related to: SERVER-91764 Election of new primary caused all secondaries to rollback (Closed)