If a primary failover happens during a movePrimary operation, we could fail to clear the database metadata on the new primary node of the coordinator shard, leading to possible data loss.
As part of the movePrimary coordinator, database metadata on the primary node is explicitly cleared in the kCommit phase, while on secondary nodes metadata is cleared indirectly when the database recoverable critical section is exited in the kExitCriticalSection phase.
If a step-down happens between these two phases and a new primary node is elected on the coordinator shard, we could miss clearing metadata on the new primary.
Consider the following scenario:
- kCommit
- N1 (primary) -> db metadata cleared
- N2 (secondary) -> db metadata not cleared
- kExitCriticalSection
- N1 (secondary) -> db metadata cleared
- N2 (primary) -> db metadata not cleared
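The scenario above can be sketched as a minimal model of the race. This is an illustrative simulation only, not the actual MongoDB implementation; the `Node` class, phase functions, and the `has_db_metadata` flag are hypothetical names standing in for the real server internals.

```python
class Node:
    def __init__(self, name, role):
        self.name = name
        self.role = role              # "primary" or "secondary"
        self.has_db_metadata = True   # True = stale database metadata still cached

def k_commit(nodes):
    # kCommit explicitly clears database metadata on the current primary only.
    for n in nodes:
        if n.role == "primary":
            n.has_db_metadata = False

def step_down(nodes):
    # Failover between the two phases: roles swap and a new primary is elected.
    for n in nodes:
        n.role = "secondary" if n.role == "primary" else "primary"

def k_exit_critical_section(nodes):
    # Exiting the recoverable critical section clears metadata indirectly,
    # which in this scenario only takes effect on secondaries.
    for n in nodes:
        if n.role == "secondary":
            n.has_db_metadata = False

n1 = Node("N1", "primary")
n2 = Node("N2", "secondary")
nodes = [n1, n2]

k_commit(nodes)                 # clears N1 (primary); N2 untouched
step_down(nodes)                # N2 becomes the new primary
k_exit_critical_section(nodes)  # clears N1 (now secondary); N2 untouched

print(n1.has_db_metadata, n2.has_db_metadata)  # False True
```

The end state shows the bug: N2, the new primary, still holds stale database metadata because it was a secondary during kCommit and a primary during kExitCriticalSection, so neither clearing path ever ran on it.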
Caused by: SERVER-71308 Enable featureFlag for resilient movePrimary