ISSUE DESCRIPTION AND IMPACT
This issue can cause unavailability of a shard in sharded clusters running MongoDB versions 5.0.0 - 5.0.5 and 5.1.0 - 5.1.1. Later versions are not affected.
The problem can occur if all of the following conditions have been met at least once:
- More than one sharded collection
- Multiple migrations
- Intense write workloads or hardware failures
Symptom of the bug: the mongod process crashes on step-up due to an invariant failure with the following message: "Upon step-up a second migration coordinator was found".
REMEDIATION AND WORKAROUNDS
- Restart the nodes of the shard as a replica set
- Double-check that at most one migration coordinator document does not have a definitive decision.
- For each migration coordinator document with a definitive decision, double-check that range deletion tasks are consistent with migration coordinators (same range and collectionUUID, if present):
- Aborted decision:
— No range deletion document on donor
— Zero or one ready range deletion document on recipient
- Committed decision:
— Zero or one ready range deletion document on donor
— No range deletion document on recipient
- No decision:
— One pending range deletion task on donor
— One pending range deletion task on recipient
- Majority-delete all migration coordinators with a definitive decision
- Restart the nodes as a shard
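Assuming the coordinator and range deletion documents have already been read from config.migrationCoordinators and config.rangeDeletions on the donor and recipient, the consistency checks above can be sketched as plain JavaScript. Field names such as decision, pending, range, and collectionUuid are assumptions based on the internal document formats, not a verified schema:

```javascript
// Structural compare of range bounds (BSON documents) - good enough for a sketch.
function sameRange(a, b) {
  return JSON.stringify(a) === JSON.stringify(b);
}

// A range deletion task matches a coordinator if it covers the same range
// and (when both carry it) the same collectionUuid.
function matchesCoordinator(task, coordinator) {
  if (coordinator.collectionUuid !== undefined &&
      task.collectionUuid !== undefined &&
      String(task.collectionUuid) !== String(coordinator.collectionUuid)) {
    return false;
  }
  return sameRange(task.range, coordinator.range);
}

// donorTasks / recipientTasks: range deletion documents fetched from the
// donor's and recipient's config.rangeDeletions collections.
function coordinatorIsConsistent(coordinator, donorTasks, recipientTasks) {
  const donor = donorTasks.filter(t => matchesCoordinator(t, coordinator));
  const recipient = recipientTasks.filter(t => matchesCoordinator(t, coordinator));

  switch (coordinator.decision) {
    case 'aborted':
      // No range deletion document on donor; zero or one ready on recipient.
      return donor.length === 0 &&
             recipient.length <= 1 && recipient.every(t => !t.pending);
    case 'committed':
      // Zero or one ready document on donor; none on recipient.
      return recipient.length === 0 &&
             donor.length <= 1 && donor.every(t => !t.pending);
    default:
      // No definitive decision: exactly one pending task on each side.
      return donor.length === 1 && donor[0].pending === true &&
             recipient.length === 1 && recipient[0].pending === true;
  }
}
```

Running this over each coordinator document, with the tasks fetched from both shards, flags any coordinator whose range deletion state does not match the rules above.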
TECHNICAL DETAILS
Migration coordinators:
- Documents persisted locally on shards in the internal collection config.migrationCoordinators
- The structure of migration coordinator documents can be found here.
Range deletion tasks:
- Documents persisted locally on shards in the internal collection config.rangeDeletions
- The structure of range deletion task documents can be found here.
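As a sketch of the majority-delete step from the remediation above, the raw delete command against a shard's local config database could be built as follows. The filter on a decision field is an assumption about the coordinator document format; the command would be run via db.getSiblingDB('config').runCommand(...) while the nodes are started as a plain replica set:

```javascript
// Builds a delete command that removes every migration coordinator
// document that already has a definitive decision, with majority write
// concern so the deletion survives a failover. The `decision` field name
// is an assumption about the internal document format.
function buildMajorityDeleteDecidedCoordinators() {
  return {
    delete: 'migrationCoordinators',        // collection in the config db
    deletes: [
      {
        q: { decision: { $exists: true } }, // only decided coordinators
        limit: 0,                           // 0 = delete all matching documents
      },
    ],
    writeConcern: { w: 'majority' },
  };
}
```

Building the command as a plain object keeps the filter and write concern reviewable before anything is executed against the shard.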
--- Original ticket description ---
There are several situations that can lead to more than one migration (for different collections) needing recovery on step-up. For example, when a migration fails here we only clear the collection's filtering metadata, so that the next access to the collection triggers the recovery, and then release the ActiveMigrationRegistry. At this point, nothing prevents a migration of a different collection from starting, so if the shard then stepped down it would have two migrations to recover.
This invariant, along with taking the MigrationBlockingGuard during step-up migration recovery, was added in SERVER-50174. It was meant to prevent migrations of different collections from starting before the unfinished migrations found on step-up are recovered. However, as described above, situations with multiple migrations pending recovery can still arise even without a step-down.
The fact that a different migration (of another collection) starts using the same lsid as the migration pending recovery should not be a problem. The new migration will use a txnNumber that is two greater than the previous migration's. This is effectively the same as advancing the txn number: it prevents the first migration from using its original (lsid, txnNumber) pair. The TransactionTooOld error that the recovering migration then gets when advancing the txnNumber on the recipient is safe to ignore: it simply means a newer migration has already advanced the session's txnNumber.
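The (lsid, txnNumber) interaction described above can be illustrated with a toy model of the session rule. This is a simplification for illustration only; the real logic lives in the server's transaction machinery:

```javascript
// Toy model: a session accepts a txnNumber only if it is >= the highest
// number seen so far; older numbers are rejected as TransactionTooOld.
class SessionModel {
  constructor() {
    this.highestTxnNumber = -1;
  }
  startTxn(txnNumber) {
    if (txnNumber < this.highestTxnNumber) return 'TransactionTooOld';
    this.highestTxnNumber = txnNumber;
    return 'ok';
  }
}

// The first migration ran with txnNumber N; a new migration on the same
// lsid starts with N + 2, so the recovering first migration can no longer
// use its original number.
const session = new SessionModel();
session.startTxn(5);                    // first migration (N = 5)
session.startTxn(7);                    // new migration uses N + 2
const recovered = session.startTxn(5);  // recovery attempt
// -> 'TransactionTooOld'
```

The rejected recovery attempt is exactly the benign TransactionTooOld case described above.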
This ticket will provide a fix so that clusters already in the faulty situation of having several migrations pending recovery no longer hit the invariant on step-up. SERVER-62296 will prevent this faulty situation from happening again.
- is caused by:
  - SERVER-50174 Multiple concurrent migration recoveries after step-up can race for the fixed Lsid/TxnNumber (Closed)
- is related to:
  - SERVER-60521 Deadlock on stepup due to moveChunk command running uninterrupted on secondary (Closed)
  - SERVER-62213 Investigate presence of multiple migration coordinator documents (Closed)
  - SERVER-62243 Wait for vector clock document majority-commit without timeout (Closed)
  - SERVER-62316 Remove the workaround for SERVER-62245 once 6.0 branches out (Closed)
- related to:
  - SERVER-62296 MoveChunk should recover any unfinished migration before starting a new one (Closed)