- Type: Bug
- Resolution: Gone away
- Priority: Major - P3
- None
- Affects Version/s: 4.4.0, 5.0.0, 5.1.0-rc0
- Component/s: Sharding
- ALL
- Sharding EMEA 2021-10-18, Sharding EMEA 2022-02-21
- 0
Consider the following interleaving:
1. A shard was running a moveChunk and had already persisted the migration recovery document. It then steps down, so the new primary will need to recover the migration.
2. In parallel, on that same node, another moveChunk arrived while the node was still primary, but had not yet executed past this point.
3. The stepdown completes.
4. The second moveChunk continues and is able to register the migration (since the first migration has already unregistered from the ActiveMigrationRegistry). A new ThreadClient is created and marked as killable on stepdown. However, since the node has already transitioned to secondary, it will not actually get killed.
5. The old primary that just stepped down wins the election and becomes primary again.
6. During stepup, the primary sees that there was a migration ongoing (the one started in (1)), so it attempts to recover it. To do so, it needs to acquire the MigrationBlockingGuard while still in drain mode. However, since the migration started in (2) managed to register on the ActiveMigrationRegistry, the MigrationBlockingGuard cannot be acquired and the recovery waits.
7. On the other side, the migration from (2) cannot make progress because the stepup holds a global lock, so it will never be able to release the ActiveMigrationRegistry.
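The circular wait in steps 6 and 7 can be pictured as a two-node wait-for graph. The following is only a toy model of that reasoning, not server code: the actor labels `stepup_recovery` and `second_moveChunk` and the helper `find_cycle` are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical wait-for graph: each actor maps to the actor it is blocked on.
# The keys are illustrative labels, not MongoDB identifiers.
WAITS_FOR: Dict[str, str] = {
    # Step-up recovery needs the MigrationBlockingGuard, which cannot be
    # acquired while the second moveChunk is registered in the
    # ActiveMigrationRegistry.
    "stepup_recovery": "second_moveChunk",
    # The second moveChunk cannot progress (and so cannot unregister)
    # while step-up holds the global lock.
    "second_moveChunk": "stepup_recovery",
}


def find_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from `start`; return the cycle if one is reached."""
    path: List[str] = []
    seen: Dict[str, int] = {}
    node: Optional[str] = start
    while node is not None:
        if node in seen:
            # We came back to an actor already on the path: that is a deadlock.
            return path[seen[node]:]
        seen[node] = len(path)
        path.append(node)
        node = waits_for.get(node)
    return None  # the chain ended at an actor that is not blocked


cycle = find_cycle(WAITS_FOR, "stepup_recovery")
print(cycle)
```

Removing either edge breaks the cycle: for example, if the second moveChunk had been interrupted by the stepdown, it would not hold the registry entry that step-up recovery waits on.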
To fix this we should make sure that moveChunk cannot run uninterrupted on a secondary.
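One possible shape for such a guard is sketched below, under the assumption that the replication-state check and the registration happen in the same critical section that stepdown uses. This is a minimal illustration, not the actual fix or MongoDB's implementation; `ToyNode`, `NotPrimaryError`, and all method names are hypothetical.

```python
import threading
from typing import Optional


class NotPrimaryError(Exception):
    """Raised when a migration tries to register on a node that is not primary."""


class ToyNode:
    """Hypothetical stand-in for a shard node; not the MongoDB implementation."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._is_primary = True
        self._active_migration: Optional[str] = None  # at most one registration

    def step_down(self) -> None:
        # Flip the state and drop the registered migration in one critical
        # section, so no registration can slip in between the two.
        with self._mutex:
            self._is_primary = False
            self._active_migration = None

    def step_up(self) -> None:
        with self._mutex:
            self._is_primary = True

    def register_migration(self, name: str) -> None:
        # Check the replication state under the same mutex as step_down, so a
        # moveChunk that arrives late fails instead of registering on a secondary.
        with self._mutex:
            if not self._is_primary:
                raise NotPrimaryError(f"{name}: node is not primary")
            if self._active_migration is not None:
                raise RuntimeError(f"{name}: another migration is active")
            self._active_migration = name

    def active_migration(self) -> Optional[str]:
        with self._mutex:
            return self._active_migration


node = ToyNode()
node.register_migration("first")
node.step_down()
try:
    node.register_migration("second")  # arrives after the stepdown completed
    rejected = False
except NotPrimaryError:
    rejected = True
node.step_up()
```

In this toy model the late registration is rejected, so nothing is registered when the node steps up again and recovery of the first migration would not have to wait on it.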
- is related to
  - SERVER-70127 Default system operations to be killable by stepdown (Closed)
- related to
  - SERVER-62245 MigrationRecovery must not assume that only one migration needs to be recovered (Closed)
  - SERVER-60161 Deadlock between config server stepdown and _configsvrRenameCollectionMetadata command (Closed)
  - SERVER-62296 MoveChunk should recover any unfinished migration before starting a new one (Closed)