This may be a very rare race condition, but it's worth mentioning since it required a lot of investigation into a failing test in a patch. It can happen if the CSRS steps down while a test issues two consecutive moveChunk commands on different ranges (e.g. here).
When a _shardsvrMoveRange command (moveChunk in previous versions) joins an ongoing migration, it waits for the completion of the original migration, which is signaled before the ActiveMigrationRegistry is released.
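A minimal C++ sketch of this signal-before-release ordering (hypothetical names and simplified logic, not the actual MongoDB ActiveMigrationRegistry implementation): the completion future becomes ready before the registry slot is cleared, opening a short window in which a joiner observes success while the registry still appears busy.

```cpp
#include <future>
#include <mutex>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical simplification of a migration registry: one active range,
// joiners for the same range share the completion future.
class ActiveMigrationRegistrySketch {
public:
    // Register a migration for `range`, or join the ongoing one if the same
    // range is already registered; throws on a conflicting range.
    std::shared_future<void> registerOrJoin(const std::string& range) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_activeRange) {
            if (*_activeRange == range)
                return _completion;  // join the ongoing migration
            throw std::runtime_error("conflicting migration in progress");
        }
        _activeRange = range;
        _promise = std::promise<void>();
        _completion = _promise.get_future().share();
        return _completion;
    }

    void finish() {
        _promise.set_value();  // (1) signal completion first...
        // <-- window: a joiner waiting on the future wakes up here and can
        //     report success while _activeRange is still set.
        std::lock_guard<std::mutex> lk(_mutex);
        _activeRange.reset();  // (2) ...then release the registry slot
    }

private:
    std::mutex _mutex;
    std::optional<std::string> _activeRange;
    std::promise<void> _promise;
    std::shared_future<void> _completion;
};
```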
As a result, the following flow could be reproduced:
- Router sends moveChunk to CSRS node A
- CSRS node A sends _shardsvrMoveRange to shard
- CSRS node A steps down and CSRS node B steps up
- Router receives an error from CSRS node A, retries the moveChunk
- CSRS node B sends _shardsvrMoveRange to shard, joining ongoing migration
- The ongoing migration succeeds and signals completion before releasing the ActiveMigrationRegistry
- [very fast] Router receives success from CSRS node B, sends a new moveChunk for a different range
- [very fast] CSRS B forwards the new operation to the shard
- The shard replies with an error because the ActiveMigrationRegistry has not been released yet (so the test fails; see the sketch below)
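A hypothetical driver for the registry sketch above (assuming that class is in scope), mirroring the flow just described: the joined command observes success as soon as completion is signaled, and an immediately issued migration for a different range can still find the registry occupied. Whether the error fires depends on thread scheduling, which is why the failure is rare.

```cpp
#include <iostream>
#include <thread>

int main() {
    ActiveMigrationRegistrySketch registry;

    // Original _shardsvrMoveRange from CSRS node A.
    auto original = registry.registerOrJoin("range1");
    // Retried _shardsvrMoveRange from CSRS node B joins the ongoing migration.
    auto joined = registry.registerOrJoin("range1");

    // The migration completes on the shard.
    std::thread shard([&] { registry.finish(); });

    // The joiner returns as soon as completion is signaled, possibly before
    // the registry slot has been released.
    joined.wait();

    // Next moveChunk for a different range, sent immediately by the router.
    try {
        registry.registerOrJoin("range2");
        std::cout << "registry already released, no race this run\n";
    } catch (const std::exception& e) {
        std::cout << "race hit: " << e.what() << "\n";
    }
    shard.join();
}
```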