- Type: Bug
- Resolution: Won't Fix
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- None
- Cluster Scalability
- ALL
- 0
The recipient shard's invocation of _migrateClone can fail with NotWritablePrimary if the test's stepdown happens at just the right moment. When that happens, the chunk migration recipient's opCtx is killed with code 6718400. When the chunk migration source later checks progress with _recvChunkStatus, it finds that the operation has failed with Location6718400. This error eventually bubbles up to the original moveChunk command, which is not retried because Location6718400, unlike most error codes thrown when a failover occurs during a chunk migration, is not a retryable error. This can ultimately cause migration_coordinator_failover_include.js to fail when it asserts that the moveChunk command was successful.
There are a number of places in chunk migrations where we return numbered error codes:
The above list may not be exhaustive. We should consider either stopping chunk migrations from replacing retryable errors with non-retryable ones or updating migration_coordinator_failover_include.js (and possibly other tests) to cope with this possibility.
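As a rough illustration of the second option, a test could retry moveChunk when it fails with the numbered code described above. The following is only a sketch under stated assumptions: the helper name moveChunkWithRetry, the constant name, the attempt limit, and the runMoveChunk callback are all hypothetical and are not taken from migration_coordinator_failover_include.js. Whether retrying on Location6718400 is actually safe would need to be confirmed against the migration code.

```javascript
// Hypothetical helper; not part of the existing test.
// The code 6718400 comes from the ticket. Treating it as retryable in the
// test is an assumption made for this sketch.
const kRecipientOpCtxKilledCode = 6718400;

// Runs `runMoveChunk` until it succeeds, fails with a code outside
// `retryableCodes`, or `maxAttempts` is reached. `runMoveChunk` is a
// caller-supplied function that returns a command response shaped like
// {ok: 1} or {ok: 0, code: <number>, errmsg: <string>}.
function moveChunkWithRetry(runMoveChunk,
                            retryableCodes = [kRecipientOpCtxKilledCode],
                            maxAttempts = 3) {
    let res;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        res = runMoveChunk();
        // Stop on success or on an error the caller did not mark retryable.
        if (res.ok === 1 || !retryableCodes.includes(res.code)) {
            return res;
        }
    }
    // Every attempt failed with a retryable code; return the last response
    // so the caller's assertion still sees the failure.
    return res;
}
```

In a test, runMoveChunk would presumably wrap the existing moveChunk invocation, and the test would assert on the value this helper returns rather than on the first response.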