Type: Bug
Resolution: Won't Fix
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Cluster Scalability
Operating System: ALL
Linked BF Score: 0

The recipient shard's invocation of _migrateClone can fail with NotWritablePrimary if the test's stepdown happens at just the right moment. When this happens, the chunk migration recipient's opCtx is killed with code 6718400. When the chunk migration source later checks progress with _recvChunkStatus, it finds that the operation has failed with Location6718400. This error eventually bubbles up to the original moveChunk command, which is not retried because, unlike most error codes thrown during a failover in the middle of a chunk migration, Location6718400 is not a retryable error. This can cause migration_coordinator_failover_include.js to fail when it asserts that the moveChunk command was successful.

There are a number of places in chunk migrations where we return numbered error codes:

The above list may not be exhaustive. We should consider either stopping chunk migrations from replacing retryable errors with non-retryable ones, or updating migration_coordinator_failover_include.js (and possibly other tests?) to cope with this possibility.
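The second option (teaching the test to cope) could look roughly like the sketch below. It is only an illustration of the idea, not code from the test: `runWithRetry`, `makeFlakyMoveChunk`, the `TOLERATED_CODES` set, and the stubbed command results are all hypothetical names invented here, and the real test would call moveChunk against a cluster instead of a stub. The only values taken from this ticket are the Location6718400 code and the NotWritablePrimary error (code 10107).

```javascript
// Hypothetical sketch: none of these helpers exist in the MongoDB test libraries.
const NOT_WRITABLE_PRIMARY = 10107;  // numeric code for NotWritablePrimary
const LOCATION_6718400 = 6718400;    // the numbered code described in this ticket

// Assumption: the codes a failover test would be willing to retry on.
const TOLERATED_CODES = new Set([NOT_WRITABLE_PRIMARY, LOCATION_6718400]);

// Calls runCommand() until it succeeds, fails with a code outside
// TOLERATED_CODES, or maxAttempts is reached. Returns the last result
// together with the number of attempts made.
function runWithRetry(runCommand, maxAttempts) {
    let lastResult = null;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        lastResult = runCommand();
        if (lastResult.ok === 1 || !TOLERATED_CODES.has(lastResult.code)) {
            return {result: lastResult, attempts: attempt};
        }
    }
    return {result: lastResult, attempts: maxAttempts};
}

// Stand-in for the moveChunk command: fails `failures` times with
// Location6718400 (made-up errmsg), then succeeds.
function makeFlakyMoveChunk(failures) {
    let calls = 0;
    return function() {
        calls++;
        if (calls <= failures) {
            return {ok: 0, code: LOCATION_6718400, errmsg: "stubbed migration failure"};
        }
        return {ok: 1};
    };
}

// Example: two failures caused by the failover, then success on the third try.
const outcome = runWithRetry(makeFlakyMoveChunk(2), 5);
console.log(outcome.attempts, outcome.result.ok);
```

The drawback of this approach is that every affected test needs its own list of tolerated codes, which is why the first option (not replacing retryable errors in the first place) may be preferable.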

Assignee: Brett Nawrocki

Reporter: Brett Nawrocki

Participants: Brett Nawrocki

Votes: 0

Watchers: 4

Created: Sep 07 2024 05:24:30 AM UTC

Updated: Sep 17 2024 05:35:42 PM UTC

Resolved: Sep 17 2024 05:35:42 PM UTC
