Type: Improvement
Resolution: Won't Fix
Priority: Trivial - P5
Affects Version/s: None
Component/s: None
Server Programmability
Background
In a recent test run for mongosync’s 8.0 support, a `moveCollection` command returned InterruptedAtShutdown, which my code ignored. Mongosync then ran, after which the test ran a consistency checker. That checker found a missing document on the destination.
What apparently happened: the `moveCollection` actually succeeded despite the `ok: 0` response. That success included changing the collection’s UUID, and the UUID change happened after mongosync had done its initialization (which records collection UUIDs). Thus, when mongosync fetched the collection’s documents (a request that includes the collection’s pre-`moveCollection` UUID), the server responded that the collection UUID no longer existed. As a result, my test migration lost a document.
(This also happened because mongosync was ignoring unrecognized DDL events; REP-5614 is remedying that, so the `reshardCollection` event that fires on successful `moveCollection` will crash mongosync.)
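To illustrate the failure mode above, here is a minimal sketch of a check a test harness could run after an ambiguous `moveCollection` response: compare the collection UUID recorded before the command with the UUID read afterwards. The function name and its parameters are hypothetical, not part of mongosync or any driver.

```python
from typing import Optional
from uuid import UUID


def move_took_effect(response: dict,
                     uuid_before: UUID,
                     uuid_after: Optional[UUID]) -> bool:
    """Return True when there is evidence that the move was applied.

    `response` is the raw command reply. `uuid_after` is None when the
    catalog could not be re-read after the command.
    """
    if response.get("ok") == 1:
        return True
    # An error reply does not prove failure: a changed UUID shows that
    # the operation completed despite the `ok: 0` response.
    return uuid_after is not None and uuid_after != uuid_before
```

A False result is not proof of failure, since the operation may still be in progress when the UUID is re-read. The UUIDs themselves can be taken from the `info.uuid` field of `listCollections` output.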
Proposal
The server’s error responses should indicate explicitly whether an error means the system state is indeterminate:
1. Ideally, clients/drivers could opt in to some sort of `ok: "?"` response that discourages interpretation of errors like InterruptedAtShutdown as failures. (For clients that don’t indicate support for this response in their requests, the server would still send `ok: 0`.)
2. Failing that, or perhaps in addition to it, the server could send a supplementary error label like `SystemStateUnknown`.
3. The error message should be prefixed with "System state unknown:". (I don’t know if this would violate stability guarantees.)
4. For existing versions that don’t receive a backport, we should publish a list of error codes that indicate an indeterminate system state. Or, if it’s not as simple as checking membership in a static set of codes, we should publish whatever algorithm will guide users to a correct interpretation of the responses. (NB: Per max.hirschhorn@mongodb.com, this function is more or less that set of error codes.)
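As a rough sketch of how a client might consume these signals, the classifier below maps a raw command reply to one of three outcomes. It assumes the proposed `ok: "?"` value and `SystemStateUnknown` label exist (neither does today), and its code set contains only InterruptedAtShutdown (11600) as a placeholder for whatever list the server team would publish.

```python
from enum import Enum

# Placeholder set: only InterruptedAtShutdown (11600) is taken from this
# ticket. A real list would have to come from the server team.
INDETERMINATE_CODES = frozenset({11600})

# Hypothetical error label proposed in item 2; not an existing label.
STATE_UNKNOWN_LABEL = "SystemStateUnknown"


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"


def classify(response: dict) -> Outcome:
    """Classify a raw command reply as succeeded, failed, or unknown."""
    ok = response.get("ok")
    if ok == 1:
        return Outcome.SUCCEEDED
    if ok == "?":  # hypothetical opt-in value proposed in item 1
        return Outcome.UNKNOWN
    if STATE_UNKNOWN_LABEL in response.get("errorLabels", []):
        return Outcome.UNKNOWN
    if response.get("code") in INDETERMINATE_CODES:
        return Outcome.UNKNOWN
    return Outcome.FAILED
```

A caller that receives `Outcome.UNKNOWN` would then re-check server state (for example, the collection UUID) rather than assume the command had no effect.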
- related to SERVER-69295 Mongos NoWritesPerformed errors MUST return original error
- Backlog