- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: 6.0.0
- Component/s: None
- Cluster Scalability
- ALL
- Cluster Scalability Priorities
- (copied to CRM)
As seen in HELP-62301, the ReshardingCoordinator can hang indefinitely while aborting. If it receives an unexpected error (such as a WriteConcernFailed error) while it is establishing the resharding participants (i.e. while it is executing the _flushRoutingTableCacheUpdatesWithWriteConcern command) and it is in a state greater than "preparing-to-donate", it tries to abort itself and the participants. The abort of the participants then hangs, because the coordinator waits for all of its participants to complete their state transition to Done, but the participants that have not established their state machines do not undergo state transitions.
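The failure mode described above can be illustrated with a toy model. This is only a sketch, not MongoDB server code: `ToyParticipant`, `abort_and_wait`, and the `max_polls` bound are hypothetical names invented for illustration, and the bound exists only so that the example terminates (the behavior described in this ticket has no such bound).

```python
from dataclasses import dataclass


@dataclass
class ToyParticipant:
    """Hypothetical stand-in for a shard; not the real resharding participant."""
    name: str
    established: bool = False  # True once the shard has a participant state machine
    state: str = "unused"

    def on_abort(self) -> None:
        # Only an established state machine can transition to "done".
        if self.established:
            self.state = "done"


def abort_and_wait(participants, max_polls=3):
    """Send abort to every participant, then poll until all report "done".

    Returns the names of participants that never reached "done".
    max_polls is an assumption of this sketch so that it terminates.
    """
    for p in participants:
        p.on_abort()
    pending = [p.name for p in participants if p.state != "done"]
    polls = 0
    while pending and polls < max_polls:
        polls += 1
        pending = [p.name for p in participants if p.state != "done"]
    return pending


# shardA was established before the error; shardB was not.
shards = [ToyParticipant("shardA", established=True), ToyParticipant("shardB")]
stuck = abort_and_wait(shards)
print(stuck)  # ['shardB']
```

In this model the wait can never finish for `shardB`, because nothing on that shard is able to perform the transition the coordinator is waiting for.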
max.hirschhorn@mongodb.com explained this behavior here
(From Max: "we're getting stuck in the way resharding coordinator is written because it doesn't expect that error. the resharding coordinator expects that all shards are established as participants before it would run _shardsvrAbortReshardCollection on them but a non-recoverable error during establishing participants means that the resharding coordinator didn't establish all the participants yet tries to wait for an acknowledgment from all of them. I think the general approach taken by the resharding coordinator can be revisited under the project to rewrite the resharding coordinator such that it doesn't rely on using the shard version protocol to prompt shards to make progress and instead have explicit, idempotent commands for each phase")
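One possible reading of "explicit, idempotent commands for each phase" is sketched below. This is an assumption made for illustration, not the planned design: `ToyShard`, `handle`, `abort_all`, and the operation id `"op-1"` are hypothetical names. The point is that a shard acknowledges an abort command whether or not it was ever established as a participant, and that repeating the command changes nothing.

```python
class ToyShard:
    """Hypothetical shard that acknowledges explicit per-phase commands."""

    def __init__(self, name):
        self.name = name
        self.acked = {}  # operation id -> set of acknowledged phases

    def handle(self, op_id, phase):
        # Idempotent: recording the same phase twice leaves the state unchanged,
        # and a shard that never saw the operation can still acknowledge.
        self.acked.setdefault(op_id, set()).add(phase)
        return {"shard": self.name, "op": op_id, "phase": phase, "ok": True}


def abort_all(shards, op_id):
    """Send the explicit abort command to every shard and collect the replies."""
    return [shard.handle(op_id, "abort") for shard in shards]


shard_a = ToyShard("shardA")
shard_b = ToyShard("shardB")
shard_a.handle("op-1", "establish")  # shardB is never established

first = abort_all([shard_a, shard_b], "op-1")
second = abort_all([shard_a, shard_b], "op-1")  # a retry is harmless

print(all(reply["ok"] for reply in first))  # True
print(first == second)                      # True
```

Under this model the coordinator does not depend on a shard already having a state machine in order to receive an acknowledgment from it.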
(Note: this has been investigated on v6.0; we still need to verify whether the same behavior occurs on later versions.)
- is related to
- SERVER-86718 abortReshardCollection hangs when run for failing resharding operation as described in SERVER-86717.
- Backlog