Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 6.0.0
Component/s: None
Labels:
- resharding-improvements

Assigned Teams:

Cluster Scalability
Operating System:
ALL
Steps To Reproduce:

A quick way to reproduce this behavior is to simulate a WriteConcernFailed error from any of the participants while executing the _flushRoutingTableCacheUpdatesWithWriteConcern command. Another way is to return a fake error status here.

Will add further details after investigating the cause of the WriteConcernError.

Case:
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

As seen in HELP-62301, the ReshardingCoordinator can hang indefinitely while aborting. If it receives an unexpected error (such as WriteConcernFailed) while it is establishing the resharding participants (i.e. while it is executing the _flushRoutingTableCacheUpdatesWithWriteConcern command) and it is in a state greater than "preparing-to-donate", it tries to abort itself and the participants. The abort of the participants then hangs: the coordinator waits for every participant to complete its state transition to Done, but the participants that have not established their state machines do not undergo any state transitions.
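To make the failure mode concrete, here is a minimal toy model in Python. It is not the server code: `Participant`, `handle_abort`, and `shards_blocking_abort` are hypothetical names invented for illustration, and the only behavior modeled is the one described above (a participant can reach Done only if its state machine was established).

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical stand-in for one shard's resharding participant."""
    shard_id: str
    established: bool       # was the state machine created before the error?
    state: str = "unused"

    def handle_abort(self) -> None:
        # Only an established state machine can transition to Done.
        # A shard that was never established has nothing to transition.
        if self.established:
            self.state = "done"


def shards_blocking_abort(participants):
    """Send the abort to every shard and return the shards the coordinator
    would still be waiting on. A non-empty result models the hang."""
    for p in participants:
        p.handle_abort()
    return [p.shard_id for p in participants if p.state != "done"]


# The error interrupted establishment, so shard1 never got a state machine.
participants = [Participant("shard0", True), Participant("shard1", False)]
print(shards_blocking_abort(participants))
```

In this model the coordinator would wait on `shard1` forever, because nothing will ever move it to Done.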

max.hirschhorn@mongodb.com explained this behavior here

(From Max: "we're getting stuck in the way resharding coordinator is written because it doesn't expect that error. the resharding coordinator expects that all shards are established as participants before it would run _shardsvrAbortReshardCollection on them but a non-recoverable error during establishing participants means that the resharding coordinator didn't establish all the participants yet tries to wait for an acknowledgment from all of them. I think the general approach taken by the resharding coordinator can be revisited under the project to rewrite the resharding coordinator such that it doesn't rely on using the shard version protocol to prompt shards to make progress and instead have explicit, idempotent commands for each phase")
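One possible reading of "explicit, idempotent commands for each phase" is sketched below in Python. This is only an illustration of the idea, not a proposed design: `ShardStub`, `establish`, and `abort` are hypothetical names, and the sketch assumes an abort should be acknowledged even by a shard that was never established.

```python
class ShardStub:
    """Hypothetical shard that tracks resharding operations by id."""

    def __init__(self):
        self.operations = {}  # operation id -> phase

    def establish(self, op_id: str) -> str:
        # Idempotent: repeating the command does not reset the phase,
        # and it does not resurrect an operation that was already aborted.
        return self.operations.setdefault(op_id, "established")

    def abort(self, op_id: str) -> str:
        # Idempotent: acknowledges with "done" whether or not establish
        # ever ran on this shard, so the coordinator always gets an ack.
        self.operations[op_id] = "done"
        return self.operations[op_id]


shard = ShardStub()
print(shard.abort("reshard-1"))      # never established, still acknowledged
print(shard.establish("reshard-1"))  # a late establish stays aborted
```

With commands shaped like this, the coordinator could wait for an acknowledgment from every shard without depending on whether establishment completed.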

(Note: this has been investigated on v6.0; we still need to verify whether the same behavior occurs on later versions.)

is duplicated by

SERVER-86718 abortReshardCollection hangs when run for failing resharding operation where user provided zone range includes $-prefixed fields.

Closed

is related to

SERVER-86718 abortReshardCollection hangs when run for failing resharding operation where user provided zone range includes $-prefixed fields.

Closed

Assignee: Unassigned
Reporter: Nandini Bhartiya
Participants: Nandini Bhartiya
Votes: 0
Watchers: 12

Created: Jul 25 2024 04:17:57 PM UTC
Updated: Apr 01 2025 04:37:32 PM UTC
