Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.3, 5.1.0-rc0
Affects Version/s: None
Component/s: Sharding
Labels:
- PM-234-M3
- PM-234-T-lifecycle

Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v5.0
Sprint: Sharding 2021-07-12
Linked BF Score: 134
Story Points: 2
Confidence Status: None
Work Order: 0
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

The _flushReshardingStateChange command doesn't attempt to join any shard version refresh that is already running. This was believed to be safe because shard version refreshes are not possible while the critical section is held. However, an earlier _flushReshardingStateChange command that was interrupted by stepdown may have called CollectionShardingRuntime::setShardVersionRecoverRefreshFuture(), and the RecoverRefreshThread may not yet have finished running the shard version refresh, at the end of which it would have called CollectionShardingRuntime::resetShardVersionRecoverRefreshFuture().

As a result, the later _flushReshardingStateChange command hits the invariant in CollectionShardingRuntime::setShardVersionRecoverRefreshFuture().
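
The race can be illustrated with a toy model. This is only a sketch: `ToyShardingRuntime` and `secondFlushTripsInvariant` are hypothetical names, not server code, and the sketch assumes the invariant checks that no recover/refresh future is already set. It throws instead of aborting the process so the failure can be observed.

```cpp
#include <future>
#include <optional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the recover/refresh slot on
// CollectionShardingRuntime. Assumption: the real invariant requires the
// slot to be empty when a new future is set.
class ToyShardingRuntime {
public:
    void setRecoverRefreshFuture(std::shared_future<void> future) {
        if (_recoverRefresh.has_value()) {
            // The real server would trip an invariant here.
            throw std::logic_error("recover/refresh future already set");
        }
        _recoverRefresh = std::move(future);
    }

    void resetRecoverRefreshFuture() {
        _recoverRefresh.reset();
    }

private:
    std::optional<std::shared_future<void>> _recoverRefresh;
};

// Returns true if a second command instance trips the precondition because
// the refresh started by the first (interrupted) instance has not yet
// reset the slot.
bool secondFlushTripsInvariant(bool firstRefreshFinished) {
    ToyShardingRuntime csr;

    // First command instance: sets the future, then is interrupted.
    std::promise<void> firstRefresh;
    csr.setRecoverRefreshFuture(firstRefresh.get_future().share());

    if (firstRefreshFinished) {
        // The refresh thread finishes and clears the slot.
        firstRefresh.set_value();
        csr.resetRecoverRefreshFuture();
    }

    // Second command instance: sets a future without joining the first.
    std::promise<void> secondRefresh;
    try {
        csr.setRecoverRefreshFuture(secondRefresh.get_future().share());
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```

In this model the failure only appears when the second instance arrives before the first refresh has reset the slot, which matches the stepdown timing described in this ticket.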

Proposed solution: Rather than making the _flushReshardingStateChange command join a shard version refresh triggered by an earlier instance of the command, we could instead introduce a new _shardsvrCommitReshardCollection command analogous to the _shardsvrAbortReshardCollection command introduced in SERVER-56638. The _shardsvrCommitReshardCollection command would:

- Call _coordinatorHasDecisionPersisted.emplaceValue().
- Wait on DonorStateMachine::getCompletionFuture() and RecipientStateMachine::getCompletionFuture().
- Wait for the latest optime to become majority-committed.
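
The three steps could be sketched roughly as follows. This is a simplified model using the C++ standard library rather than the server's own future types: `ToyStateMachine` and `toyCommitReshardCollection` are hypothetical names, the majority-commit wait is only a placeholder, and the completion futures are deferred so the sketch stays single-threaded.

```cpp
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for DonorStateMachine / RecipientStateMachine.
// The completion future becomes ready only after the coordinator's
// decision has been signalled.
struct ToyStateMachine {
    std::promise<void> coordinatorHasDecisionPersisted;
    std::future<void> completion;

    ToyStateMachine() {
        auto decision = coordinatorHasDecisionPersisted.get_future().share();
        // Deferred: the body runs when completion.get() is called.
        completion = std::async(std::launch::deferred,
                                [decision] { decision.wait(); });
    }
};

// Sketch of the proposed command body. Returns the steps in the order they
// ran so the ordering can be inspected.
std::vector<std::string> toyCommitReshardCollection(ToyStateMachine& donor,
                                                    ToyStateMachine& recipient) {
    std::vector<std::string> steps;

    // Step 1: signal that the coordinator's decision is persisted.
    donor.coordinatorHasDecisionPersisted.set_value();
    recipient.coordinatorHasDecisionPersisted.set_value();
    steps.push_back("decision persisted");

    // Step 2: wait for both state machines to complete.
    donor.completion.get();
    recipient.completion.get();
    steps.push_back("state machines completed");

    // Step 3: placeholder for waiting until the latest optime is
    // majority-committed (not modelled here).
    steps.push_back("majority committed");

    return steps;
}
```

The point of the sketch is that the command waits on the state machines' own completion rather than triggering a shard version refresh itself, so there is no refresh future for a later instance of the command to collide with.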

With the proposed _shardsvrCommitReshardCollection command, DonorStateMachine and RecipientStateMachine would additionally need to be changed to call CollectionShardingRuntime::clearFilteringMetadata() prior to releasing the critical section. This is needed to guarantee that a stale mongos is told to refresh its shard version, rather than getting a response of "no documents", after the donor shard has dropped the original collection. DonorStateMachine and RecipientStateMachine should additionally call onShardVersionMismatch() after releasing the critical section to eagerly refresh their shard version and learn of the new collection epoch before the first operation for the namespace being resharded comes in.
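
Why the ordering matters can be shown with a deliberately simplified model. All names here (`ToyDonorShard`, `ToyResponse`, `finishDonor`) are hypothetical, and the model assumes that a shard with unknown filtering metadata rejects a stale router's request instead of serving it.

```cpp
// Possible outcomes of a read sent by a router with stale routing info.
enum class ToyResponse {
    BlockedByCriticalSection,
    StaleConfig,   // router is told to refresh its shard version
    NoDocuments,   // router silently sees an empty result
    Documents
};

// Hypothetical, heavily simplified donor shard state.
struct ToyDonorShard {
    bool criticalSectionHeld = true;
    bool filteringMetadataKnown = true;
    bool originalCollectionDropped = false;

    ToyResponse readFromStaleRouter() const {
        if (criticalSectionHeld) {
            return ToyResponse::BlockedByCriticalSection;
        }
        if (!filteringMetadataKnown) {
            // Assumption: unknown metadata forces a version check that the
            // stale router fails.
            return ToyResponse::StaleConfig;
        }
        return originalCollectionDropped ? ToyResponse::NoDocuments
                                         : ToyResponse::Documents;
    }
};

// Models the end of the donor's work: the original collection is dropped,
// the filtering metadata is optionally cleared, and only then is the
// critical section released.
ToyResponse finishDonor(bool clearMetadataBeforeRelease) {
    ToyDonorShard shard;
    shard.originalCollectionDropped = true;
    if (clearMetadataBeforeRelease) {
        shard.filteringMetadataKnown = false;
    }
    shard.criticalSectionHeld = false;
    return shard.readFromStaleRouter();
}
```

In this model, `finishDonor(true)` yields `StaleConfig` while `finishDonor(false)` yields `NoDocuments`, which is the incorrect response the proposed clearFilteringMetadata() call is meant to rule out.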

is depended on by
- SERVER-58343 Re-enable reshard_collection_failover_shutdown_basic.js (Closed)

is related to
- SERVER-56638 Fix flushReshardingStateChanges critical section race (Closed)
- SERVER-57952 Resharding donor shards cannot complete a shard version refresh after acquiring the critical section, stalling the resharding operation (Closed)

related to
- SERVER-58063 Alias "flushReshardingStateChanges" as "_shardsvrCommitReshardCollection" to allow 5.0.0 and 5.0.1 compatibility (Closed)

Assignee: Blake Oler
Reporter: Max Hirschhorn
Participants: Blake Oler, Githook User, Max Hirschhorn, Vivian Ge
Votes: 0
Watchers: 7
Created: Jun 22 2021 04:44:42 PM UTC
Updated: Oct 29 2023 09:51:46 PM UTC
Resolved: Jul 08 2021 05:12:59 PM UTC
Confidence Status Last Update: 22/Jun/21 6:33 PM
