Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.4.0-rc7, 4.7.0
Affects Version/s: None
Component/s: Sharding
Labels: None
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4
Sprint: Sharding 2020-05-18
Linked BF Score: 28
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

When a new primary steps up in a shardsvr replica set, it launches a task to recover any migrations driven by the node's shard that were in progress when the previous primary stepped down. As part of this, the recovery process determines the outcome of each migration by loading the latest metadata from the config server and checking whether the minimum bound from the migration still belongs to the donor shard. If it does, the migration is assumed to have aborted, and the recovery process updates the persisted range deleter state on the donor and recipient shards so that any orphans on either are deleted.

If an interrupted migration committed successfully and its namespace had its shard key refined before the recovery process runs, the check for ownership will use the pre-refine minimum boundary but a post-refine routing table. This may result in a spurious overlap, which leads the recovery process to incorrectly decide the migration aborted, preventing any orphans on the donor from being cleaned up. The recipient will attempt to schedule a range deletion for the received range, which will fail with RangeOverlapConflict.
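The spurious result can be reproduced in a toy model. The sketch below is not server code. It assumes shard key values can be modelled as Python tuples, that `float("-inf")` stands in for BSON MinKey, and that a bound with fewer fields sorts before any longer bound sharing its prefix (which is how Python compares tuples, and is assumed here to mirror how the pre-refine bound compares against post-refine chunk bounds). The `owner_of` helper, the shard names, and the routing tables are all hypothetical.

```python
MIN_KEY = float("-inf")  # stand-in for BSON MinKey (assumption of this sketch)


def owner_of(routing_table, key):
    """Return the shard that owns the chunk [lo, hi) containing key."""
    for lo, hi, shard in routing_table:
        if lo <= key < hi:
            return shard
    raise KeyError(key)


# Shard key {x: 1}. The migration of [x=0, x=10) from shardA to shardB committed,
# so shardB owns that chunk in the routing table.
pre_refine = [
    ((-100,), (0,), "shardA"),
    ((0,), (10,), "shardB"),
    ((10,), (100,), "shardA"),
]

# After refining to {x: 1, y: 1}, every chunk bound gains a MinKey value for y.
post_refine = [
    (lo + (MIN_KEY,), hi + (MIN_KEY,), shard) for lo, hi, shard in pre_refine
]

migration_min = (0,)  # pre-refine minimum bound recorded for the migration
donor = "shardA"

# Against the pre-refine table the donor no longer owns the bound: committed.
assert owner_of(pre_refine, migration_min) != donor

# Against the post-refine table, (0,) sorts before (0, MinKey), so the lookup
# lands in the preceding chunk, which the donor owns. The migration therefore
# looks aborted even though it committed.
assert owner_of(post_refine, migration_min) == donor
```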

To fix this, the recovery process should extend the migration's min bound when performing the ownership check if the most recent shard key has more fields, as was done in SERVER-46386.
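A minimal sketch of the proposed fix, assuming a toy model rather than the server's actual code: shard key values are Python tuples, `float("-inf")` stands in for BSON MinKey, and extending a bound means padding it with MinKey for each field the refined key adds. The names `extend_bound`, `owner_of`, and `donor_still_owns_min`, along with the shard names and routing tables, are hypothetical.

```python
MIN_KEY = float("-inf")  # stand-in for BSON MinKey (assumption of this sketch)


def extend_bound(bound, num_key_fields):
    """Pad a bound with MinKey up to the current shard key's field count."""
    missing = num_key_fields - len(bound)
    return bound + (MIN_KEY,) * max(missing, 0)


def owner_of(routing_table, key):
    """Return the shard that owns the chunk [lo, hi) containing key."""
    for lo, hi, shard in routing_table:
        if lo <= key < hi:
            return shard
    raise KeyError(key)


def donor_still_owns_min(routing_table, migration_min, donor, num_key_fields):
    """Ownership check that first extends the pre-refine min bound."""
    extended = extend_bound(migration_min, num_key_fields)
    return owner_of(routing_table, extended) == donor


# Post-refine table for shard key {x: 1, y: 1}; the migration of [x=0, x=10)
# from shardA to shardB committed before the refine.
committed_table = [
    ((-100, MIN_KEY), (0, MIN_KEY), "shardA"),
    ((0, MIN_KEY), (10, MIN_KEY), "shardB"),
    ((10, MIN_KEY), (100, MIN_KEY), "shardA"),
]

# Same layout, but the migration aborted, so the donor kept the chunk.
aborted_table = [
    ((-100, MIN_KEY), (0, MIN_KEY), "shardA"),
    ((0, MIN_KEY), (10, MIN_KEY), "shardA"),
    ((10, MIN_KEY), (100, MIN_KEY), "shardA"),
]

# With the extended bound, the committed migration is no longer misread as aborted.
assert not donor_still_owns_min(committed_table, (0,), "shardA", 2)
assert donor_still_owns_min(aborted_table, (0,), "shardA", 2)
```

Padding with MinKey makes the extended bound equal to the refined chunk's own minimum, so the lookup resolves to the chunk that was actually migrated rather than its predecessor.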

depends on
SERVER-46386 Refining a shard key may lead to an orphan range never being cleaned up (Closed)

is duplicated by
SERVER-48209 Migration coordinator step-up recovery is not robust to intervening shard key refinement (Closed)

is related to
SERVER-45983 Perform the shardVersion recovery and refresh on a separate thread from that of the user request (Closed)

related to
SERVER-48242 Make stepping up secondary in range_deleter_interacts_correctly_with_refine_shard_key.js more robust (Closed)
SERVER-48246 Wait for RSM to detect failover in range_deleter_interacts_correctly_with_refine_shard_key.js (Closed)

Assignee: Jack Mulrow
Reporter: Jack Mulrow
Participants: Githook User, Jack Mulrow
Votes: 0
Watchers: 4

Created: May 13 2020 07:49:48 PM UTC
Updated: Oct 29 2023 10:08:14 PM UTC
Resolved: May 14 2020 10:01:28 PM UTC
