Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: car-investigation
Assigned Teams: Catalog and Routing
Operating System: ALL
Sprint: CAR Team 2024-08-05
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

As described for a particular case in SERVER-87927, a race is possible in which a movePrimary interleaves with the setFeatureCompatibilityVersion upgrade / downgrade checks and cleanup actions and bypasses them. To break this down:
1. Let's say we have a collection living on Shard A that would trigger the checks in _userCollectionsUassertsForDowngrade(...).
2. An FCV downgrade command is received, and the config server tells both shards to start downgrading.
3. Shard A and Shard B both finish reaching the "transitioning" state.
4. A movePrimary operation starts to copy the problematic collection from Shard A to Shard B.
5. The movePrimary runs before Shard A performs the user collection checks, but after Shard B has already completed them. The collection is therefore copied from Shard A to Shard B and is never caught on Shard B, because Shard B finished its checks before the collection arrived.
6. After the movePrimary, Shard A runs the user collection checks. But there's nothing to check for (the collection was already migrated), so it passes.
7. Therefore the FCV downgrade completes without any of the checks ever seeing the problematic collection.
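
The interleaving in steps 1-7 can be reproduced with a small toy model. This is only a sketch of the ordering problem, not server code: `Shard`, `move_primary`, `simulate_race`, and the collection name `db.bad` are hypothetical names made up for illustration, and `run_user_collection_checks` merely stands in for _userCollectionsUassertsForDowngrade(...).

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Toy shard: a name, the collections it owns, and whether its checks passed."""
    name: str
    collections: set = field(default_factory=set)
    checked: bool = False

    def run_user_collection_checks(self, problematic):
        # Hypothetical stand-in for the downgrade checks: fail if any
        # problematic collection currently lives on this shard.
        bad = self.collections & problematic
        if bad:
            raise RuntimeError(f"{self.name}: cannot downgrade, found {sorted(bad)}")
        self.checked = True


def move_primary(src, dst, coll):
    # Toy movePrimary: the collection leaves src and appears on dst.
    src.collections.discard(coll)
    dst.collections.add(coll)


def simulate_race():
    problematic = {"db.bad"}
    shard_a = Shard("A", {"db.bad"})  # step 1: problematic collection on Shard A
    shard_b = Shard("B")

    # Steps 2-3: both shards are already in the "transitioning" state
    # (modelled here only through the ordering of the calls below).
    shard_b.run_user_collection_checks(problematic)  # Shard B checks first and passes
    move_primary(shard_a, shard_b, "db.bad")         # steps 4-5: collection moves A -> B
    shard_a.run_user_collection_checks(problematic)  # step 6: nothing left on A, passes

    # Step 7: both shards report success, yet the collection still exists.
    downgrade_completed = shard_a.checked and shard_b.checked
    leaked = bool(shard_b.collections & problematic)
    return downgrade_completed, leaked
```

In this model both checks pass and the problematic collection still ends up on Shard B, which is the outcome described in step 7. Running either shard's check after the move (or blocking the move) would make the simulation raise instead.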

We may have to rethink when and where we call _userCollectionsUassertsForDowngrade and _internalServerCleanupForDowngrade, or consider whether we need to disallow migrations while we're in the "transitioning" FCV state. While the example above uses movePrimary, as part of this ticket we should make sure such a bug isn't possible with the other migration methods we have, like moveChunk or resharding.
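
One of the options above, disallowing migrations while the FCV is "transitioning", could look roughly like the guard below. This is a hedged sketch of the idea only, not the fix that was adopted: `FcvState`, `MigrationBlockedError`, `ensure_migration_allowed`, and `try_migration` are all hypothetical names, and a real fix would also need to handle migrations already in flight when the transition starts.

```python
from enum import Enum


class FcvState(Enum):
    """Simplified FCV states (hypothetical; the real state machine has more detail)."""
    STABLE = "stable"
    TRANSITIONING = "transitioning"


class MigrationBlockedError(Exception):
    """Raised when a data-moving operation is attempted mid-transition."""


def ensure_migration_allowed(fcv_state, operation):
    # Reject movePrimary / moveChunk / resharding while the FCV is transitioning,
    # so no collection can change shards between the per-shard checks.
    if fcv_state is FcvState.TRANSITIONING:
        raise MigrationBlockedError(f"{operation} rejected: FCV is transitioning")


def try_migration(fcv_state, operation):
    # Returns True if the operation may proceed, False if the guard blocked it.
    try:
        ensure_migration_allowed(fcv_state, operation)
    except MigrationBlockedError:
        return False
    return True
```

With this guard, `try_migration(FcvState.TRANSITIONING, "movePrimary")` is refused while the same call in the stable state proceeds. The trade-off is that every migration path has to call the guard, which is why the ticket asks for moveChunk and resharding to be reviewed as well.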

This bug was found when I added code to the _internalServerCleanupForDowngrade() function.

It seems that SERVER-87927 is a particular instance of this bug.

duplicates

SERVER-91702 Removal of recordIdsReplicated leaves inconsistent metadata on downgrade for sharded clusters (Backlog)

is related to

SERVER-91702 Removal of recordIdsReplicated leaves inconsistent metadata on downgrade for sharded clusters (Backlog)
SERVER-87927 movePrimary + FCV downgrade race could potentially result in timeseriesBucketingParametersHaveChanged existing on 7.0 (Closed)
SERVER-88238 checkMetadataConsistency interleaves with collMod during upgrade / downgrade (Closed)

related to

SERVER-90094 collMods occurring as part of setFCV may lose some collections in sharded clusters (Closed)
SERVER-89634 Enable tests that concurrently perform DDL and setFCV operations on the recordIdsReplicated:true variant (Closed)

Assignee: Marcos José Grillo Ramirez
Reporter: Vishnu Kaushik
Participants: Marcos José Grillo Ramirez, Vishnu Kaushik
Votes: 0
Watchers: 8

Created: Mar 13 2024 08:39:19 PM UTC
Updated: Jul 23 2024 10:00:51 AM UTC
Resolved: Jul 23 2024 10:00:50 AM UTC
