- Type: Bug
- Resolution: Won't Fix
- Priority: Major - P3
- Fix Version/s: None
- Affects Version/s: 8.0.0-rc0, 7.3.0
- Component/s: None
- Assigned Teams: Catalog and Routing
- Operating System: ALL
- Sprint: CAR Team 2024-04-01, CAR Team 2024-04-15, CAR Team 2024-04-29, CAR Team 2024-05-13, CAR Team 2024-05-27, CAR Team 2024-06-10, CAR Team 2024-06-24, CAR Team 2024-07-08
The aggregate commands used by checkMetadataConsistency [1] [2] don't set {readConcern: {level: 'snapshot', atClusterTime: <TS>}}. That means the reference metadata previously captured by the shard may be stale if the metadata is modified before the aggregate command runs.
As a result, a collMod (or some other metadata-modifying operation) that acts on a shard directly without taking the DDL lock, such as during an upgrade / downgrade, can interleave with the checkMetadataConsistency command and produce a situation where the previous metadata doesn't match the new metadata, even on the same shard.
It's not clear whether the bug here is that checkMetadataConsistency doesn't use a snapshot, or that collMod during upgrade / downgrade doesn't take the DDL lock.
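For illustration, a minimal sketch (my own, not the server's internal pipeline) of what a snapshot-pinned read looks like from the shell: an aggregate that passes readConcern level 'snapshot' with an explicit atClusterTime reads data as of that timestamp, so a collMod that commits after the timestamp would not be visible to it. The empty pipeline and the way the timestamp is obtained here are placeholders.
// Illustrative only: a read pinned to a specific cluster time.
// 'clusterTime' is a placeholder for a timestamp captured before the check
// starts (e.g. the operationTime of an earlier command); it must still be
// inside the storage engine's history window.
mongos> const clusterTime = db.adminCommand({ ping: 1 }).operationTime;
mongos> db.getSiblingDB('test').runCommand({
    aggregate: 'mycoll',
    pipeline: [],
    cursor: {},
    readConcern: { level: 'snapshot', atClusterTime: clusterTime }
});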
Reproducer: I issue a collMod to the shard directly so that it interleaves with checkMetadataConsistency, which then complains that shard 0's metadata doesn't match its own metadata:
// Shard the coll
mongos> db.adminCommand({ shardCollection: 'test.mycoll', key: {_id: 1} })

// On the shard that the collection lives, set a failpoint here:
// https://github.com/mongodb/mongo/blob/aadd0e171ac7aa8982618db9aad0dab283d7cdeb/src/mongo/db/s/metadata_consistency_util.cpp#L649
shard-rs0:primary> db.adminCommand({ configureFailPoint: "pauseBeforeAgg", mode: "alwaysOn" });

// Try to check metadata consistency - this will hang on the failpoint.
mongos> db.checkMetadataConsistency();

// Collmod on the shard directly. This is something that upgrading / downgrading
// would usually trigger:
shard-rs0:primary> db.runCommand({collMod: 'mycoll', validator: {a: {$gt: -10}}});

// Turn off the failpoint to let checkMetadataConsistency complete
shard-rs0:primary> db.adminCommand({ configureFailPoint: "pauseBeforeAgg", mode: "off" });

// checkMetadataConsistency would have errored:
{
    "cursor" : {
        "id" : NumberLong(0),
        "ns" : "test.$cmd.aggregate",
        "firstBatch" : [
            {
                "type" : "CollectionOptionsMismatch",
                "description" : "Collection registered on the sharding catalog not found on the given shards",
                "details" : {
                    "namespace" : "test.mycoll",
                    "options" : [
                        {
                            "shards" : [ "shard-rs0" ],
                            "options" : {
                                "uuid" : UUID("095a4222-0ba3-4d22-b295-fbbf010ce6f9"),
                                "validator" : { "a" : { "$gt" : -10 } },
                                "validationLevel" : "strict",
                                "validationAction" : "error"
                            }
                        },
                        {
                            "shards" : [ "shard-rs0" ],
                            "options" : {
                                "uuid" : UUID("095a4222-0ba3-4d22-b295-fbbf010ce6f9")
                            }
                        }
                    ]
                }
            }
        ]
    },
    "ok" : 1,
    ...
}
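As a side note, the reproducer can be made deterministic by waiting for the failpoint to actually be hit before issuing the collMod. This is a sketch that assumes test commands are enabled and uses the same (possibly locally added) pauseBeforeAgg failpoint as above:
// Run on the shard primary after starting checkMetadataConsistency and before
// the collMod; waits until the failpoint has been entered at least once.
shard-rs0:primary> db.adminCommand({
    waitForFailPoint: "pauseBeforeAgg",
    timesEntered: 1,
    maxTimeMS: 60000
});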
is related to
- SERVER-91702 Removal of recordIdsReplicated leaves inconsistent metadata on downgrade for sharded clusters (Backlog)

related to
- SERVER-87931 MovePrimary + FCV upgrade / downgrade race may dodge FCV cleanup & checks (Closed)
- SERVER-90094 collMods occurring as part of setFCV may lose some collections in sharded clusters (Closed)
- SERVER-92026 drop_database_sharded_setFCV.js should not run with recordIdsReplicated:true (Closed)