- Type: Bug
- Resolution: Won't Fix
- Priority: Major - P3
- Fix Version/s: None
- Affects Version/s: 8.0.0-rc0, 7.3.0
- Component/s: None
- Assigned Teams: Catalog and Routing
- Operating System: ALL
- Sprint: CAR Team 2024-04-01, CAR Team 2024-04-15, CAR Team 2024-04-29, CAR Team 2024-05-13, CAR Team 2024-05-27, CAR Team 2024-06-10, CAR Team 2024-06-24, CAR Team 2024-07-08
The aggregate commands used by checkMetadataConsistency [1] [2] don't set {readConcern: {level: 'snapshot', atClusterTime: <TS>}}. That means the reference metadata previously captured by the shard may be stale if the metadata is modified before the aggregate command runs.
As a result, a collMod (or some other metadata-modifying operation) that acts on a shard directly without taking the DDL lock, such as during an upgrade / downgrade, can interleave with the checkMetadataConsistency command and produce a situation where the previous metadata doesn't match the new metadata, even on the same shard.
It's not clear whether the bug here is that checkMetadataConsistency doesn't use a snapshot, or that collMod during upgrade / downgrade doesn't take the DDL lock.
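For illustration, a minimal sketch (my own, not the server's internal pipeline) of what a snapshot-pinned read looks like from the shell: an aggregate that passes readConcern level 'snapshot' with an explicit atClusterTime reads data as of that timestamp, so a collMod that commits after the timestamp would not be visible to it. The empty pipeline and the way the timestamp is obtained here are placeholders.
// Illustrative only: a read pinned to a specific cluster time.
// 'clusterTime' is a placeholder for a timestamp captured before the check
// starts (e.g. the operationTime of an earlier command); it must still be
// inside the storage engine's history window.
mongos> const clusterTime = db.adminCommand({ ping: 1 }).operationTime;
mongos> db.getSiblingDB('test').runCommand({
    aggregate: 'mycoll',
    pipeline: [],
    cursor: {},
    readConcern: { level: 'snapshot', atClusterTime: clusterTime }
});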
Reproducer: I issue a collMod to the shard directly so that it interleaves with checkMetadataConsistency, which then complains that shard 0's metadata doesn't match its own metadata:
// Shard the coll
mongos> db.adminCommand({ shardCollection: 'test.mycoll', key: {_id: 1} })

// On the shard that the collection lives, set a failpoint here:
// https://github.com/mongodb/mongo/blob/aadd0e171ac7aa8982618db9aad0dab283d7cdeb/src/mongo/db/s/metadata_consistency_util.cpp#L649
shard-rs0:primary> db.adminCommand({ configureFailPoint: "pauseBeforeAgg", mode: "alwaysOn" });

// Try to check metadata consistency - this will hang on the failpoint.
mongos> db.checkMetadataConsistency();

// Collmod on the shard directly. This is something that upgrading / downgrading
// would usually trigger:
shard-rs0:primary> db.runCommand({collMod: 'mycoll', validator: {a: {$gt: -10}}});

// Turn off the failpoint to let checkMetadataConsistency complete
shard-rs0:primary> db.adminCommand({ configureFailPoint: "pauseBeforeAgg", mode: "off" });

// checkMetadataConsistency would have errored:
{
    "cursor" : {
        "id" : NumberLong(0),
        "ns" : "test.$cmd.aggregate",
        "firstBatch" : [
            {
                "type" : "CollectionOptionsMismatch",
                "description" : "Collection registered on the sharding catalog not found on the given shards",
                "details" : {
                    "namespace" : "test.mycoll",
                    "options" : [
                        {
                            "shards" : [ "shard-rs0" ],
                            "options" : {
                                "uuid" : UUID("095a4222-0ba3-4d22-b295-fbbf010ce6f9"),
                                "validator" : { "a" : { "$gt" : -10 } },
                                "validationLevel" : "strict",
                                "validationAction" : "error"
                            }
                        },
                        {
                            "shards" : [ "shard-rs0" ],
                            "options" : {
                                "uuid" : UUID("095a4222-0ba3-4d22-b295-fbbf010ce6f9")
                            }
                        }
                    ]
                }
            }
        ]
    },
    "ok" : 1,
    ...
}
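As a side note, the reproducer can be made deterministic by waiting for the failpoint to actually be hit before issuing the collMod. This is a sketch that assumes test commands are enabled and uses the same (possibly locally added) pauseBeforeAgg failpoint as above:
// Run on the shard primary after starting checkMetadataConsistency and before
// the collMod; waits until the failpoint has been entered at least once.
shard-rs0:primary> db.adminCommand({
    waitForFailPoint: "pauseBeforeAgg",
    timesEntered: 1,
    maxTimeMS: 60000
});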
is related to
- SERVER-91702 Removal of recordIdsReplicated leaves inconsistent metadata on downgrade for sharded clusters (Backlog)

related to
- SERVER-87931 MovePrimary + FCV upgrade / downgrade race may dodge FCV cleanup & checks (Closed)
- SERVER-90094 collMods occurring as part of setFCV may lose some collections in sharded clusters (Closed)
- SERVER-92026 drop_database_sharded_setFCV.js should not run with recordIdsReplicated:true (Closed)