-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: 7.1.0-rc0, 6.0.6
-
Component/s: None
-
Catalog and Routing
-
Fully Compatible
-
ALL
-
v8.0, v7.3, v7.0, v6.0
-
Sharding EMEA 2023-10-16, Sharding EMEA 2023-10-30, CAR Team 2023-11-13, CAR Team 2023-11-27, CAR Team 2023-12-11, CAR Team 2023-12-25, CAR Team 2024-01-08, CAR Team 2024-02-05, CAR Team 2024-02-19, CAR Team 2024-03-04, CAR Team 2024-03-18, CAR Team 2024-04-01, CAR Team 2024-04-15
-
2
In a Mongosync test, we encountered a CollectionUUIDMismatch error in which the actual collection name was equal to the expected collection name. The error occurred when issuing a delete command on a sharded collection in a 3-shard cluster.
The collection UUID of interest is e97a6bd1-498d-4dbf-8477-d77190fb744b for namespace "testDB.testColl2".
A first observation from the mongos logs is that the collection testColl2 consists of 2 chunks, one on shard dst-sh01 and another on dst-sh02, leaving dst-sh03 with no chunks. This was explicitly set up by Mongosync using the "updateZoneKeyRange" command on those two shards with those specific chunks, intentionally omitting dst-sh03.
{"t":{"$date":"2023-07-17T20:36:38.985+00:00"},"s":"D3", "c":"EXECUTOR", "id":22608, "ctx":"ShardRegistry","msg":"Received remote response","attr":{"response":"RemoteOnAnyResponse -- cmd: { cursor: { firstBatch: [ { _id: \"testDB.testColl2\", lastmodEpoch: ObjectId('64b5a6557ce7fa44aa9b9b67'), lastmod: new Date(1689626198046), timestamp: Timestamp(1689626197, 140), uuid: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), key: { _id: 1 }, unique: false, noBalance: false }, { chunks: { _id: ObjectId('64b5a656bbcf3f85c9eb21f7'), uuid: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), min: { _id: MinKey }, max: { _id: 74 }, shard: \"dst-sh01\", lastmod: Timestamp(1, 0), onCurrentShardSince: Timestamp(1689626197, 140), history: [ { validAfter: Timestamp(1689626197, 140), shard: \"dst-sh01\" } ] } }, { chunks: { _id: ObjectId('64b5a656bbcf3f85c9eb21f8'), uuid: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), min: { _id: 74 }, max: { _id: MaxKey }, shard: \"dst-sh02\", lastmod: Timestamp(1, 1), onCurrentShardSince: Timestamp(1689626197, 140), history: [ { validAfter: Timestamp(1689626197, 140), shard: \"dst-sh02\" } ] } } ], id: 0, ns: \"config.collections\", atClusterTime: Timestamp(1689626198, 120) }, ok: 1.0, $clusterTime: { clusterTime: Timestamp(1689626198, 134), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1689626198, 120), $topologyTime: Timestamp(1689626063, 2), operationTime: Timestamp(1689626198, 120) } target: localhost:28121 status: OK elapsedMicros: 2900 μs moreToCome: false"}}
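The chunk layout in the response above can be summarized with a small sketch. This is illustrative only: the function name `shards_without_chunks` and the simplified chunk dictionaries are assumptions made for this example, not server or Mongosync code. It shows which shards own no chunk of testColl2.

```python
from typing import Iterable


def shards_without_chunks(chunks: Iterable[dict], all_shards: Iterable[str]) -> list:
    """Return the shards that own no chunk of the collection (hypothetical helper)."""
    owners = {chunk["shard"] for chunk in chunks}
    return sorted(set(all_shards) - owners)


# Simplified copies of the two chunk entries from the log line above.
chunks = [
    {"min": "MinKey", "max": 74, "shard": "dst-sh01"},
    {"min": 74, "max": "MaxKey", "shard": "dst-sh02"},
]
all_shards = ["dst-sh01", "dst-sh02", "dst-sh03"]

print(shards_without_chunks(chunks, all_shards))  # ['dst-sh03']
```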
The delete command appears in the following mongos log line:
{"t":{"$date":"2023-07-17T20:36:38.986+00:00"},"s":"D4", "c":"ASIO", "id":22596, "ctx":"conn259","msg":"startCommand","attr":{"request":"RemoteCommand 12610 -- target:[localhost:28028] db:testDB expDate:2023-07-17T20:41:38.990+00:00 cmd:{ delete: \"testColl2\", bypassDocumentValidation: false, ordered: false, collectionUUID: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), deletes: [ { q: { $and: [ { $expr: { $gte: [ \"$_id\", { $literal: 0 } ] } }, { $expr: { $lt: [ \"$_id\", { $literal: 99 } ] } } ] }, limit: 0, hint: { _id: 1 } } ], shardVersion: { e: ObjectId('00000000ffffffffffffffff'), t: Timestamp(4294967295, 4294967295), v: Timestamp(0, 0) }, writeConcern: { w: \"majority\", j: true, wtimeout: 120000 }, lsid: { id: UUID(\"6e1d23ea-ac74-484f-ab44-341b08fcbfac\"), uid: BinData(0, FFEE4C8F085C89ED4D839F06F64A1B5513D7CA6A5BC5F3EE7052E1385F4965D3) } }"}}
This command is sent to all three shards, including dst-sh03 at localhost:28028, which doesn't have any testColl2 chunks.
The commands that are sent to dst-sh01 and dst-sh02 return without any errors.
The command sent to dst-sh03 returns with a CollectionUUIDMismatch error:
{"t":{"$date":"2023-07-17T20:36:38.987+00:00"},"s":"D2", "c":"ASIO", "id":22597, "ctx":"conn259","msg":"Request finished with response","attr":{"requestId":12610,"isOK":true,"response":"{ n: 0, electionId: ObjectId('7fffffff0000000000000002'), opTime: { ts: Timestamp(1689626198, 76), t: 2 }, writeErrors: [ { index: 0, code: 361, errmsg: \"Collection UUID does not match that specified\", db: \"testDB\", collectionUUID: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), expectedCollection: \"testColl2\", actualCollection: null } ], ok: 1.0, $clusterTime: { clusterTime: Timestamp(1689626198, 134), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1689626198, 120), $topologyTime: Timestamp(1689626063, 2), operationTime: Timestamp(1689626198, 76) }"}}
The expectedCollection is "testColl2" and the actualCollection is null.
However, when this response gets to Mongosync the error becomes
{\"index\": {\"$numberInt\":\"0\"},\"code\": {\"$numberInt\":\"361\"},\"errmsg\": \"Collection UUID does not match that specified\",\"db\": \"testDB\",\"collectionUUID\": {\"$binary\":{\"base64\":\"6Xpr0UmNTb+Ed9dxkPt0Sw==\",\"subType\":\"04\"}},\"expectedCollection\": \"testColl2\",\"actualCollection\": \"testColl2\"}"}
where expectedCollection and actualCollection are both "testColl2".
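The difference between the two error shapes above can be made explicit with a minimal sketch. The function `classify_write_error` and its category labels are hypothetical names chosen for this illustration. Only the error code 361 and the `expectedCollection` / `actualCollection` fields come from the responses quoted above.

```python
COLLECTION_UUID_MISMATCH = 361  # code shown in the writeErrors entries above


def classify_write_error(err: dict) -> str:
    """Classify a write error by its expected/actual collection fields (hypothetical)."""
    if err.get("code") != COLLECTION_UUID_MISMATCH:
        return "other"
    expected = err.get("expectedCollection")
    actual = err.get("actualCollection")
    if actual is None:
        # The responding node has no collection with this UUID.
        return "uuid-not-found"
    if actual == expected:
        # The names agree, yet a mismatch is reported: the client has nothing to correct.
        return "contradictory"
    return "renamed"


# Error as returned by dst-sh03 to mongos.
shard_error = {"code": 361, "expectedCollection": "testColl2", "actualCollection": None}
# Error as received by Mongosync.
client_error = {"code": 361, "expectedCollection": "testColl2", "actualCollection": "testColl2"}

print(classify_write_error(shard_error))   # uuid-not-found
print(classify_write_error(client_error))  # contradictory
```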
Mongosync was not expecting to receive a CollectionUUIDMismatch error at all since testColl2 exists.
Since the returned actual collection name is the same as the expected collection name, the delete command is retried without changing the expected collection name. This results in the same CollectionUUIDMismatch error. In this test, Mongosync retries 5 times before giving up and erroring out.
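The retry behavior described above can be sketched as follows. This is not Mongosync's implementation: `delete_with_retries`, the `send` callback, and the attempt limit of 5 are assumptions made for illustration, and how Mongosync counts its attempts internally is not stated in this ticket. The sketch shows why a retry cannot make progress when the reported actual name equals the expected name.

```python
def delete_with_retries(send, expected_name, max_attempts=5):
    """Retry a delete, adopting the reported actual collection name when it differs.

    `send(name)` is a hypothetical callback returning None on success or a
    write-error dict on failure. Returns the number of attempts used.
    """
    name = expected_name
    for attempt in range(1, max_attempts + 1):
        err = send(name)
        if err is None:
            return attempt
        actual = err.get("actualCollection")
        if actual is not None and actual != name:
            name = actual  # the collection was renamed; retry under the new name
        # Otherwise the name is unchanged and the next attempt is identical.
    raise RuntimeError(f"delete on {name!r} failed after {max_attempts} attempts")


calls = []


def always_contradictory(name):
    """Simulated cluster that keeps reporting the error shape seen by Mongosync."""
    calls.append(name)
    return {"code": 361, "expectedCollection": name, "actualCollection": name}


try:
    delete_with_retries(always_contradictory, "testColl2")
except RuntimeError as exc:
    print(exc)
print(calls)  # the same name is sent on every attempt
```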
This might be linked to SERVER-76624. Resolving it is important for Mongosync.
- is caused by
-
SERVER-63285 Ensure CollectionUUIDMismatch error from write commands does not omit the actual collection even if unsharded
- Closed
- related to
-
SERVER-76624 Server can falsely report CollectionUUIDMismatch
- Closed
-
SERVER-89361 Wrong number of documents reported deleted when using batched deletes in 6.0
- Closed