- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Fully Compatible
- v6.0, v5.0
- Sharding EMEA 2022-07-25, Sharding EMEA 2022-08-08
- (copied to CRM)
- 2
To ensure the correctness of renameCollection for sharded collections (supported since v5.0), we introduced logic in the rename coordinator and participant to make sure UUIDs are aligned across all shards.
If a catalog inconsistency is detected (namely, different UUIDs for the source or target collection on different shards), the rename operation hangs and floods the logs with a message intended to prompt the user to intervene manually.
This is an example of the error emitted in the logs:
{"t":{"$date":"2022-05-21T00:37:40.719Z"},"s":"E","c":"SHARDING","id":6372200,"ctx":"RenameCollectionParticipantService-223","msg":"Error executing rename collection participant. Going to be retried.","attr":{"fromNs":"foo.sourceColl","toNs":"foo.TargetColl","error":"CommandFailed: Source Collection foo.sourceColl UUID does not match provided uuid."}}
Several users have hit this error and were left with a stuck collection, not knowing how to fix the catalog inconsistency. The purpose of this ticket is to prevent the rename from ending up in this situation.
One possible approach is to broadcast a message to all shards in the checkPreconditions phase so that the operation fails early if an inconsistency is detected (e.g. by calling listCollections filtered by ns on all shards).
This would not fully prevent the hang, because after the preconditions are checked and before the participants are instantiated, a direct client could still create the source or target collection with a different UUID on another shard. However, the window for this bad interleaving would be short enough to prevent an estimated 99% of the hangs.
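The early check proposed above could look roughly like the following. This is only an illustrative sketch in Python, not the server implementation: the names `find_uuid_mismatches` and `check_rename_preconditions` are hypothetical, and the per-shard UUID maps are assumed to have been built beforehand from a listCollections call on each shard (using `None` when the shard does not have the collection). It also assumes that a shard without a local copy of the collection is not treated as inconsistent.

```python
from typing import Dict, List, Optional
from uuid import UUID


def find_uuid_mismatches(
    expected: Optional[UUID],
    uuid_by_shard: Dict[str, Optional[UUID]],
) -> List[str]:
    """Return the shards whose local collection UUID conflicts with `expected`.

    Assumption: a shard that does not have the collection (None) is not a
    conflict. If `expected` is None (collection should not exist), any shard
    that does have the collection is reported.
    """
    return sorted(
        shard
        for shard, local in uuid_by_shard.items()
        if local is not None and local != expected
    )


def check_rename_preconditions(
    source_expected: Optional[UUID],
    source_by_shard: Dict[str, Optional[UUID]],
    target_expected: Optional[UUID],
    target_by_shard: Dict[str, Optional[UUID]],
) -> None:
    """Fail early instead of letting the rename hang on a UUID mismatch."""
    bad_source = find_uuid_mismatches(source_expected, source_by_shard)
    bad_target = find_uuid_mismatches(target_expected, target_by_shard)
    if bad_source or bad_target:
        raise ValueError(
            "Catalog inconsistency detected before rename: "
            f"source mismatches on {bad_source}, target mismatches on {bad_target}"
        )
```

Raising before any participant is instantiated turns the retry loop into an immediate, actionable error. As noted above, a collection created directly on a shard after this check would still go undetected.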