- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: 9.0 Required, 8.0.0, 8.0.1, 8.0.2
- Component/s: Sharding
- None
- Catalog and Routing
- ALL
- CAR Team 2024-10-28, CAR Team 2024-11-11, CAR Team 2024-11-25
- 2
Usually, cluster DDL operations perform FCV checks at the beginning of command execution to determine whether a different code path is required for newer versions. Create collection is an example: depending on the enabled feature flag and the version, we either launch a coordinator or not (this was added as part of SERVER-81190). The idea is generally to perform these FCV checks before holding any DDL locks.
SERVER-81960 added an FCV region to the configsvrReshardCollection command to support the new moveCollection operation. The current design of resharding requires the orchestration to happen on the config server, but only after the primary db shard holds the necessary DDL lock to serialize with other cluster-level DDL. So a resharding operation first goes to the primary db shard, acquires the DDL lock for the collection, and then goes to the config server.
This is usually fine, but with config shards, all the resharding coordinators might end up instantiated on the same shard, and if a setFeatureCompatibilityVersion command sneaks in at the right time, we might end up with the following interleaving:
t1: reshardCollection acquires a DDL lock when creating the db primary shard coordinator
t2: createCollection instantiates an FCV region
t2: tries to acquire the DDL lock, but ends up waiting for t1
t3: setFeatureCompatibilityVersion tries to acquire an exclusive lock, but ends up waiting for t2
t1: in a remote request to itself, configsvrReshardCollection tries to instantiate an FCV region, but its lock request ends up enqueued behind t3
This causes a three-way deadlock. For a customer, all DDL operations on the database and collection being moved, as well as setFeatureCompatibilityVersion commands, block for 5 minutes until t2 fails to acquire the DDLLock with LockBusy. That failure destroys t2's FCVRegion, allowing t3 to finish and then t1. Any other operation trying to get an FCVRegion (such as timeseries batch writes) also blocks until the cluster returns to normal.
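The interleaving can be sketched as a wait-for graph. The snippet below is only an illustrative Python model, not server code: the `WAITS_FOR` mapping and the `find_cycle` helper are hypothetical names introduced here, and each edge restates one "ends up waiting for" step from the description.

```python
from typing import Dict, List, Optional

# Hypothetical model of the interleaving: each operation maps to the
# operation it is blocked on.
WAITS_FOR: Dict[str, str] = {
    # t1 holds the DDL lock; its FCV region request is queued behind t3.
    "t1 reshardCollection": "t3 setFeatureCompatibilityVersion",
    # t2 holds an FCV region; it waits for the DDL lock held by t1.
    "t2 createCollection": "t1 reshardCollection",
    # t3 wants the exclusive lock; it is blocked by t2's FCV region.
    "t3 setFeatureCompatibilityVersion": "t2 createCollection",
}


def find_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from `start`; return the cycle if one is reached."""
    seen: List[str] = []
    node: Optional[str] = start
    while node is not None and node not in seen:
        seen.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended at an operation that is not blocked
    return seen[seen.index(node):]


cycle = find_cycle(WAITS_FOR, "t1 reshardCollection")

# Model the LockBusy timeout: t2 gives up, so it no longer waits on anything.
after_timeout = {op: dep for op, dep in WAITS_FOR.items()
                 if op != "t2 createCollection"}
resolved = find_cycle(after_timeout, "t1 reshardCollection")
```

In this model `cycle` contains all three operations, and dropping t2's edge (the LockBusy failure) leaves no cycle, which matches the description of t3 and then t1 being able to finish.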
One way to solve this is to move the FCV check out of the configsvrReshardCollection command and perform it in the _shardsvrReshardCollection command, as all other cluster DDL operations do. Another way is to think really hard about how to avoid holding the FCVRegion in the create command while the create is running, but we could still have a potentially dangerous situation if we leave the FCV region in the configsvrReshardCollection command. A reproduction of this is attached.
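A rough way to see why the first option helps is to compare the order in which each command takes the FCV region and the DDL lock. The Python sketch below is a hypothetical model only: the `CURRENT_ORDER` and `PROPOSED_ORDER` tables and both helper functions are made up for illustration, and it assumes that after the change resharding would take its FCV region before the DDL lock, as createCollection does.

```python
from typing import Dict, List, Set, Tuple


def lock_order_edges(orders: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Collect (held, wanted) pairs from each command's acquisition order."""
    edges: Set[Tuple[str, str]] = set()
    for locks in orders.values():
        for i, held in enumerate(locks):
            for wanted in locks[i + 1:]:
                edges.add((held, wanted))
    return edges


def has_inversion(orders: Dict[str, List[str]]) -> bool:
    """True if two commands take the same pair of locks in opposite order."""
    edges = lock_order_edges(orders)
    return any((wanted, held) in edges for (held, wanted) in edges)


# Order described in this ticket: resharding takes the DDL lock first and
# only then asks for an FCV region on the config server.
CURRENT_ORDER = {
    "reshardCollection": ["DDL lock", "FCV region"],
    "createCollection": ["FCV region", "DDL lock"],
}

# Assumed order if the FCV check moves to _shardsvrReshardCollection.
PROPOSED_ORDER = {
    "reshardCollection": ["FCV region", "DDL lock"],
    "createCollection": ["FCV region", "DDL lock"],
}

current_inverted = has_inversion(CURRENT_ORDER)
proposed_inverted = has_inversion(PROPOSED_ORDER)
```

The check only looks for pairwise inversions, which is enough for this case: the current order is inverted between the two commands, while the assumed proposed order is consistent. It does not model the exclusive request from setFeatureCompatibilityVersion, which is what turns the inversion into an actual deadlock.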
- is caused by: SERVER-81960 Separate moveCollection, unshardCollection and reshardCollection on provenance. (Closed)
- is depended on by: SERVER-85646 Add testing coverage for movePrimary during upgrade/downgrade in v8.0 (Blocked)