Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.1.0-rc0, 8.0.5
Affects Version/s: 8.0.0, 8.0.1, 8.0.2, 9.0 Required
Component/s: Sharding
Labels:
None

Assigned Teams:

Catalog and Routing
Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v8.0
Steps To Reproduce:

1. Apply the attached patch in revision
2. Run the test with python buildscripts/resmoke.py run --suites=sharding jstests/sharding/reproduce_setFCV_moveColl_createColl_deadlock.js

Note that this is a deadlock, so the test will not finish for at least 5 minutes, which is when the createCollection thread throws the LockBusy error.

Sprint:
CAR Team 2024-10-28, CAR Team 2024-11-11, CAR Team 2024-11-25, CAR Team 2024-12-23, CAR Team 2025-01-06
Story Points:
2
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

Cluster DDL operations usually perform their FCV checks at the beginning of command execution to determine whether a different code path is required for newer versions. Create collection is an example: depending on the enabled feature flag and the version, we either launch a coordinator or not (this was added as part of ~~SERVER-81190~~). The general idea is to perform these FCV checks before holding any DDL locks.

~~SERVER-81960~~ added an FCV region to the configsvrReshardCollection command to support the new moveCollection operation. The current design of resharding requires the orchestration to happen on the config server, but only after the db primary shard holds the DDL lock needed to serialize with other cluster-level DDL operations. So a resharding operation first goes to the db primary shard, acquires the DDL lock for the collection, and only then goes to the config server.

This is usually fine, but with config shards, all the resharding coordinators might end up instantiated on the same shard, and if a setFeatureCompatibilityVersion command sneaks in at the right time, we might end up with the following interleaving:

t1: reshardCollection acquires a DDL lock when creating the db primary shard coordinator
t2: createCollection instantiates an FCV region
t2: createCollection tries to acquire the DDL lock, but it ends up waiting for t1
t3: setFeatureCompatibilityVersion tries to acquire an exclusive lock, but it ends up waiting for t2
t1: In a remote request to itself, configsvrReshardCollection tries to instantiate an FCV region, but its lock request ends up enqueued behind t3

This causes a 3-way deadlock. From a customer's perspective, all DDL operations on the database and collection being moved, as well as setFeatureCompatibilityVersion commands, will block for 5 minutes until t2 fails to acquire the DDLLock with LockBusy. That failure destroys t2's FCVRegion, allowing t3 to finish and then t1. Any other operation trying to get an FCVRegion (such as timeseries batch writes) would also block until the cluster returns to normal.
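The interleaving can be pictured as a wait-for graph with a cycle. The following is only an illustrative sketch in plain JavaScript, not server code: the thread labels, the `waitsFor` map and the `findCycle` helper are hypothetical names made up for this model, and each thread is assumed to wait on exactly one other thread.

```javascript
// Wait-for graph of the interleaving: key waits for value.
// Labels are illustrative only; they are not server identifiers.
const waitsFor = new Map([
  // t1 wants an FCV region, but its request is queued behind t3's exclusive request.
  ["reshardCollection (t1)", "setFCV (t3)"],
  // t2 wants the DDL lock, which t1 already holds.
  ["createCollection (t2)", "reshardCollection (t1)"],
  // t3 wants the exclusive lock, which is blocked by t2's FCV region.
  ["setFCV (t3)", "createCollection (t2)"],
]);

// Follow the single outgoing edge of each node until we either run out of
// edges (no deadlock) or revisit a node (deadlock). Returns the cycle or null.
function findCycle(graph, start) {
  const path = [];
  const seen = new Set();
  let node = start;
  while (node !== undefined && !seen.has(node)) {
    seen.add(node);
    path.push(node);
    node = graph.get(node);
  }
  if (node === undefined) return null;
  return path.slice(path.indexOf(node));
}

const cycle = findCycle(waitsFor, "reshardCollection (t1)");
console.log(cycle ? `deadlock: ${cycle.join(" -> ")}` : "no deadlock");

// Model the LockBusy timeout: t2 gives up waiting, so its edge disappears
// and the remaining threads can drain.
const afterTimeout = new Map(waitsFor);
afterTimeout.delete("createCollection (t2)");
console.log(findCycle(afterTimeout, "reshardCollection (t1)") === null
  ? "no deadlock after t2 times out"
  : "still deadlocked");
```

In this model the cycle only breaks once t2's edge is removed, which corresponds to the 5 minute wait before LockBusy is thrown.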

One way of solving this is to move the FCV check out of the configsvrReshardCollection command and perform it in the _shardsvrReshardCollection command, as all other cluster DDL operations do. Another way is to think really hard about how to avoid holding the FCVRegion in the create command while the create is running, but that would still leave a potentially dangerous situation if the FCV region stays in the configsvrReshardCollection command. A reproducer is attached.
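The first option amounts to making every DDL operation take its resources in the same order. A rough way to see this, as a simplified model and not the actual fix, is to compare acquisition orders per operation. The resource names and the `hasOrderConflict` helper below are hypothetical, and the model assumes the relocated FCV check happens before the DDL lock is taken, as described for other cluster DDL operations.

```javascript
// Returns true if two operations acquire the same pair of resources in
// opposite orders. With fair lock queueing, such an inversion is what lets
// a pending exclusive request (setFCV) complete a wait cycle.
function hasOrderConflict(acquisitionOrders) {
  const before = new Set();
  for (const order of Object.values(acquisitionOrders)) {
    for (let i = 0; i < order.length; i++) {
      for (let j = i + 1; j < order.length; j++) {
        before.add(`${order[i]}>${order[j]}`);
      }
    }
  }
  for (const pair of before) {
    const [a, b] = pair.split(">");
    if (before.has(`${b}>${a}`)) return true;
  }
  return false;
}

// Order described in this ticket: resharding takes the DDL lock first and
// only then requests an FCV region on the config server.
const currentOrders = {
  reshardCollection: ["DDL lock", "FCV region"],
  createCollection: ["FCV region", "DDL lock"],
};

// Assumed order once the FCV check is done on the shard before the DDL lock.
const proposedOrders = {
  reshardCollection: ["FCV region", "DDL lock"],
  createCollection: ["FCV region", "DDL lock"],
};

console.log("current design has inversion:", hasOrderConflict(currentOrders));
console.log("proposed design has inversion:", hasOrderConflict(proposedOrders));
```

The model ignores how long the FCV region is actually held, so it only illustrates why a consistent ordering removes this particular cycle.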

deadlock_repro.js
6 kB
Oct 04 2024 07:25:04 PM UTC

depends on

SERVER-97468 Fully disable trackUnshardedCollectionsUponCreation in tests

Closed

is caused by

SERVER-81960 Separate moveCollection, unshardCollection and reshardCollection on provenance.

Closed

is depended on by

SERVER-85646 Add testing coverage for movePrimary during upgrade/downgrade in v8.0

In Code Review

Assignee:: Enrico Golfieri
Reporter:: Marcos José Grillo Ramirez
Participants:: Enrico Golfieri, Githook User, Marcos José Grillo Ramirez
Votes:: 0 Vote for this issue
Watchers:: 11 Start watching this issue

Created:: Oct 04 2024 07:25:54 PM UTC
Updated:: Feb 06 2025 04:42:25 PM UTC
Resolved:: Dec 20 2024 04:12:16 PM UTC
Confidence Status Last Update:: 22/Oct/24 7:59 AM
