- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Sharding
- Fully Compatible
- Sharding 2019-01-14, Sharding 2019-01-28, Sharding 2019-02-11
We use UninterruptibleLockGuard in several places for movePrimary. Because the code under these guards acquires strong locks on normal databases and collections, it will be blocked by prepared transactions, causing a deadlock on stepdown or shutdown.
Here's a list of all the occurrences of UninterruptibleLockGuard for movePrimaries.
- src/mongo/db/s/move_primary_source_manager.cpp:251
- src/mongo/db/s/move_primary_source_manager.cpp:352
We will employ a new DatabaseShardingStateLock (similar to the CollectionShardingRuntimeLock) to safeguard concurrent access to the database critical section.
Once leaving the critical section no longer conflicts with prepared transactions, it can run after the prepared transactions yield their locks on stepdown or are killed on shutdown. See SERVER-38162 for the order of events on shutdown and SERVER-38282 for stepdown.
The following function signature changes will allow us to relax the database locking to IX instead of X under the above UninterruptibleLockGuards. Database locks marked "IX (previously X)" have a proposed change. All other database locks remain unchanged.
- enterCriticalSectionCatchUpPhase: Database X Lock, DSSLock X Lock
- enterCriticalSectionCommitPhase: Database X Lock, DSSLock X Lock
- exitCriticalSection: Database IX Lock (previously X), DSSLock X Lock
- getCriticalSectionSignal: Database IS Lock, DSSLock IS Lock
- getDbVersion: Database IS or X Lock (situational), DSS IS Lock
- setDbVersion (when setting dbVersion to a meaningful value): Database X Lock, DSS X Lock
- setDbVersion (when setting dbVersion to boost::none, AKA "clearing" the dbVersion): Database IX Lock (previously X), DSS X Lock
- checkDbVersion: Database IS or X Lock (situational), DSS IS Lock
- getMovePrimarySourceManager: No Database Lock (reflects current usage), DSS IS Lock
- setMovePrimarySourceManager: Database X Lock, DSS X Lock
- clearMovePrimarySourceManager: Database IX Lock (previously X), DSS X Lock
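To illustrate why relaxing X to IX matters, here is a minimal sketch of the standard multi-granularity lock compatibility matrix. It is not the server's lock manager: the `LockMode` enum and `isCompatible` function are hypothetical names for this example, and it assumes a prepared transaction holds an IX lock on the database it has written to.

```cpp
#include <cassert>

// Classic multi-granularity lock modes; the names mirror those in the ticket.
enum class LockMode { IS = 0, IX = 1, S = 2, X = 3 };

// Standard compatibility matrix, indexed as [held][requested].
constexpr bool kCompatible[4][4] = {
    //            IS     IX     S      X
    /* IS */ {true,  true,  true,  false},
    /* IX */ {true,  true,  false, false},
    /* S  */ {true,  false, true,  false},
    /* X  */ {false, false, false, false},
};

// Returns true when `requested` can be granted while another operation
// already holds the same resource in `held` mode.
constexpr bool isCompatible(LockMode held, LockMode requested) {
    return kCompatible[static_cast<int>(held)][static_cast<int>(requested)];
}

// Assumption for this sketch: the prepared transaction holds Database IX.
constexpr LockMode kPreparedTxnDbLock = LockMode::IX;

// A Database X request waits behind the prepared transaction...
static_assert(!isCompatible(kPreparedTxnDbLock, LockMode::X), "X conflicts with IX");
// ...while a Database IX request can be granted alongside it.
static_assert(isCompatible(kPreparedTxnDbLock, LockMode::IX), "IX is compatible with IX");
```

Under that assumption, the operations that move to a Database IX lock no longer wait on prepared transactions, while mutual exclusion for the in-memory state is provided by the DSSLock instead.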
These situations can use relaxed locking as a result:
- On the source node for movePrimary, if we can't commit the moved primary, we check whether the node has stepped down. If it has, we must set the database version to boost::none to indicate that we no longer know the authoritative version. Since setting the database version to boost::none is semantically equivalent to "clearing" the database version, we can now use a Database IX Lock instead of a Database X Lock when doing so. The DSSLock in exclusive mode will prevent concurrent changes to the database version.
- On the source node for movePrimary, we must clear state variables on cleanup. This includes clearing the movePrimarySourceManager and criticalSection variables. We may now use a Database IX Lock instead of a Database X Lock, since the DSSLock in exclusive mode will prevent concurrent changes to the database sharding state's in-memory variables.
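The two situations above can be sketched as follows. This is a simplified model, not the server's implementation: `DatabaseShardingStateModel`, its nested lock types, and `cleanupAfterMovePrimary` are hypothetical names, a `std::shared_mutex` stands in for the DSSLock resource, the database version is reduced to an `int`, and the Database IX lock acquisition is only noted in a comment. Each accessor takes a lock object as a parameter so that callers cannot touch the state without holding the DSS lock in the required mode.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <shared_mutex>

// Hypothetical model of the database sharding state; not MongoDB's real class.
class DatabaseShardingStateModel {
    mutable std::shared_mutex _mutex;  // stands in for the DSSLock resource
    std::optional<int> _dbVersion;     // simplified database version
    bool _inCriticalSection = false;
    bool _hasSourceManager = false;

public:
    // Shared (IS-like) access to the in-memory state.
    class SharedLock {
        std::shared_lock<std::shared_mutex> _lk;
    public:
        explicit SharedLock(const DatabaseShardingStateModel& dss) : _lk(dss._mutex) {}
    };

    // Exclusive (X-like) access to the in-memory state.
    class ExclusiveLock {
        std::unique_lock<std::shared_mutex> _lk;
    public:
        explicit ExclusiveLock(DatabaseShardingStateModel& dss) : _lk(dss._mutex) {}
    };

    // Readers must present a shared lock.
    std::optional<int> getDbVersion(const SharedLock&) const { return _dbVersion; }
    bool inCriticalSection(const SharedLock&) const { return _inCriticalSection; }
    bool hasSourceManager(const SharedLock&) const { return _hasSourceManager; }

    // Writers must present an exclusive lock.
    void setDbVersion(std::optional<int> v, const ExclusiveLock&) { _dbVersion = v; }
    void enterCriticalSection(const ExclusiveLock&) { _inCriticalSection = true; }
    void exitCriticalSection(const ExclusiveLock&) { _inCriticalSection = false; }
    void setSourceManager(const ExclusiveLock&) { _hasSourceManager = true; }
    void clearSourceManager(const ExclusiveLock&) { _hasSourceManager = false; }
};

// Cleanup path on the source node. In the real server a Database IX lock
// would be acquired first; that step is omitted in this model.
void cleanupAfterMovePrimary(DatabaseShardingStateModel& dss, bool steppedDown) {
    DatabaseShardingStateModel::ExclusiveLock lk(dss);
    if (steppedDown) {
        // "Clearing" the version: the authoritative value is no longer known.
        dss.setDbVersion(std::nullopt, lk);
    }
    dss.clearSourceManager(lk);
    dss.exitCriticalSection(lk);
}
```

Because every mutation happens under the exclusive lock object, concurrent readers holding a `SharedLock` never observe a partially cleared state, which is the property the ticket relies on when it drops the Database X Lock.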
- related to: SERVER-33577 Remove UninterruptibleLockGuards in sharding code to allow interruptible lock acquisition (Closed)