The OperationContext for ShardingCatalogManager::renameShardedMetadata() has a logical session checked out while doing an uninterruptible wait on the _kChunkOpLock. If the _kChunkOpLock is currently held (e.g. by a running _configsvrSetAllowMigrations command), then _configsvrRenameCollectionMetadata will block until the _kChunkOpLock is released. In particular, the _configsvrSetAllowMigrations command will acquire the _kChunkOpLock and then attempt to acquire additional LockManager locks such as the RSTL IX lock. If a stepdown occurs on the primary, then the RstlKillOpThread will interrupt the OperationContext running ShardingCatalogManager::renameShardedMetadata(). But because the wait is uninterruptible, the kill status is ignored. ReplicationCoordinatorImpl::_stepDownFinish() will then block while holding the RSTL X lock, because invalidateSessionsForStepdown() tries to check out the logical session in order to kill it. The result is a three-way deadlock:
- _configsvrRenameCollectionMetadata (holding "logical session" resource) -> _kChunkOpLock
- _configsvrSetAllowMigrations (holding _kChunkOpLock) -> RSTL IX lock
- Stepdown (holding RSTL X lock) -> acquiring "logical session" resource
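The cycle listed above can be checked mechanically by treating it as a wait-for graph. The following is a minimal sketch, not MongoDB code: the `Waiter` struct, `findCycle()`, and `ticketScenario()` are hypothetical names introduced for illustration. It assumes each resource has exactly one exclusive holder, and it collapses the RSTL IX and X modes into a single "RSTL" resource, since the two modes conflict with each other.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: each waiter holds one resource and waits for another.
struct Waiter {
    std::string name;
    std::string holds;
    std::string waitsFor;
};

// Follows "waits for" -> "held by" edges from each waiter.
// Returns the waiter names along the first cycle found, or an empty vector.
std::vector<std::string> findCycle(const std::vector<Waiter>& waiters) {
    std::map<std::string, const Waiter*> holderOf;
    for (const auto& w : waiters) {
        // Model assumption: every resource has a single exclusive holder.
        assert(holderOf.count(w.holds) == 0);
        holderOf[w.holds] = &w;
    }

    for (const auto& start : waiters) {
        std::vector<std::string> path;
        const Waiter* cur = &start;
        // A cycle through `start` has at most waiters.size() edges.
        for (std::size_t step = 0; step < waiters.size(); ++step) {
            path.push_back(cur->name);
            auto it = holderOf.find(cur->waitsFor);
            if (it == holderOf.end()) {
                break;  // The awaited resource is free, so no cycle on this path.
            }
            cur = it->second;
            if (cur == &start) {
                return path;  // Walked back to the starting waiter: deadlock.
            }
        }
    }
    return {};
}

// The three participants described in this ticket.
std::vector<Waiter> ticketScenario() {
    return {
        {"_configsvrRenameCollectionMetadata", "logical session", "_kChunkOpLock"},
        {"_configsvrSetAllowMigrations", "_kChunkOpLock", "RSTL"},
        {"Stepdown", "RSTL", "logical session"},
    };
}
```

Calling `findCycle(ticketScenario())` returns all three participants. Dropping the stepdown entry leaves no cycle, which matches the observation that the rename only hangs permanently once a stepdown is involved.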
I think the solution here would be to make the _kChunkOpLock and _kZoneOpLock acquisitions interruptible by using the 3-argument constructor for Lock::ExclusiveLock.
Lock::ExclusiveLock chunkLk(opCtx, opCtx->lockState(), _kChunkOpLock);
Lock::ExclusiveLock zoneLk(opCtx, opCtx->lockState(), _kZoneOpLock);
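To illustrate why an interruptible acquisition breaks the cycle, here is a generic sketch of a lock wait that re-checks a kill flag between attempts. It does not reproduce the Lock::ExclusiveLock implementation: `OpState`, `ModelLock`, and `acquireInterruptible()` are hypothetical stand-ins, and the polling loop is only one simple way to model an interruptible wait.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the kill status carried by an operation.
struct OpState {
    std::atomic<bool> killed{false};
};

// Hypothetical exclusive lock modeled as a single atomic flag.
struct ModelLock {
    std::atomic<bool> held{false};

    // Takes the lock if it is free; returns false if someone else holds it.
    bool tryAcquire() {
        bool expected = false;
        return held.compare_exchange_strong(expected, true);
    }

    void release() {
        assert(held.load());  // Releasing a lock that is not held is a bug.
        held.store(false);
    }
};

enum class AcquireResult { kAcquired, kInterrupted };

// Interruptible wait: the kill status is checked before every attempt, so a
// killed operation gives up instead of waiting for the lock indefinitely.
AcquireResult acquireInterruptible(ModelLock& lock, const OpState& op) {
    while (true) {
        if (op.killed.load()) {
            return AcquireResult::kInterrupted;
        }
        if (lock.tryAcquire()) {
            return AcquireResult::kAcquired;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```

In this model, a waiter whose `killed` flag is set returns `kInterrupted` even while the lock is still held by another operation. The caller can then unwind and release whatever it holds (in the ticket's scenario, the checked-out logical session), which lets the stepdown make progress.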
is related to:
- SERVER-59226 Deadlock when stepping down with a profile session marked as uninterruptible (Closed)
- SERVER-52564 Deadlock between step down and MongoDOperationContextSession (Closed)
- SERVER-55007 Deadlock between step down and MongoDOperationContextSession (Closed)
- SERVER-55573 Deadlock between stepdown and chunk migration (Closed)
- SERVER-58364 ShardServerCatalogCacheLoader::waitForCollectionFlush should be interruptible (Closed)
- SERVER-58775 Mark ConfigsvrSetAllowMigrationsCommand's opCtx as killable on stepdown (Closed)
- SERVER-59329 Make sure that withTemporaryOperationContext throw an error if the node is no longer a primary (Closed)
- SERVER-60521 Deadlock on stepup due to moveChunk command running uninterrupted on secondary (Closed)
- SERVER-60958 Avoid server hang in chunk migration when step-down event occurs (Closed)
- SERVER-70873 Stepdown during drop collection can lead to a deadlock (Closed)
- SERVER-70888 ScopedRangeDeleterLock might lead to a deadlock on stepdown (Closed)
- SERVER-76273 SessionCatalogMigrationDestination is not interruptible on stepdown (Closed)
related to:
- SERVER-70003 Alternative client for deleting range deletion documents must be interruptible on stepdown (Closed)
- SERVER-70127 Default system operations to be killable by stepdown (Closed)