- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Sharding
- Fully Compatible
- ALL
- v4.4, v4.2
- Sharding 2020-11-30, Sharding 2020-12-14, Sharding 2020-12-28, Sharding 2021-01-11, Sharding 2021-01-25, Sharding 2021-02-08
- (copied to CRM)
There is a deadlock between the thread running the stepdown process and the session catalog migration producer. More concretely:
1. The thread running invalidateSessionsForStepdown holds the RSTL lock and is blocked on a condition variable, waiting for the session to be checked out.
2. The session catalog migration thread is blocked here, waiting for the lock held by [1]. It will never acquire it, yet it is also the thread that is supposed to check out the session and notify [1].
The thread holding the RSTL lock on version 4.4 might have a stacktrace like the following:
#0  0x00007f1e44d01c3d in poll () from /lib64/libc.so.6
#1  0x000056130ba24f87 in mongo::transport::TransportLayerASIO::BatonASIO::run(mongo::ClockSource*) ()
#2  0x000056130ba0623d in mongo::transport::TransportLayerASIO::BatonASIO::run_until(mongo::ClockSource*, mongo::Date_t) ()
#3  0x000056130bef5821 in mongo::ClockSource::waitForConditionUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t, mongo::Waitable*) ()
#4  0x000056130beeacd0 in mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t) ()
#5  0x000056130bea0795 in _ZZN5mongo13Interruptible32waitForConditionOrInterruptUntilISt11unique_lockINS_12latch_detail5LatchEEZNS_28CondVarLockGrantNotification4waitEPNS_16OperationContextENS_8DurationISt5ratioILl1ELl1000EEEEEUlvE_EEbRNS_4stdx18condition_variableERT_NS_6Date_tET0_PNS_10AtomicWordIlEEENKUlSJ_NS0_9WakeSpeedEE1_clESJ_SO_ ()
#6  0x000056130bea0daf in mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> >) ()
#7  0x000056130bea29c6 in mongo::LockerImpl::_lockComplete(mongo::OperationContext*, mongo::ResourceId, mongo::LockMode, mongo::Date_t) ()
#8  0x000056130beab773 in mongo::repl::ReplicationStateTransitionLockGuard::waitForLockUntil(mongo::Date_t) ()
#9  0x000056130a3269f7 in mongo::repl::ReplicationCoordinatorImpl::AutoGetRstlForStepUpStepDown::AutoGetRstlForStepUpStepDown(mongo::repl::ReplicationCoordinatorImpl*, mongo::OperationContext*, mongo::repl::ReplicationCoordinator::OpsKillingStateTransitionEnum, mongo::Date_t) ()
#10 0x000056130a34bee9 in mongo::repl::ReplicationCoordinatorImpl::_stepDownFinish(mongo::executor::TaskExecutor::CallbackArgs const&, mongo::executor::TaskExecutor::EventHandle const&) ()
...
The other thread's stack trace may vary depending on the operation; however, there will be a chunk migration thread in the session migration step (most likely in the SessionCatalogMigrationDestination class).
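For illustration only, the following is a minimal, self-contained sketch of the interlock described in [1] and [2], using std::timed_mutex and std::condition_variable as stand-ins for the RSTL and the session checkout. All names (rstl, sessionCheckedOut, stepdownThread, migrationThread) are hypothetical and are not MongoDB identifiers; the timeouts exist only so the demo terminates and reports the stall instead of hanging forever, which is what the real server threads do.
{code:cpp}
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Stand-ins for the real primitives (hypothetical names, not MongoDB code).
std::timed_mutex rstl;              // plays the role of the RSTL
std::mutex sessionCatalogMutex;     // protects the fake "session catalog"
std::condition_variable sessionCv;
bool sessionCheckedOut = false;

// Thread [1]: stepdown. Acquires the RSTL first, then blocks on a condition
// variable until the session has been checked out by somebody else.
void stepdownThread() {
    std::lock_guard<std::timed_mutex> rstlGuard(rstl);          // RSTL held
    std::unique_lock<std::mutex> lk(sessionCatalogMutex);
    bool signalled = sessionCv.wait_for(lk, std::chrono::seconds(3),
                                        [] { return sessionCheckedOut; });
    if (!signalled)
        std::cout << "[1] stepdown: waiting for session checkout while "
                     "holding the RSTL -- never woken\n";
}

// Thread [2]: session migration. Must take the RSTL-ordered lock first, yet
// it is the only thread that would ever check out the session and wake [1].
void migrationThread() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100)); // let [1] go first
    if (!rstl.try_lock_for(std::chrono::seconds(1))) {
        std::cout << "[2] migration: blocked on the lock held by [1]; the "
                     "checkout/notify below is never reached\n";
        return;
    }
    {
        std::lock_guard<std::mutex> lk(sessionCatalogMutex);
        sessionCheckedOut = true;   // in the real deadlock this never happens
    }
    sessionCv.notify_all();
    rstl.unlock();
}

int main() {
    std::thread t1(stepdownThread);
    std::thread t2(migrationThread);
    t1.join();
    t2.join();
    return 0;
}
{code}
Each thread ends up waiting on something only the other thread can release, which is the shape of the hang shown in the stack traces above.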
- causes: SERVER-57756 Race between concurrent stepdowns and applying transaction oplog entry (Closed)
- related to: SERVER-55007 Deadlock between step down and MongoDOperationContextSession (Closed)
- related to: SERVER-60161 Deadlock between config server stepdown and _configsvrRenameCollectionMetadata command (Closed)
- related to: SERVER-57167 Prevent throwing on session creation due to stepdown before stepdown completes (Closed)