Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Replication
None
Fully Compatible
ALL
v3.4
Repl 2017-05-08, Repl 2017-05-29, Repl 2017-07-10
15
While the "rsBackgroundSync" thread is changing the member state to ROLLBACK, a thread running work on the ReplicationExecutor may need to acquire a lock. Holding a LockManager lock while waiting on a condition variable that sits outside the lock hierarchy seems prone to deadlock. For example, in the GDB output below, thread #39 holds the Global lock in MODE_X and is waiting for its task that sets the follower mode to MemberState::RS_ROLLBACK to run on the ReplicationExecutor. The ReplicationExecutor is currently processing a vote in thread #13 (writing the last-vote document in ReplicationCoordinatorImpl::_writeLastVoteForMyElection), which is waiting for the storage engine to make that write durable. The durability thread (#6) is waiting to acquire the MMAPv1 flush lock, which thread #39 implicitly holds as part of acquiring the Global lock. Each of the three threads is waiting on another, so none of them can make progress.
Thread 39 (Thread 0x7fc1e03f0700 (LWP 20506)):
#0 0x00007fc27f0f5404 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fc2826fba7c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#2 0x00007fc28170ea8b in mongo::repl::ReplicationExecutor::Event::waitUntilSignaled() ()
#3 0x00007fc2816f0e7d in mongo::repl::ReplicationCoordinatorImpl::setFollowerMode(mongo::repl::MemberState const&) ()
#4 0x00007fc281735ef8 in mongo::repl::rollback(mongo::OperationContext*, mongo::repl::OplogInterface const&, mongo::repl::RollbackSource const&, int, mongo::repl::ReplicationCoordinator*, mongo::repl::StorageInterface*, std::function<void (int)>) ()
#5 0x00007fc2816037c2 in mongo::repl::BackgroundSync::_runRollback(mongo::OperationContext*, mongo::Status const&, mongo::HostAndPort const&, int, mongo::repl::StorageInterface*) ()
#6 0x00007fc281605b0e in mongo::repl::BackgroundSync::_produce(mongo::OperationContext*) ()
#7 0x00007fc28160661a in mongo::repl::BackgroundSync::_runProducer() ()
#8 0x00007fc28160679a in mongo::repl::BackgroundSync::_run() ()
#9 0x00007fc2826fe690 in execute_native_thread_routine ()
#10 0x00007fc27f0f1184 in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007fc27ee1ebed in clone () from target:/lib/x86_64-linux-gnu/libc.so.6
...

Thread 13 (Thread 0x7fc1ed615700 (LWP 20473)):
#0 0x00007fc27f0f5404 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fc2826fba7c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
#2 0x00007fc2818d2cab in mongo::CommitNotifier::awaitBeyondNow() ()
#3 0x00007fc2818d6a40 in mongo::dur::(anonymous namespace)::DurableImpl::waitUntilDurable() ()
#4 0x00007fc2816d57e0 in mongo::repl::ReplicationCoordinatorExternalStateImpl::storeLocalLastVoteDocument(mongo::OperationContext*, mongo::repl::LastVote const&) ()
#5 0x00007fc2816ff04b in mongo::repl::ReplicationCoordinatorImpl::_writeLastVoteForMyElection(mongo::repl::LastVote, mongo::executor::TaskExecutor::CallbackArgs const&) ()
#6 0x00007fc28170f840 in mongo::repl::ReplicationExecutor::_doOperation(mongo::OperationContext*, mongo::Status const&, mongo::executor::TaskExecutor::CallbackHandle const&, std::__cxx11::list<mongo::repl::ReplicationExecutor::WorkItem, std::allocator<mongo::repl::ReplicationExecutor::WorkItem> >*, std::mutex*) ()
#7 0x00007fc28170e0ed in mongo::repl::(anonymous namespace)::callNoExcept(std::function<void ()> const&) ()
#8 0x00007fc281715a30 in std::_Function_handler<mongo::repl::TaskRunner::NextAction (mongo::OperationContext*, mongo::Status const&), mongo::repl::ReplicationExecutor::scheduleDBWork(std::function<void (mongo::executor::TaskExecutor::CallbackArgs const&)> const&, mongo::NamespaceString const&, mongo::LockMode)::{lambda(mongo::OperationContext*, mongo::Status const&)#1}>::_M_invoke(std::_Any_data const&, mongo::OperationContext*&&, mongo::Status const&) ()
#9 0x00007fc28175d349 in mongo::repl::(anonymous namespace)::runSingleTask(std::function<mongo::repl::TaskRunner::NextAction (mongo::OperationContext*, mongo::Status const&)> const&, mongo::OperationContext*, mongo::Status const&) [clone .constprop.72] ()
#10 0x00007fc28175e46f in mongo::repl::TaskRunner::_runTasks() ()
#11 0x00007fc281bf38ec in mongo::ThreadPool::_doOneTask(std::unique_lock<std::mutex>*) ()
#12 0x00007fc281bf439c in mongo::ThreadPool::_consumeTasks() ()
#13 0x00007fc281bf4d56 in mongo::ThreadPool::_workerThreadBody(mongo::ThreadPool*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#14 0x00007fc2826fe690 in execute_native_thread_routine ()
#15 0x00007fc27f0f1184 in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007fc27ee1ebed in clone () from target:/lib/x86_64-linux-gnu/libc.so.6
...

Thread 6 (Thread 0x7fc27cd1c700 (LWP 20466)):
#0 0x00007fc27f0f57be in pthread_cond_timedwait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007fc281225fb8 in mongo::CondVarLockGrantNotification::wait(unsigned int) ()
#2 0x00007fc28122a6be in mongo::LockerImpl<true>::lockComplete(mongo::ResourceId, mongo::LockMode, unsigned int, bool) ()
#3 0x00007fc2812261d6 in mongo::AutoAcquireFlushLockForMMAPV1Commit::AutoAcquireFlushLockForMMAPV1Commit(mongo::Locker*) ()
#4 0x00007fc2818d7f1f in mongo::dur::durThread(mongo::ClockSource*, long) ()
#5 0x00007fc2826fe690 in execute_native_thread_routine ()
#6 0x00007fc27f0f1184 in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#7 0x00007fc27ee1ebed in clone () from target:/lib/x86_64-linux-gnu/libc.so.6
Thank you to benety.goh for helping me with the GDB output.
Is related to:
- SERVER-27154 replSetRequestVotes command should wait for durability (Closed)
- SERVER-23908 MMAPv1 DurableImpl::waitUntilDurable should yield the flush lock (Closed)
- SERVER-27282 Clean up and fix bugs in RS rollback error handling (Closed)