Type: Bug
Resolution: Duplicate
Priority: Major - P3
None
Affects Version/s: 3.6.1
Component/s: Replication, Stability
Replication
ALL
Note: this deadlock is similar to SERVER-28688, but it is a different one.
Note: I observed this deadlock in 3.6.1.
ReplicationCoordinatorExternalStateImpl::shutdown calls _taskExecutor->join() while holding _threadMutex. In most cases the worker threads have no tasks and _taskExecutor->join() returns immediately. In rare situations, however, DropPendingCollectionReaper has collections to drop, and while those tasks run the signal processing thread keeps _threadMutex locked. If the replication logic decides to step down at this moment, we have a deadlock: ReplicationCoordinatorExternalStateImpl::startProducerIfStopped tries to acquire _threadMutex while holding the global exclusive lock, and once it starts waiting for _threadMutex, the drop collection tasks are blocked on the global lock, so _taskExecutor->join() never returns.
The attached file contains the output of the mongodb-waitsfor-graph, mongodb-show-locks, and mongodb-uniqstack commands. In this file:
- thread 2 (signalProcessingThread) owns the _threadMutex lock (acquired in ReplicationCoordinatorExternalStateImpl::shutdown)
  and waits for the worker threads to shut down (_taskExecutor->shutdown(); _taskExecutor->join())
- thread 47 ("replexec-9") waits for _threadMutex (owned by thread 2)
  it is processing the _stepDownFinish event
  which calls _updateMemberStateFromTopologyCoordinator_inlock
  which calls startProducerIfStopped
  which tries to acquire _threadMutex
- thread 48 (worker thread executing the dropCollection task)
  waits for the global lock owned by thread 47
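The three-thread cycle above can be reproduced in miniature, along with one way to break it. The following is only a sketch under stated assumptions, not the actual MongoDB code or the fix applied in SERVER-36873: ToyExecutor, ToyExternalState and runShutdownScenario are hypothetical names, a plain std::mutex stands in for the global exclusive lock, and a single std::thread stands in for the task executor. The idea it illustrates is that shutdown takes ownership of the executor under _threadMutex but calls join() only after releasing it, so the stepdown path can still acquire _threadMutex.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the task executor: one worker thread, one task.
class ToyExecutor {
public:
    explicit ToyExecutor(std::function<void()> task) : _worker(std::move(task)) {}
    void join() {
        if (_worker.joinable()) _worker.join();
    }

private:
    std::thread _worker;
};

// Hypothetical stand-in for ReplicationCoordinatorExternalStateImpl.
class ToyExternalState {
public:
    void startExecutor(std::function<void()> task) {
        std::lock_guard<std::mutex> lk(_threadMutex);
        assert(!_taskExecutor);
        _taskExecutor = std::make_unique<ToyExecutor>(std::move(task));
    }

    // Take ownership of the executor under the mutex, then join WITHOUT it.
    void shutdown() {
        std::unique_ptr<ToyExecutor> executor;
        {
            std::lock_guard<std::mutex> lk(_threadMutex);
            _inShutdown = true;
            executor = std::move(_taskExecutor);
        }  // _threadMutex is released here, before the blocking join.
        if (executor) executor->join();
    }

    // Stepdown path: needs _threadMutex briefly; refuses to start after shutdown.
    bool startProducerIfStopped() {
        std::lock_guard<std::mutex> lk(_threadMutex);
        return !_inShutdown;
    }

private:
    std::mutex _threadMutex;
    bool _inShutdown = false;
    std::unique_ptr<ToyExecutor> _taskExecutor;
};

// Recreates the interleaving from the ticket; returns true if the drop task ran.
bool runShutdownScenario() {
    std::mutex globalLock;  // stand-in for the global exclusive lock
    ToyExternalState state;
    std::promise<void> globalLockHeld;
    std::atomic<bool> dropFinished{false};

    // "Thread 47": holds the global lock, then asks for _threadMutex.
    std::thread stepdown([&] {
        std::lock_guard<std::mutex> global(globalLock);
        globalLockHeld.set_value();
        // Give shutdown() time to reach join() before we need _threadMutex.
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        state.startProducerIfStopped();
    });
    globalLockHeld.get_future().wait();

    // "Thread 48": the drop-collection task needs the global lock.
    state.startExecutor([&] {
        std::lock_guard<std::mutex> global(globalLock);
        dropFinished = true;
    });

    // "Thread 2": shutdown waits for the worker, but not under _threadMutex.
    state.shutdown();
    stepdown.join();
    return dropFinished.load();
}
```

Calling runShutdownScenario() from a main function should return rather than hang. If shutdown() instead kept the lock_guard alive across executor->join(), the same interleaving would form the cycle described in the list above: shutdown waits for the worker, the worker waits for the global lock, and the stepdown thread waits for _threadMutex.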
- duplicates
  SERVER-36873 ReplicationCoordinatorExternalStateImpl::shutdown() must not hold _threadMutex while waiting for _taskExecutor (Closed)
- related to
  SERVER-28688 Deadlock between shutdown and stepdown (Closed)