Core Server / SERVER-33186

Primary node may deadlock during shutdown

    • Component/s: Replication
    • Affects Version/s: ALL

      Note: this deadlock is similar to SERVER-28688, but it is a distinct issue.
      Note: I observed this deadlock in version 3.6.1.

      ReplicationCoordinatorExternalStateImpl::shutdown calls _taskExecutor->join() while holding _threadMutex. In most cases there are no tasks queued for the worker threads, so _taskExecutor->join() returns immediately. In rare situations, however, DropPendingCollectionReaper still has collections to drop, and while those tasks are running the signal processing thread keeps _threadMutex locked. If the replication logic decides to step down at this moment, we get a deadlock: ReplicationCoordinatorExternalStateImpl::startProducerIfStopped tries to acquire _threadMutex while holding the global exclusive lock, and once startProducerIfStopped starts waiting on _threadMutex, the drop-collection tasks are in turn blocked by the global lock. No thread can make progress.

      The attached file contains the output of the mongodb-waitsfor-graph, mongodb-show-locks, and mongodb-uniqstack commands. In this file:

      • thread 2 (signalProcessingThread) owns _threadMutex (acquired in ReplicationCoordinatorExternalStateImpl::shutdown)
        and waits for the worker threads to shut down (_taskExecutor->shutdown(); _taskExecutor->join())
      • thread 47 ("replexec-9") waits for _threadMutex (owned by thread 2):
        it is processing a _stepDownFinish event,
        which calls _updateMemberStateFromTopologyCoordinator_inlock,
        which calls startProducerIfStopped,
        which tries to acquire _threadMutex
      • thread 48 (a worker thread executing a dropCollection task)
        waits for the global lock owned by thread 47

            Assignee: backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter: igorsol Igor Solodovnikov
            Votes: 0
            Watchers: 9
