Core Server / SERVER-91827

Lock order inversion between ConnectionPool::_mutex, ShardRegistry::_mutex, ReplicationCoordinatorImpl::_mutex, ReplicationCoordinatorExternalStateImpl::_threadMutex, and EgressConnectionCloserManager::_mutex

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Networking & Observability
    • ALL
    • Networking & Obs 2024-10-14, Networking & Obs 2024-10-28, Networking & Obs 2024-11-11, Networking & Obs 2024-11-25, Networking & Obs 2024-12-09

      In particular, there is a cyclic lock ordering ConnPool --> ShardRegistry --> ReplCoordImpl --> ReplCoordExternalStateImpl --> ECM --> ConnPool (ConnPool is ConnectionPool, ECM is EgressConnectionCloserManager), which can arise as follows:

      The ConnPool mutex is taken while holding the ECM mutex via ECM::dropConnections --> ConnectionPool::dropConnections

      The ShardRegistry (SR) mutex is taken while holding the ConnPool mutex via ConnectionPool::get --> ConnectionPool::_get --> STEP::addHost --> ShardRegistry::getConfigServer

      ReplicationCoordinatorImpl::_mutex is taken while holding the SR mutex via ShardingReplicaSetChangeListener::onConfirmedSet --> ShardRegistry::updateReplSetHosts --> ShardRegistry::createShard --> RemoteCommandTargeter::create --> ReplicationCoordinatorImpl::getConfig

      ReplicationCoordinatorExternalStateImpl::_threadMutex is taken while holding ReplicationCoordinatorImpl::_mutex via ReplSetGetStatus --> ReplicationCoordinatorImpl::processReplSetGetStatus --> ReplicationCoordinatorExternalStateImpl::tooStale

      EgressConnectionCloserManager::_mutex is taken while holding ReplicationCoordinatorExternalStateImpl::_threadMutex via ReplicationCoordinatorExternalStateImpl::_stopDataReplication --> ~ThreadPoolTaskExecutor --> ~NetworkInterfaceTL --> ~ConnectionPool --> ECM::remove
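
      The five acquisitions above can be viewed as edges in a "held --> acquired" graph, and the inversion is a cycle in that graph. The following is a minimal standalone sketch, not server code: LockGraph, buildReportedGraph, hasLockOrderCycle, and checkReportedCycle are hypothetical names introduced for illustration, and the mutexes are represented only by their names as strings.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Directed "held --> acquired" edges between mutexes, keyed by the held mutex.
using LockGraph = std::map<std::string, std::vector<std::string>>;

// One edge per acquisition described in this ticket (names only, no real locks).
LockGraph buildReportedGraph() {
    return {
        {"EgressConnectionCloserManager::_mutex", {"ConnectionPool::_mutex"}},
        {"ConnectionPool::_mutex", {"ShardRegistry::_mutex"}},
        {"ShardRegistry::_mutex", {"ReplicationCoordinatorImpl::_mutex"}},
        {"ReplicationCoordinatorImpl::_mutex",
         {"ReplicationCoordinatorExternalStateImpl::_threadMutex"}},
        {"ReplicationCoordinatorExternalStateImpl::_threadMutex",
         {"EgressConnectionCloserManager::_mutex"}},
    };
}

// Depth-first search; reaching a node that is still on the current path means a cycle.
bool dfs(const LockGraph& graph,
         const std::string& node,
         std::set<std::string>& onPath,
         std::set<std::string>& finished) {
    if (onPath.count(node)) return true;
    if (finished.count(node)) return false;
    onPath.insert(node);
    auto it = graph.find(node);
    if (it != graph.end()) {
        for (const auto& next : it->second) {
            if (dfs(graph, next, onPath, finished)) return true;
        }
    }
    onPath.erase(node);
    finished.insert(node);
    return false;
}

// Returns true if any sequence of "held --> acquired" edges loops back on itself.
bool hasLockOrderCycle(const LockGraph& graph) {
    std::set<std::string> onPath;
    std::set<std::string> finished;
    for (const auto& entry : graph) {
        if (dfs(graph, entry.first, onPath, finished)) return true;
    }
    return false;
}

// Usage helper: the edges reported in this ticket form a cycle.
void checkReportedCycle() {
    assert(hasLockOrderCycle(buildReportedGraph()));
}
```

      Removing any single edge from the graph (for example, no longer taking the ECM mutex while _threadMutex is held) is enough to make hasLockOrderCycle return false in this sketch.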

      Because the cycle involves five mutexes and has only ever been reported by TSAN, an actual deadlock is probably very unlikely, and this may not be worth fixing. However, it requires a TSAN suppression and makes the code and its mutex usage harder to reason about, so introducing an explicit ordering between these locks may be helpful.
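
      One common way to introduce an ordering between locks is to give each mutex a rank and reject any acquisition that does not strictly increase the rank held by the current thread. The sketch below illustrates that general technique under stated assumptions; it is not a proposed patch. RankedMutex and acquiresInOrder are hypothetical names, the rank values are arbitrary, and the ticket does not say which order the five mutexes should take.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>
#include <vector>

// Hypothetical wrapper: a mutex that may only be taken in increasing rank order
// by any one thread. Assumes locks are released in reverse order of acquisition.
class RankedMutex {
public:
    explicit RankedMutex(int rank) : _rank(rank) {}

    void lock() {
        auto& held = heldRanks();
        // Refuse to acquire a mutex whose rank is not above the most recent one held.
        if (!held.empty() && held.back() >= _rank) {
            throw std::logic_error("lock order violation");
        }
        _mutex.lock();
        held.push_back(_rank);
    }

    void unlock() {
        auto& held = heldRanks();
        // LIFO release is an assumption of this sketch.
        assert(!held.empty() && held.back() == _rank);
        held.pop_back();
        _mutex.unlock();
    }

private:
    // Ranks currently held by the calling thread, innermost last.
    static std::vector<int>& heldRanks() {
        thread_local std::vector<int> ranks;
        return ranks;
    }

    std::mutex _mutex;
    const int _rank;
};

// Usage helper: returns true if taking `inner` while holding `outer` respects the ranks.
bool acquiresInOrder(RankedMutex& outer, RankedMutex& inner) {
    std::lock_guard<RankedMutex> holdOuter(outer);
    try {
        std::lock_guard<RankedMutex> holdInner(inner);
        return true;
    } catch (const std::logic_error&) {
        return false;
    }
}
```

      With ranks assigned, any code path that closes the cycle would fail deterministically at the offending acquisition instead of relying on TSAN to observe the inversion. Throwing is used here only to keep the sketch testable; a real check would more likely abort or be compiled in for debug builds only.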

            Assignee: Unassigned
            Reporter: George Wangensteen (george.wangensteen@mongodb.com)
            Votes: 0
            Watchers: 6
