-
Type: Bug
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
Networking & Observability
-
ALL
-
Networking & Obs 2024-10-14, Networking & Obs 2024-10-28, Networking & Obs 2024-11-11
In particular, we have a cyclic lock ordering of ConnPool --> ShardRegistry --> ReplCoordImpl --> ReplCoordExternalStateImpl --> ECM (EgressConnectionCloserManager) --> ConnPool, which can arise as follows:
ConnPool mutex is taken while holding ECM mutex via ECM::dropConnections --> ConnectionPool::dropConnections
ShardRegistry mutex is taken while holding ConnPool mutex via ConnectionPool::get --> ConnectionPool::_get --> STEP::addHost --> ShardRegistry::getConfigServer
ReplicationCoordinatorImpl::_mutex is taken while holding ShardRegistry mutex via ShardingReplicaSetChangeListener::onConfirmedSet --> ShardRegistry::updateReplSetHosts --> ShardRegistry::createShard --> RemoteCommandTargeter::create --> ReplicationCoordinatorImpl::getConfig
ReplicationCoordinatorExternalStateImpl::_threadMutex is taken while holding ReplicationCoordinatorImpl::_mutex via ReplSetGetStatus --> ReplicationCoordinatorImpl::processReplSetGetStatus --> ReplicationCoordinatorExternalStateImpl::tooStale
EgressConnectionCloserManager::_mutex is taken while holding ReplicationCoordinatorExternalStateImpl::_threadMutex via ReplicationCoordinatorExternalStateImpl::_stopDataReplication --> ~ThreadPoolTaskExecutor --> ~NetworkInterfaceTL --> ~ConnectionPool --> ECM::remove
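To make the cycle easier to see, the five acquisitions above can be modeled as edges in a "held, then acquired" graph and checked with a depth-first search. This is only an illustrative sketch: the node names are the abbreviations used in this ticket, and `LockEdge`, `reportedEdges`, and `hasLockOrderCycle` are hypothetical helpers, not server code.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Each edge means "the second lock was acquired while the first was held".
using LockEdge = std::pair<std::string, std::string>;
using LockGraph = std::map<std::string, std::vector<std::string>>;

// The five acquisitions listed in this ticket (hypothetical helper).
inline std::vector<LockEdge> reportedEdges() {
    return {
        {"ECM", "ConnPool"},
        {"ConnPool", "ShardRegistry"},
        {"ShardRegistry", "ReplCoordImpl"},
        {"ReplCoordImpl", "ReplCoordExternalStateImpl"},
        {"ReplCoordExternalStateImpl", "ECM"},
    };
}

inline LockGraph buildGraph(const std::vector<LockEdge>& edges) {
    LockGraph graph;
    for (const auto& edge : edges) {
        graph[edge.first].push_back(edge.second);
        graph[edge.second];  // ensure every lock has a node, even with no out-edges
    }
    return graph;
}

// Depth-first search: reaching a node that is still on the current path
// closes a cycle.
inline bool dfs(const LockGraph& graph,
                const std::string& node,
                std::set<std::string>& onPath,
                std::set<std::string>& done) {
    if (onPath.count(node)) return true;
    if (done.count(node)) return false;
    onPath.insert(node);
    for (const auto& next : graph.at(node)) {
        if (dfs(graph, next, onPath, done)) return true;
    }
    onPath.erase(node);
    done.insert(node);
    return false;
}

// Returns true if the acquisition edges contain a lock-order cycle.
inline bool hasLockOrderCycle(const std::vector<LockEdge>& edges) {
    const LockGraph graph = buildGraph(edges);
    std::set<std::string> onPath;
    std::set<std::string> done;
    for (const auto& entry : graph) {
        if (dfs(graph, entry.first, onPath, done)) return true;
    }
    return false;
}
```

With all five edges present, `hasLockOrderCycle(reportedEdges())` reports a cycle; dropping any single edge (for example the ~ConnectionPool --> ECM::remove path) breaks it.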
Because this cycle spans five locks and has only ever been reported by TSAN, an actual deadlock is very unlikely and a fix may not be worth the effort. However, it requires a TSAN suppression and makes the code and its mutex usage harder to reason about, so establishing a consistent order between these locks may be helpful.
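One possible way to express an ordering between locks is to give each mutex a rank and require every thread to acquire them in strictly increasing rank. The sketch below is a generic illustration of that idea, not a proposal for the actual fix: `RankedMutex` and the ranks are hypothetical, and it assumes locks are released in LIFO order (as with nested `std::lock_guard`).

```cpp
#include <mutex>
#include <stdexcept>
#include <vector>

// Hypothetical wrapper: a mutex with a fixed rank. A thread may only lock a
// mutex whose rank is strictly greater than every rank it already holds.
class RankedMutex {
public:
    explicit RankedMutex(int rank) : _rank(rank) {}

    void lock() {
        auto& held = heldRanks();
        if (!held.empty() && held.back() >= _rank) {
            // Report the violation instead of risking a deadlock.
            throw std::logic_error("lock order violation");
        }
        _mutex.lock();
        held.push_back(_rank);
    }

    void unlock() {
        // Assumes LIFO release, so the most recent rank is ours.
        heldRanks().pop_back();
        _mutex.unlock();
    }

private:
    // Ranks currently held by the calling thread, in acquisition order.
    static std::vector<int>& heldRanks() {
        thread_local std::vector<int> ranks;
        return ranks;
    }

    const int _rank;
    std::mutex _mutex;
};
```

Because `RankedMutex` provides `lock()` and `unlock()`, it can be used with `std::lock_guard<RankedMutex>`. Under such a scheme, whichever edge in the cycle goes against the chosen ranks would fail immediately on every run rather than only under a rare interleaving.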
- related to
-
SERVER-88159 mongo::Mutex masks TSAN's ability to detect a lock order inversion
- Closed
-
SERVER-93029 NetworkInterfaceThreadPool is an OutOfLineExecutor that regularly runs things inline
- Backlog