Currently, IndexBuildsCoordinatorMongod::voteCommitIndexBuild() violates the lock ordering: it tries to acquire the RSTL in mode IX while holding ReplIndexBuildState::mutex. As a result, it can deadlock with the step-up code path (ReplicationCoordinatorImpl::signalDrainComplete), which acquires the RSTL in mode X first and then tries to send an abort or commit signal to the index build, which requires ReplIndexBuildState::mutex.
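The ordering rule can be illustrated with a minimal, self-contained sketch. It is not the server code: toyRstl, ToyIndexBuildState, voteCommit, signalOnStepUp and runBothPaths are hypothetical names, and a std::shared_mutex only approximates the RSTL (shared stands in for mode IX, exclusive for mode X). Both paths take the RSTL stand-in first and the state mutex second, so neither can wait on the other while holding the lock the other needs.

```cpp
#include <mutex>
#include <shared_mutex>
#include <thread>

// Hypothetical stand-in for ReplIndexBuildState; not MongoDB's real type.
struct ToyIndexBuildState {
    std::mutex mutex;  // plays the role of ReplIndexBuildState::mutex
    int votes = 0;
    int signals = 0;
};

// Hypothetical stand-in for the RSTL: shared ~ mode IX, exclusive ~ mode X.
std::shared_mutex toyRstl;

// Vote path: acquire the RSTL stand-in first, then the state mutex.
void voteCommit(ToyIndexBuildState& state) {
    std::shared_lock<std::shared_mutex> rstl(toyRstl);
    std::lock_guard<std::mutex> lk(state.mutex);
    ++state.votes;
}

// Step-up path: same order, but the RSTL stand-in is taken exclusively.
void signalOnStepUp(ToyIndexBuildState& state) {
    std::unique_lock<std::shared_mutex> rstl(toyRstl);
    std::lock_guard<std::mutex> lk(state.mutex);
    ++state.signals;
}

// Runs both paths concurrently and returns the number of completed operations.
int runBothPaths(int iterations) {
    ToyIndexBuildState state;
    std::thread voter([&] {
        for (int i = 0; i < iterations; ++i) voteCommit(state);
    });
    std::thread stepUp([&] {
        for (int i = 0; i < iterations; ++i) signalOnStepUp(state);
    });
    voter.join();
    stepUp.join();
    return state.votes + state.signals;
}
```

If voteCommit instead took state.mutex before toyRstl, the two threads could each hold one lock while waiting for the other, which is the deadlock shape described above.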
Note:
The ticket also addresses 3 more issues.
1) Currently, the index build (an internal system thread) holds the RSTL with an uninterruptible lock guard enabled, so it blocks replication state transitions such as step up and step down. (SERVER-44045)
2) We acquire the collection lock in a stronger mode (mode X) in order to commit or abort. As a result, this can lead to deadlocks involving prepared transactions, stepdown, and the IndexBuildsCoordinator. (SERVER-44722)
3) Currently, IndexBuildsCoordinatorMongod::_waitForNextIndexBuildAction() holds the RSTL only for the scope of the while loop. As a result, the primary check that we do at this line may no longer be valid once the lock is released. (SERVER-46989)
4) Also, the index build retries its vote on error without checking for interrupts, such as shutdown interrupts. This makes shutdown hang forever, as it waits for the index builds to complete.
UPDATE: This ticket won't address the 3 additional issues. They will be addressed separately.
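For issue 4, the general shape of a retry loop that stays interruptible can be sketched as follows. This is only an illustration under stated assumptions: VoteResult, voteWithRetry, tryVote and shutdownRequested are hypothetical names, and a std::atomic<bool> flag stands in for whatever interrupt check the server actually uses.

```cpp
#include <atomic>
#include <functional>

// Hypothetical result type for the sketch.
enum class VoteResult { kSucceeded, kInterrupted };

// Retries tryVote until it succeeds, but checks the (hypothetical) shutdown
// flag before every attempt so that a shutdown is never blocked by the loop.
// attempts is set to the number of times tryVote was actually called.
VoteResult voteWithRetry(const std::function<bool()>& tryVote,
                         const std::atomic<bool>& shutdownRequested,
                         int& attempts) {
    attempts = 0;
    while (true) {
        if (shutdownRequested.load()) {
            return VoteResult::kInterrupted;  // give up instead of retrying forever
        }
        ++attempts;
        if (tryVote()) {
            return VoteResult::kSucceeded;
        }
    }
}
```

Without the flag check at the top of the loop, a vote that keeps failing would spin indefinitely, which matches the shutdown hang described in issue 4.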
- duplicates
  - SERVER-46989 Index builds should hold RSTL to prevent replication state changes after deciding to commit or abort (Closed)
- is related to
  - SERVER-46910 2 phase index builds should not try to vote when shutdown is in progress. (Closed)
  - SERVER-46917 Index builder on receiving commit/abort signal should cancel the active callback handle for the remote vote command request. (Closed)
- related to
  - SERVER-42621 3 way deadlock can happen between hybrid index build, prepared transactions and stepdown thread. (Closed)
  - SERVER-44722 3 way deadlock can happen between hybrid index build, prepared transactions and stepdown thread on primary that runs index build via coordinator. (Closed)
  - SERVER-46664 runCmdOnPrimaryAndAwaitResponse() should not run DBDirect client command with the rstl lock held. (Closed)
  - SERVER-44045 allow secondary index builds to start without unlocking the RSTL (Closed)