After yielding, operations restore their lock state via restoreLockState. This function iterates over each lock that was previously held and tries to reacquire it in sorted order. However, we never try to reacquire the FCV lock, which should be reacquired after the PBWM. When we then check whether the next lock to reacquire is the RSTL, the check fails because the next lock is actually the FCV lock, which we never checked for. We then acquire the global lock (including acquiring a read ticket) without holding either the FCV lock or the RSTL.
Once that is done, we will reacquire all the other locks we held, which in this case includes the RSTL (but now out of order).
When the stepdown thread starts, it enqueues the RSTL in X mode, which jumps to the top of the queue. At the same time, there will be operations that hold the RSTL in IX mode but are waiting to acquire read tickets, which prevents the stepdown thread from proceeding. If all read tickets in the system are exhausted, these threads are stuck waiting while holding the RSTL, and the threads holding the read tickets cannot make progress either, since they are queued behind the stepdown thread waiting for the RSTL.
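The circular wait described above can be illustrated with a small wait-for graph. This is only a sketch: the node names (`stepdown`, `op_holding_rstl`, `op_holding_ticket`) and the `find_cycle` helper are hypothetical and are not part of the server code. Each edge means "is blocked on".

```python
# Hypothetical model of the wait-for relationships; not server code.
def find_cycle(wait_for):
    """Return one cycle in the wait-for graph as a list of nodes, or None."""
    visiting = []   # current DFS path
    done = set()    # nodes fully explored without finding a cycle

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge: the path from the first occurrence of node is a cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in wait_for.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in wait_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# All read tickets are exhausted:
#   stepdown waits for the RSTL (X), blocked by the operation holding it in IX;
#   that operation waits for a read ticket held by another operation;
#   the ticket holder waits for the RSTL behind the stepdown thread's X request.
exhausted = {
    "stepdown": ["op_holding_rstl"],
    "op_holding_rstl": ["op_holding_ticket"],
    "op_holding_ticket": ["stepdown"],
}

# A ticket is available: the RSTL holder is not blocked, so nothing is circular.
ticket_available = {
    "stepdown": ["op_holding_rstl"],
    "op_holding_rstl": [],
    "op_holding_ticket": ["stepdown"],
}
```

In this model, `find_cycle(exhausted)` returns the three-node cycle, while `find_cycle(ticket_available)` returns `None`, because the operation holding the RSTL can obtain a ticket, finish, and release the lock.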
There is also a variation of this that can happen on step up when we are holding the RSTL and waiting on ticket acquisition.
We should be accounting for the FCV lock when we restore locks.
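To make the ordering problem concrete, here is a minimal Python model of the restore sequence. It is a sketch under assumptions, not the actual restoreLockState implementation: the function name `restore_order`, the `handle_fcv` flag, and the string labels are hypothetical, and the model assumes the sorted order is PBWM, FCV, RSTL, then the global lock.

```python
# Hypothetical, simplified model of lock restoration; not server code.
PBWM, FCV, RSTL, GLOBAL = "PBWM", "FCV", "RSTL", "Global"
SORT_ORDER = [PBWM, FCV, RSTL, GLOBAL]  # assumed sorted order


def restore_order(saved_locks, handle_fcv):
    """Return the order in which the saved locks would be reacquired."""
    # The global lock is reacquired in its own step, so exclude it here.
    pending = sorted((l for l in saved_locks if l != GLOBAL),
                     key=SORT_ORDER.index)

    # Locks that must be taken before the global lock. Each one is only
    # reacquired early if it is the next pending lock.
    early = [PBWM, FCV, RSTL] if handle_fcv else [PBWM, RSTL]

    order = []
    i = 0
    for expected in early:
        if i < len(pending) and pending[i] == expected:
            order.append(expected)
            i += 1

    # Global lock acquisition also acquires a ticket.
    order.append("Global+ticket")

    # Everything not handled above is reacquired afterwards.
    order.extend(pending[i:])
    return order


saved = [PBWM, FCV, RSTL, GLOBAL]
buggy = restore_order(saved, handle_fcv=False)
fixed = restore_order(saved, handle_fcv=True)
```

With `handle_fcv=False`, the RSTL check sees the FCV lock as the next pending lock and is skipped, so `buggy` is `["PBWM", "Global+ticket", "FCV", "RSTL"]`: the ticket is taken before the RSTL. With `handle_fcv=True`, `fixed` is `["PBWM", "FCV", "RSTL", "Global+ticket"]`.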
is caused by
- SERVER-65821 Deadlock during setFCV when there are prepared transactions that have not persisted commit/abort decision (Closed)
related to
- SERVER-84353 The test for stepDown deadlock with read ticket exhaustion is flaky (Closed)
- SERVER-75262 Add a passthrough test that exercises ticket exhaustion (Closed)