- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- None
- Fully Compatible
- ALL
- v4.2
- Repl 2019-07-01, Repl 2019-07-15, Repl 2019-07-29
Currently, as part of yieldLocksForPreparedTransactions(), step down unstashes each prepared transaction's lock resources from the transactionParticipant to its opCtx with maxLockTimeout set to a non-zero value (5ms by default). This means we reacquire the ticket with maxLockTimeout set. That reacquisition can fail, and if it is an unconditional step down (step down via heartbeat or force reconfig), the failure can lead to a server crash.
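The difference between a bounded and an unbounded wait for a ticket can be illustrated with a small Python sketch. This is not server code: the ticket pool is modeled as a `threading.Semaphore`, and `acquire_ticket`, `demo_bounded_vs_unbounded`, and the 200ms hold time are hypothetical names and values chosen for illustration. Only the 5ms default comes from the description above.

```python
import threading

MAX_LOCK_TIMEOUT_S = 0.005  # 5ms, the default maxLockTimeout mentioned above


def acquire_ticket(pool, timeout_s=None):
    """Try to take one ticket; timeout_s=None means wait indefinitely."""
    return pool.acquire(timeout=timeout_s)


def demo_bounded_vs_unbounded(hold_s=0.2):
    """Hold the only ticket for hold_s seconds, then compare both waits."""
    pool = threading.Semaphore(1)
    pool.acquire(blocking=False)  # some other operation holds the only ticket

    # Hypothetical: the holder gives the ticket back after hold_s seconds.
    releaser = threading.Timer(hold_s, pool.release)
    releaser.start()

    # A bounded wait gives up if the ticket is not free within 5ms.
    bounded = acquire_ticket(pool, MAX_LOCK_TIMEOUT_S)
    # An unbounded wait blocks until the holder releases the ticket.
    unbounded = acquire_ticket(pool)

    releaser.join()
    return bounded, unbounded
```

With the ticket held far longer than the timeout, the bounded attempt fails while the unbounded one eventually succeeds. A failure of the bounded attempt is the case step down has to survive.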
Extra Notes:
Below is a scenario in which we can run out of tickets.
Assume MaxTicketsAvailable=10.
1. 10 prepared txns have each acquired a ticket while unstashing their txn resources.
2. An 11th prepared txn is waiting to acquire a ticket.
3. Step down sets the canAcceptNonLocalWrite flag to false with the RSTL held in X mode and the repl mutex held.
4. yieldLocksForPreparedTransactions() - The session catalog scan marks the 1st prepared txn as killed.
5. Let’s assume the 1st prepared txn checked in its session (releasing its ticket) because it found out it was killed while acquiring the RSTL in IX mode. As a result, one ticket is now available.
6. The 11th session acquires that ticket. Now no more tickets are available.
7. yieldLocksForPreparedTransactions() - The session catalog scan marks the 11th transaction as killed.
8. yieldLocksForPreparedTransactions() - Step down now attempts to check in the session and unstash the transaction. To do so, step down must wait for a ticket to become available, and that wait can time out.
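The steps above can be replayed in a single-threaded Python sketch. It is a simplified model, not the server's implementation: the ticket pool is a `threading.Semaphore`, transactions are plain integers, and `simulate_step_down` and the returned field names are hypothetical. The RSTL, the repl mutex, and session check-in are not modeled; only ticket accounting is.

```python
import threading

MAX_TICKETS = 10            # MaxTicketsAvailable in the scenario
MAX_LOCK_TIMEOUT_S = 0.005  # 5ms default maxLockTimeout


def simulate_step_down():
    tickets = threading.Semaphore(MAX_TICKETS)
    holders = []

    # Step 1: ten prepared txns each take a ticket while unstashing.
    for txn in range(1, 11):
        if tickets.acquire(blocking=False):
            holders.append(txn)

    # Step 2: the 11th txn cannot get a ticket yet, so it is waiting.
    eleventh_waiting = not tickets.acquire(blocking=False)

    # Steps 4-5: txn 1 is marked killed and releases its ticket.
    holders.remove(1)
    tickets.release()

    # Step 6: the waiting 11th txn takes the freed ticket.
    if tickets.acquire(blocking=False):
        holders.append(11)

    # Steps 7-8: step down needs a ticket to unstash txn 11, but all
    # ten are held, so the bounded wait times out.
    step_down_got_ticket = tickets.acquire(timeout=MAX_LOCK_TIMEOUT_S)

    return {
        "holders": holders,
        "eleventh_waiting": eleventh_waiting,
        "step_down_got_ticket": step_down_got_ticket,
    }
```

In this model, step down's acquisition fails because every ticket is held by a prepared transaction that has not yet yielded, which is the timeout described in step 8.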
- related to: SERVER-41556 Must handle failure to reacquire locks and ticket when unstashing transaction (Closed)