-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: 7.1.0-rc0
-
Component/s: None
-
None
-
Fully Compatible
-
ALL
-
Execution NAMR Team 2023-07-24
-
140
SERVER-73539 introduced replay protection for setAllowMigrations. As part of those changes (and the subsequent fix, SERVER-78021), we create an AlternativeClientRegion in which a transaction with a majority write concern is performed. Currently the JournalFlusher is an unkillable thread that tries to acquire the RSTL lock while waiting until all commits made before the call are durable in the journal. In the presence of a stepdown, the following scenario can therefore happen on the config server:
1. A thread running setAllowMigrations (which has checked out a session) waits for the metadata changes to be majority committed.
2. A stepdown thread takes the RSTL lock and tries to check out the session held by thread 1 in order to kill it.
3. The JournalFlusher thread tries to take the RSTL lock, which is held by thread 2.
After step 3, the setAllowMigrations thread (1) is waiting for majority commit, the JournalFlusher thread (3), which must make those changes durable, is waiting for the RSTL lock, and the stepdown thread (2), which holds the RSTL lock, is waiting for the session to be checked in. The result is a three-way deadlock. Two stack traces showing the problem are attached to the ticket.
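The wait cycle described above can be sketched as a small wait-for graph. This is only an illustrative model: the node names are labels taken from the description, and `WAITS_FOR` and `find_cycle` are hypothetical names, not MongoDB identifiers.

```python
from typing import Dict, List, Optional

# Hypothetical model: each key is a thread, each value is the thread it waits on.
WAITS_FOR: Dict[str, str] = {
    # Needs majority commit, which requires the journal flush to complete.
    "setAllowMigrations": "JournalFlusher",
    # Needs the RSTL lock, which the stepdown thread holds.
    "JournalFlusher": "stepdown",
    # Needs the session that setAllowMigrations has checked out.
    "stepdown": "setAllowMigrations",
}


def find_cycle(graph: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from `start`; return the cycle if one is reached."""
    path: List[str] = []
    position: Dict[str, int] = {}
    node = start
    while node in graph:
        if node in position:
            # We came back to a node already on the path: that is the cycle.
            return path[position[node]:] + [node]
        position[node] = len(path)
        path.append(node)
        node = graph[node]
    return None  # Reached a thread that waits on nothing: no deadlock.


cycle = find_cycle(WAITS_FOR, "setAllowMigrations")
print(" -> ".join(cycle) if cycle else "no cycle")
```

Removing any one edge (for example, letting the JournalFlusher give up its wait for the RSTL lock) breaks the cycle in this model.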
One way to solve this is to make the JournalFlusher thread killable as well, like the main operation (in this case the setAllowMigrations thread).
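The idea of a killable waiter can be illustrated with a minimal sketch, assuming a plain mutex stands in for the RSTL lock and an event flag stands in for the kill signal. It is not the server's implementation; `acquire_interruptibly`, `Interrupted`, `rstl`, and `killed` are hypothetical names used only for this example.

```python
import threading


class Interrupted(Exception):
    """Raised when the waiter is killed before it obtains the lock."""


def acquire_interruptibly(lock: threading.Lock, killed: threading.Event,
                          poll_interval: float = 0.01) -> None:
    # Retry with a short timeout so the kill flag is checked between attempts.
    while not lock.acquire(timeout=poll_interval):
        if killed.is_set():
            raise Interrupted("wait for lock interrupted")


rstl = threading.Lock()      # stand-in for the RSTL lock (assumption)
killed = threading.Event()   # stand-in for a stepdown kill signal (assumption)
outcome = []

rstl.acquire()               # the "stepdown" side holds the lock throughout


def flusher() -> None:
    try:
        acquire_interruptibly(rstl, killed)
        outcome.append("acquired")
        rstl.release()
    except Interrupted:
        outcome.append("interrupted")


t = threading.Thread(target=flusher)
t.start()
killed.set()                 # kill the waiter instead of letting it block forever
t.join(timeout=2)
print(outcome)
```

In this model the waiter exits its wait once the kill flag is set, so the lock holder is no longer blocked behind it.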
- is related to
-
SERVER-55745 The Fuzzer can run killOp on the JournalFlusher thread and cause it to throw an unexpected error
- Closed
-
SERVER-73539 stopMigrations/resumeMigrations don't have replay protection
- Closed
-
SERVER-78021 Retrying setAllowMigrations command may end up in a deadlock
- Closed
-
SERVER-70127 Default system operations to be killable by stepdown
- Closed
-
SERVER-74657 revisit if thread marked as unkillable is okay to be killable for storage execution related
- Closed
- related to
-
SERVER-79810 make JournalFlusher::waitForJournalFlush() interruptible when waiting for write concern
- Closed
-
SERVER-79174 Improve journal flusher interruption handling
- Closed