Core Server / SERVER-79026

Failing to cancel the JournalFlusher thread might lead to 3-way deadlock

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 7.1.0-rc0
    • Affects Version/s: 7.1.0-rc0
    • Component/s: None
    • None
    • Fully Compatible
    • ALL
    • Execution NAMR Team 2023-07-24
    • 140

      SERVER-73539 introduced replay protection for setAllowMigrations. As part of those changes (and the subsequent fix, SERVER-78021), we create an AlternativeClientRegion in which a transaction with majority write concern is performed. The JournalFlusher is currently an unkillable thread that tries to acquire the RSTL lock while waiting for all commits issued before the call to become durable in the journal. As a result, if a stepdown occurs, the following scenario can happen on the config server:

      1. A thread running setAllowMigrations (which has checked out a session) waits for the metadata changes to be majority committed.
      2. A stepdown thread takes the RSTL lock and tries to check out the session held by 1. in order to kill it.
      3. The JournalFlusher thread tries to take the RSTL lock held by 2.

      After step 3, the setAllowMigrations thread (1) is waiting for majority commit, but the JournalFlusher thread (3), which waits for the changes to become durable, is waiting for the RSTL lock held by the stepdown thread (2), which in turn is waiting for the session checked out by (1) to be checked in. The result is a 3-way deadlock. Two stack traces showing the problem are attached to the ticket.
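
      The circular wait described above can be illustrated with a small wait-for graph. This is only a sketch: the thread names and the `find_cycle` helper are hypothetical and do not correspond to MongoDB identifiers, and the edge from setAllowMigrations to the JournalFlusher assumes, as the description implies, that majority commit cannot complete until the changes are durable.

```python
# Hypothetical model of the wait-for relationships in the scenario.
# Keys wait on values; names are illustrative, not MongoDB identifiers.
WAITS_FOR = {
    # Waits for majority commit (assumed to depend on journal durability).
    "setAllowMigrations": "JournalFlusher",
    # Waits for the RSTL lock held by the stepdown thread.
    "JournalFlusher": "stepdown",
    # Waits for the session checked out by setAllowMigrations.
    "stepdown": "setAllowMigrations",
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle as a list, or None."""
    path = []
    seen = set()
    node = start
    while node is not None and node not in seen:
        seen.add(node)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended: nobody is blocked forever
    # `node` was already visited, so the path from it back to itself is a cycle.
    return path[path.index(node):] + [node]


cycle = find_cycle(WAITS_FOR, "setAllowMigrations")
print(" -> ".join(cycle))
```

      Removing any one edge (for example, by letting the JournalFlusher abandon its wait for the RSTL lock) makes `find_cycle` return `None`, which is the intuition behind the proposed fix.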

      One way to solve this is to make the JournalFlusher thread killable as well, like the main operation (in this case the setAllowMigrations thread).
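
      As a rough illustration of what "killable" means here, the sketch below shows a waiter that gives up its lock acquisition when a kill signal arrives. It is not the server's implementation: `acquire_interruptibly`, `Interrupted`, and `run_demo` are hypothetical names, and a plain `threading.Lock` stands in for the RSTL lock.

```python
import threading


class Interrupted(Exception):
    """Raised when a waiter is killed while blocked on a lock."""


def acquire_interruptibly(lock, killed, poll_interval=0.01):
    """Acquire `lock`, but give up if `killed` is set while waiting."""
    while not lock.acquire(timeout=poll_interval):
        if killed.is_set():
            raise Interrupted("waiter was killed while blocked on the lock")


def run_demo():
    rstl = threading.Lock()     # stand-in for the RSTL lock
    killed = threading.Event()  # stand-in for a kill signal
    outcome = []

    def flusher():
        # Stand-in for the JournalFlusher trying to take the lock.
        try:
            acquire_interruptibly(rstl, killed)
            rstl.release()
            outcome.append("acquired")
        except Interrupted:
            outcome.append("interrupted")

    rstl.acquire()              # the "stepdown" side holds the lock
    worker = threading.Thread(target=flusher)
    worker.start()
    killed.set()                # kill the waiter instead of waiting on it
    worker.join(timeout=5)
    rstl.release()
    return outcome


print(run_demo())
```

      Because the waiter stops blocking once it is killed, the lock holder never has to wait on it, so the wait cycle cannot close.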

        1. gdb_s0_n2.txt
          1.18 MB
        2. gdb.BFG-2016684.c_n1.txt
          1.37 MB

            Assignee:
            gregory.wlodarek@mongodb.com Gregory Wlodarek
            Reporter:
            marcos.grillo@mongodb.com Marcos José Grillo Ramirez
            Votes:
            0
            Watchers:
            6
