Core Server / SERVER-68661

Deadlock with transactions after step down

    • Type: Bug
    • Resolution: Gone away
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None

      After SERVER-66972 (which delegates the refresh of the database version to another thread), the JS test retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart.js started hitting a deadlock involving prepared transactions after a node steps down. The function that refreshes the local database metadata tries to acquire an X lock on the database, but it waits indefinitely because a prepared transaction has acquired, and still holds, an IX lock on the same database.
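      The conflict described above follows from the usual multi-granularity lock compatibility rules. The sketch below is a simplified model of those rules, not the server's lock manager; the names COMPATIBLE and can_grant are made up for illustration.

```python
# Simplified multi-granularity lock compatibility table.
# Illustrative only: this is not the server's lock manager code.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),  # exclusive: compatible with no other holder
}


def can_grant(requested, granted_modes):
    """Return True if `requested` is compatible with every granted mode."""
    return all(held in COMPATIBLE[requested] for held in granted_modes)


# The prepared transaction holds IX on the database.
granted = ["IX"]

# Another intent-exclusive request could be granted alongside it...
print(can_grant("IX", granted))
# ...but the refresh thread's X request has to wait until IX is released.
print(can_grant("X", granted))
```

      Because a prepared transaction keeps its locks until it is committed or aborted, the X request stays pending for as long as the transaction remains prepared.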

      Unfortunately, this is blocking the fix for SERVER-66972 (Database critical section does not serialize with ongoing refreshes).

      In detail, the JS test runs the command below, which never returns:

      {  "abortTransaction" : 1,  "lsid" : {  "id" : UUID("372cc71c-e185-409d-ad2e-465e053764a4"),  "txnNumber" : NumberLong(35),  "txnUUID" : UUID("e8666f6d-e097-4ccd-ad30-608b7494aebc") },  "txnNumber" : NumberLong(0),  "autocommit" : false }
      

      The lock manager dump shows that the RefreshDbVersionThread is waiting to acquire an X lock on the database, while the resource is already IX-locked by another thread working on the transaction e8666f6d-e097-4ccd-ad30-608b7494aebc:

      {
         "lockAddr":"0x7f21e5c89b20",
         "resourceId":"{6237343057549539649: Database, 1625657039122151745, config}",
         "granted":[
            {
               "lockRequest":"0x7aa",
               "lockRequestAddr":"0x7f21e5cb6020",
               "thread":"thread::id of a non-executing thread",
               "mode":"IX",
               "convertMode":"NONE",
               "enqueueAtFront":false,
               "compatibleFirst":false,
               "debugInfo":"lsid: { id: UUID(\"372cc71c-e185-409d-ad2e-465e053764a4\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855), txnNumber: 35, txnUUID: UUID(\"e8666f6d-e097-4ccd-ad30-608b7494aebc\") }"
            }
         ],
         "pending":[
            {
               "lockRequest":"0x911",
               "lockRequestAddr":"0x7f21ec477820",
               "thread":"139783425423104",
               "mode":"X",
               "convertMode":"NONE",
               "enqueueAtFront":false,
               "compatibleFirst":false,
               "debugInfo":"",
               "clientInfo":{
                  "desc":"RefreshDbVersionThread",
                  "opid":2311
               }
            },
            ...
         ]
      }
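      A dump entry like the one above can be scanned mechanically to list which requests are waiting and on whom. The following is a minimal sketch, assuming the entry has already been loaded as a Python dict; the helper name blocked_requests is hypothetical, and the sample dict is a trimmed copy of the entry above (the elided pending requests are omitted and debugInfo is shortened).

```python
# Trimmed copy of the lock-manager entry shown in this ticket.
# Elided pending requests are left out; debugInfo is shortened.
entry = {
    "resourceId": "{6237343057549539649: Database, 1625657039122151745, config}",
    "granted": [
        {
            "thread": "thread::id of a non-executing thread",
            "mode": "IX",
            "debugInfo": 'txnUUID: UUID("e8666f6d-e097-4ccd-ad30-608b7494aebc")',
        }
    ],
    "pending": [
        {
            "thread": "139783425423104",
            "mode": "X",
            "debugInfo": "",
            "clientInfo": {"desc": "RefreshDbVersionThread", "opid": 2311},
        }
    ],
}


def blocked_requests(lock_entry):
    """Yield (waiter, requested_mode, holders) for each pending request."""
    holders = [(g["mode"], g.get("debugInfo", "")) for g in lock_entry["granted"]]
    for p in lock_entry["pending"]:
        # Prefer the client description; fall back to the raw thread id.
        waiter = p.get("clientInfo", {}).get("desc", p["thread"])
        yield waiter, p["mode"], holders


for waiter, mode, holders in blocked_requests(entry):
    print(f"{waiter} waits for {mode} on {entry['resourceId']}")
    for held_mode, info in holders:
        print(f"  held in {held_mode} by: {info}")
```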
      

      Additional details:

      This problem looks similar to (but is not the same as) SERVER-62951.

            Assignee: Kaloian Manassiev (kaloian.manassiev@mongodb.com)
            Reporter: Antonio Fuschetto (antonio.fuschetto@mongodb.com)
            Votes: 0
            Watchers: 6
