  1. MongoDB Database Tools
  2. TOOLS-3658

Investigate changes in SERVER-92530: Limit the time a transaction waits for a refresh on a shard in recovery state

    • Type: Investigation
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Tools and Replicator
    • 71

      Original Downstream Change Summary

      Renamed the metadataRefreshInTransactionMaxWaitBehindCritSecMS configuration parameter to metadataRefreshInTransactionMaxWaitMS.
      metadataRefreshInTransactionMaxWaitBehindCritSecMS remains valid but is deprecated.
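
      As a rough illustration of what the rename could mean for downstream code that reads this setting, the sketch below looks up the wait limit under either name. The helper resolve_max_wait_ms and the plain dictionary of parameters are hypothetical, and preferring the new name when both are present is an assumption; the ticket does not say how the server resolves that case.

```python
from typing import Mapping, Optional

# Parameter names taken from the change summary above.
NEW_NAME = "metadataRefreshInTransactionMaxWaitMS"
DEPRECATED_NAME = "metadataRefreshInTransactionMaxWaitBehindCritSecMS"


def resolve_max_wait_ms(params: Mapping[str, int]) -> Optional[int]:
    """Return the configured wait limit in milliseconds, or None if unset.

    Hypothetical helper: prefers the new name and falls back to the
    deprecated one. The precedence is an assumption, not documented
    server behaviour.
    """
    if NEW_NAME in params:
        return params[NEW_NAME]
    return params.get(DEPRECATED_NAME)
```

      Code that only knows the old name keeps working while the deprecated parameter is still accepted, but it would need a fallback like this one to pick up configurations that use the new name.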

      Description of Linked Ticket

      As part of SERVER-59965, a “circuit breaker” was introduced to prevent a transaction from deadlocking on the critical section (the issue is described in detail in that ticket).

      However, as part of BF-34016 we realised a transaction can also block when the filtering metadata are UNKNOWN (as the shard won’t serve reads or writes).

      This can be more problematic when the shard is in the recovery state as part of step-up, where all migrations are recovered, causing reads and writes to be blocked for some time (as the metadata are cleared).

      As several comments on BF-34016 explain, this can cause a deadlock: the transaction might hold a lock that prevents the migration abort (part of migration recovery) from completing, while the migration prevents the transaction from committing because the shard version is UNKNOWN.

      The goal of this ticket is to limit the time a transaction spends waiting for the shard version recovery, similar to what was done in SERVER-59965 for a transaction waiting on the critical section. In such a rare case, this allows the transaction to abort, letting the migration abort complete.
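
      The bounded wait described above can be sketched in a few lines. This is only a conceptual model, not the server's implementation: the names RefreshWaitTimeout and wait_for_refresh are hypothetical, and a threading.Event stands in for "the shard version has been recovered".

```python
import threading


class RefreshWaitTimeout(Exception):
    """Hypothetical error raised when the wait limit is exceeded."""


def wait_for_refresh(refresh_done: threading.Event, max_wait_ms: int) -> None:
    """Wait for the refresh to finish, but never longer than max_wait_ms.

    Event.wait returns False when the timeout elapses first. In that case
    the caller is expected to abort the transaction, which releases its
    locks so that the blocked recovery work can proceed.
    """
    if not refresh_done.wait(timeout=max_wait_ms / 1000.0):
        raise RefreshWaitTimeout(
            f"metadata refresh did not complete within {max_wait_ms} ms"
        )
```

      Giving up after a fixed limit turns a potential deadlock into a failed transaction, on the assumption that the transaction can be retried once the migration recovery has finished.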

      Note that transactions already have a timeout of 1 minute, and BF-34016 will be implicitly fixed by SERVER-86727. For this reason, this ticket can be marked as an improvement to prevent similar issues from happening again.

            Assignee:
            Unassigned
            Reporter:
            backlog-server-pm Backlog - Core Eng Program Management Team
            Votes:
            0
            Watchers:
            2

              Created:
              Updated: