  1. Core Server
  2. SERVER-67334

Accessing the opCtx decoration 'tenantIdToDeleteDecoration' from the on-commit hook of TenantMigrationRecipientOpObserver::onDelete() is not safe after the TTL batch deletion feature.

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 6.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • None
    • Fully Compatible
    • Server Serverless 2022-06-27
    • 143

      PM-2227 made the TTL monitor perform batch deletes. As a result, the system can crash when the on-commit hook of TenantMigrationRecipientOpObserver::onDelete() accesses the uninitialized boost::optional value 'tenantIdToDeleteDecoration' (an opCtx decoration). Consider the scenario below.

      Assume we started 2 migrations, for tenants T1 and T2, with donor replica set rs0 and recipient replica set rs1.
      1) Migration T1 committed successfully, and its recipient (R) state doc was marked for garbage collection with its expiry time set to timestamp TS1. This migration should have an R access blocker installed for T1.
      2) Migration T2 is still in progress when the cloud decides to abort it, and the R primary receives the recipientForgetMigration cmd before the recipientSyncData cmd. As a result, no recipient access blocker is created for migration T2. The R state doc for this migration is also marked for garbage collection, and its expiry time is also set to timestamp TS1.
      3) When the TTL monitor scans the recipient state doc collection for expired documents, it finds 2 documents to delete. It deletes them as a batch, performing both deletes in a single recovery unit using the same opCtx. Assume the state docs are deleted in this order:
          i) Delete state doc for T1 - This sets `tenantIdToDeleteDecoration` on the opCtx to T1 and registers the on-commit hook that deletes T1's R access blocker, as part of the tenant recipient op observer implementation.
         ii) Delete state doc for T2 - This sets `tenantIdToDeleteDecoration` on the opCtx to boost::none and does not register an on-commit hook, since there is no R tenant access blocker for T2.
      4) When the recovery unit of the TTL batch deletion commits, T1's on-commit hook runs and reads `tenantIdToDeleteDecoration`, which the T2 delete has since reset to boost::none. Accessing the uninitialized boost::optional value triggers an invariant failure and crashes the system.
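
      The scenario above can be modeled with a small standalone sketch. Nothing here is MongoDB server code: FakeOpCtx, FakeRecoveryUnit, StateDoc, Outcome, and runTtlBatch are hypothetical stand-ins, and std::optional substitutes for boost::optional. To keep the sketch free of undefined behavior, the hook records that it found an empty decoration instead of dereferencing it, which is the point where the real server hits the invariant. The "safe" variant, which captures the tenant id by value when the hook is registered, is only one possible mitigation; the ticket does not state which fix was actually applied.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for an OperationContext carrying one decoration.
struct FakeOpCtx {
    std::optional<std::string> tenantIdToDelete;  // models tenantIdToDeleteDecoration
};

// Hypothetical stand-in for a recovery unit that runs hooks on commit.
struct FakeRecoveryUnit {
    std::vector<std::function<void()>> onCommitHooks;
    void commit() {
        for (auto& hook : onCommitHooks) hook();
        onCommitHooks.clear();
    }
};

struct StateDoc {
    std::string tenantId;
    bool hasAccessBlocker;
};

struct Outcome {
    std::vector<std::string> removedBlockers;
    bool hookSawEmptyDecoration = false;  // the real server would fail an invariant here
};

// Unsafe pattern: the hook reads the shared decoration at commit time.
void onDeleteUnsafe(FakeOpCtx& opCtx, FakeRecoveryUnit& ru, const StateDoc& doc, Outcome& out) {
    if (doc.hasAccessBlocker) {
        opCtx.tenantIdToDelete = doc.tenantId;
    } else {
        opCtx.tenantIdToDelete = std::nullopt;
    }
    if (opCtx.tenantIdToDelete) {
        ru.onCommitHooks.push_back([&opCtx, &out] {
            if (!opCtx.tenantIdToDelete) {
                out.hookSawEmptyDecoration = true;
                return;
            }
            out.removedBlockers.push_back(*opCtx.tenantIdToDelete);
        });
    }
}

// Possible mitigation: copy the tenant id into the hook at registration time.
void onDeleteSafe(FakeOpCtx& opCtx, FakeRecoveryUnit& ru, const StateDoc& doc, Outcome& out) {
    if (doc.hasAccessBlocker) {
        opCtx.tenantIdToDelete = doc.tenantId;
    } else {
        opCtx.tenantIdToDelete = std::nullopt;
    }
    if (opCtx.tenantIdToDelete) {
        ru.onCommitHooks.push_back([tenantId = *opCtx.tenantIdToDelete, &out] {
            out.removedBlockers.push_back(tenantId);
        });
    }
}

// Models one TTL batch: both deletes share one opCtx and one recovery unit.
Outcome runTtlBatch(bool safe) {
    FakeOpCtx opCtx;
    FakeRecoveryUnit ru;
    Outcome out;
    const std::vector<StateDoc> expired = {{"T1", true}, {"T2", false}};
    for (const auto& doc : expired) {
        if (safe) {
            onDeleteSafe(opCtx, ru, doc, out);
        } else {
            onDeleteUnsafe(opCtx, ru, doc, out);
        }
    }
    ru.commit();
    return out;
}
```

      Calling runTtlBatch(false) reproduces steps 3) and 4): the hook registered for T1 runs after the T2 delete has cleared the decoration. Calling runTtlBatch(true) removes only T1's blocker, because the hook no longer depends on state that a later delete in the same batch can overwrite.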

            Assignee:
            christopher.caplinger@mongodb.com Christopher Caplinger
            Reporter:
            suganthi.mani@mongodb.com Suganthi Mani