- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- None
- Fully Compatible
- Server Serverless 2022-06-27
- 143
PM-2227 made the TTL monitor perform batch deletes. As a result, we might crash the system while accessing the uninitialized boost::optional `tenantIdToDeleteDecoration` value (an opCtx decoration) from the on-commit hook of TenantMigrationRecipientOpObserver::onDelete(). Consider the scenario below.
Assume we started 2 migrations, for tenants T1 and T2, with donor replica set rs0 and recipient replica set rs1.
1) Migration T1 committed successfully, and its R state doc was marked for garbage collection with its expiry time set to timestamp TS1. This migration should have an R access blocker installed for T1.
2) Migration T2 is still in progress when the cloud decides to abort it, and the R primary receives the recipientForgetMigration cmd before the recipientSyncData cmd. As a result, no recipient access blocker is created for migration T2. The R state doc for this migration is also marked for garbage collection, with its expiry time also set to timestamp TS1.
3) When the TTL monitor scans the recipient state doc collection for expired documents, it finds 2 documents to delete. It batches them, performing both deletes in a single recovery unit using the same opCtx. Assume the state docs are deleted in this order:
i) Delete the state doc for T1: this sets `tenantIdToDeleteDecoration` on the opCtx to T1 and registers the on-commit hook that deletes T1's R access blocker, as part of the tenant recipient op observer implementation.
ii) Delete the state doc for T2: this sets `tenantIdToDeleteDecoration` on the opCtx to boost::none and does not register an on-commit hook, since there is no R tenant access blocker for T2.
4) When the recovery unit of the TTL batch deletion commits, T1's on-commit hook runs and reads `tenantIdToDeleteDecoration`, which step ii) has reset to boost::none. Accessing the uninitialized boost::optional value triggers an invariant failure and crashes the system.
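The sequence in steps 3) and 4) can be modeled with a small standalone sketch. This is not the server code: `OpCtx`, `deleteStateDocBuggy`, `deleteStateDocSafe`, `commit`, and `batchHitsInvariant` are hypothetical names, `std::optional` stands in for boost::optional, and a thrown exception stands in for the invariant failure. The "safe" variant, which captures the tenant id by value in the hook, is only one possible way to avoid the problem and is not necessarily the fix that was adopted.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an operation context shared by every delete in a batch.
struct OpCtx {
    std::optional<std::string> tenantIdToDelete;  // models tenantIdToDeleteDecoration
    std::vector<std::function<void()>> onCommit;  // hooks run when the recovery unit commits
    std::vector<std::string> removedBlockers;     // tenants whose access blocker was removed
};

// Models the problematic pattern: the hook reads the shared decoration at commit time.
void deleteStateDocBuggy(OpCtx& opCtx, const std::string& tenantId, bool hasAccessBlocker) {
    if (!hasAccessBlocker) {
        opCtx.tenantIdToDelete = std::nullopt;  // overwrites the value set by an earlier delete
        return;                                 // no hook registered
    }
    opCtx.tenantIdToDelete = tenantId;
    opCtx.onCommit.push_back([&opCtx] {
        if (!opCtx.tenantIdToDelete) {
            throw std::logic_error("invariant failure: decoration is unset");
        }
        opCtx.removedBlockers.push_back(*opCtx.tenantIdToDelete);
    });
}

// Assumed mitigation: the hook owns a copy of the tenant id instead of reading shared state.
void deleteStateDocSafe(OpCtx& opCtx, const std::string& tenantId, bool hasAccessBlocker) {
    if (!hasAccessBlocker) {
        return;
    }
    opCtx.onCommit.push_back([&opCtx, tenantId] {
        opCtx.removedBlockers.push_back(tenantId);
    });
}

// Runs every registered hook, as committing the recovery unit would.
void commit(OpCtx& opCtx) {
    for (auto& hook : opCtx.onCommit) {
        assert(hook);  // every registered hook must be callable
        hook();
    }
    opCtx.onCommit.clear();
}

// Replays the batch "delete T1, then delete T2" and reports whether commit hit the invariant.
bool batchHitsInvariant(bool useSafeVariant) {
    OpCtx opCtx;
    auto del = useSafeVariant ? deleteStateDocSafe : deleteStateDocBuggy;
    del(opCtx, "T1", /*hasAccessBlocker=*/true);
    del(opCtx, "T2", /*hasAccessBlocker=*/false);
    try {
        commit(opCtx);
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```

In this model, `batchHitsInvariant(false)` reproduces the failure because T2's delete clears the value that T1's hook later dereferences, while `batchHitsInvariant(true)` completes because each hook no longer depends on state shared across the batch.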
is related to
- SERVER-63040 Batch TTL deletions (Closed)
- SERVER-67322 Update the stale comment in TenantMigrationRecipientOpObserver::aboutToDelete() (Closed)