- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Service Arch
- Fully Compatible
- v6.1
- Service Arch 2022-05-30, Service Arch 2022-06-13, Service Arch 2022-06-27, Service Arch 2022-07-11, Service Arch 2022-07-25, Service Arch 2023-02-20, Service Arch 2023-03-06, Service Arch 2023-05-01, Service Arch 2023-05-15, Service Arch 2023-05-29, Service Arch 2023-06-12
When stepup completes, each Primary-Only Service's _state is set to kRebuilding.
Both PrimaryOnlyService::lookupInstance and PrimaryOnlyService::getOrCreateInstance wait until the _rebuildCV condition variable is notified and the _state is no longer kRebuilding.
_rebuildCV is notified in PrimaryOnlyService::_rebuildInstances, which on stepup is scheduled to run asynchronously.
If stepdown occurs before _rebuildInstances starts (e.g. if stepdown occurs here), _rebuildCV may never be notified. Any thread blocked in lookupInstance or getOrCreateInstance that is not interrupted by stepdown will then block indefinitely.
Currently, there is an invariant in lookupInstance that the thread is guaranteed to be interrupted by stepdown. Otherwise, if the thread is holding the RSTL lock, the thread would prevent the stepdown from completing, leading to a deadlock.
It would be better to notify _rebuildCV here, so that threads cannot block indefinitely in lookupInstance or getOrCreateInstance.
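The race and the suggested fix can be modelled outside the server with a plain mutex and condition variable. This is only a sketch under stated assumptions: MiniService, waitForRebuild, and stepDownBeforeRebuild are hypothetical names, the kPaused state used on stepdown is an assumption of the model, and the real PrimaryOnlyService uses interruptible waits on an OperationContext rather than std::condition_variable.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified stand-in for PrimaryOnlyService; not the real MongoDB class.
class MiniService {
public:
    enum class State { kPaused, kRebuilding, kRunning };

    // Stepup: mark the service as rebuilding. In the real service the rebuild
    // is then scheduled asynchronously; here the caller decides whether it runs.
    void onStepUp() {
        std::lock_guard<std::mutex> lk(_mutex);
        _state = State::kRebuilding;
    }

    // The asynchronous rebuild: leaves kRebuilding and wakes all waiters.
    void rebuildInstances() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            if (_state != State::kRebuilding)
                return;  // stepped down before the rebuild started
            _state = State::kRunning;
        }
        _rebuildCV.notify_all();
    }

    // Stepdown: leave kRebuilding and, as the suggested fix, notify waiters.
    void onStepDown() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _state = State::kPaused;
        }
        // Without this call, a thread already waiting could sleep forever
        // when rebuildInstances() never gets to run.
        _rebuildCV.notify_all();
    }

    // Models the wait in lookupInstance/getOrCreateInstance: blocks while the
    // state is kRebuilding and returns the state observed afterwards.
    State waitForRebuild() {
        std::unique_lock<std::mutex> lk(_mutex);
        _rebuildCV.wait(lk, [this] { return _state != State::kRebuilding; });
        return _state;
    }

private:
    std::mutex _mutex;
    std::condition_variable _rebuildCV;
    State _state = State::kPaused;
};

// Reproduces the problematic ordering: stepup, a waiter blocks, then stepdown
// happens and the rebuild never runs. With the notify in onStepDown the
// waiter wakes up and the join below returns.
MiniService::State stepDownBeforeRebuild() {
    MiniService svc;
    svc.onStepUp();
    MiniService::State seen = MiniService::State::kRebuilding;
    std::thread waiter([&] { seen = svc.waitForRebuild(); });
    svc.onStepDown();  // rebuildInstances() is never called
    waiter.join();
    return seen;
}
```

A unit test along the lines of stepDownBeforeRebuild() would hang at join() if the notify_all() in onStepDown were removed and the waiter had already started waiting, which is the behaviour described above.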
Acceptance criteria:
- Reproduce issue in unit test
- Fix as suggested
is related to:
- SERVER-61717 Ensure a POS instance remains in the POS map until the instance's run() is complete (Open)
- SERVER-50982 PrimaryOnlyService::lookupInstance should take an OperationContext and use interruptible waits (Closed)
- SERVER-51518 Reconcile how PrimaryOnlyService::lookupInstance() expects opCtx holding locks to be tagged as always interruptible (Closed)
related to:
- SERVER-62682 PrimaryOnlyService Does Not Call _rebuildCV.notify_all() leading to calls to waitForConditionOrInterrupt not being triggered (Closed)
- SERVER-65469 Don't invariant in PrimaryOnlyService::lookupInstance during a lock-free read (Closed)
- SERVER-68438 Fix PrimaryOnlyService race condition with the PrimaryOnlyServiceClientObserver (Closed)
- SERVER-68476 Revert 11ed931625c685b453a5244553f1d97c81b80850 from SERVER-51650 (Closed)
- SERVER-66351 Audit uses of OperationContext::setAlwaysInterruptAtStepDownOrUp (Open)