When we create a DDLCoordinator, a lambda is attached to the getConstructionCompletionFuture.
Since there is no way to chain the lambda onto the executor that runs the promise of the getConstructionCompletionFuture, getInstanceCleanupExecutor() is used as the executor.
However, with this the creation of a DDLCoordinator can survive a stepDown/stepUp cycle, since the cleanup executor is never shut down.
The failure happens when the lambda runs after _status is set to Recovering in ShardingDDLCoordinatorService::_onServiceInitialization() but before we load the coordinators to recover in (the async task created by) ShardingDDLCoordinatorService::_rebuildService. At that point _numCoordinatorsToWait is still 0, as set in ShardingDDLCoordinatorService::_onServiceTermination(), and this invariant fails.
The fix idea is to use the same executor that is provided by repl::PrimaryOnlyService, as that executor is interrupted on every onStepDown, and joined and recreated on every onStepUp.
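To illustrate why the executor's lifetime matters, here is a minimal single-threaded sketch. It is a toy model, not the server code: `ToyExecutor` and `staleRunsAfterStepUp` are hypothetical names, and the assumption that shutting an executor down discards its queued work is a simplification of how the real executor is interrupted.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical toy executor: tasks queue up until runPending() is called.
class ToyExecutor {
public:
    void schedule(std::function<void()> task) {
        if (!_shutdown) {
            _tasks.push_back(std::move(task));
        }
    }

    // Models interruption at step-down (assumption): queued work is discarded.
    void shutdown() {
        _shutdown = true;
        _tasks.clear();
    }

    void runPending() {
        while (!_tasks.empty()) {
            auto task = std::move(_tasks.front());
            _tasks.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> _tasks;
    bool _shutdown = false;
};

// Returns how many lambdas scheduled before the step-down still run
// after the following step-up.
int staleRunsAfterStepUp(bool useLongLivedExecutor) {
    ToyExecutor longLived;                           // never shut down
    auto perTerm = std::make_unique<ToyExecutor>();  // recreated on step-up
    int staleRuns = 0;

    // The creation lambda is scheduled while the node is still primary.
    ToyExecutor& target = useLongLivedExecutor ? longLived : *perTerm;
    target.schedule([&staleRuns] { ++staleRuns; });

    // Step-down: only the per-term executor is interrupted.
    perTerm->shutdown();
    // Step-up: a fresh per-term executor replaces the old one.
    perTerm = std::make_unique<ToyExecutor>();

    // Whatever is still queued now runs in the new term.
    longLived.runPending();
    perTerm->runPending();
    return staleRuns;
}
```

In this model the lambda scheduled on the long-lived executor still runs after the step-up, while the one scheduled on the per-term executor is dropped with the old term.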
Side note: the same issue happens in the completion future as well.
Besides fixing the executor, in ShardingDDLCoordinatorService::_onServiceTermination() we also have to clear _numActiveCoordinatorsPerType and call _recoveredOrCoordinatorCompletedCV.notify_all().
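The termination-side cleanup can be pictured with the following sketch. It is only a toy model: `ToyCoordinatorService`, its methods, the wait predicate, and the `"toyType"` key are all hypothetical and are not taken from the server code. It only shows the general pattern of resetting the per-type counters under the mutex and then notifying every waiter.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for a service that tracks active coordinators per type.
class ToyCoordinatorService {
public:
    void registerCoordinator(const std::string& type) {
        std::lock_guard<std::mutex> lk(_mutex);
        ++_numActiveCoordinatorsPerType[type];
    }

    // Blocks until no coordinator of the given type is tracked (assumed predicate).
    void waitForNoActive(const std::string& type) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [&] {
            auto it = _numActiveCoordinatorsPerType.find(type);
            return it == _numActiveCoordinatorsPerType.end() || it->second == 0;
        });
    }

    // Termination: reset the counters, then wake every waiter so none
    // stays blocked on state that belongs to the previous term.
    void onTermination() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _numActiveCoordinatorsPerType.clear();
        }
        _cv.notify_all();
    }

    std::size_t trackedTypes() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _numActiveCoordinatorsPerType.size();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    std::map<std::string, int> _numActiveCoordinatorsPerType;
};

// Returns true if a blocked waiter is released by onTermination().
bool terminationWakesWaiter() {
    ToyCoordinatorService service;
    service.registerCoordinator("toyType");

    bool woke = false;
    std::thread waiter([&] {
        service.waitForNoActive("toyType");
        woke = true;
    });

    service.onTermination();
    waiter.join();  // join() makes the write to 'woke' visible here
    return woke && service.trackedTypes() == 0;
}
```

Without the `notify_all()` call, a waiter that was already blocked would not re-check its predicate after the counters were cleared.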
Is caused by: SERVER-90330 Creation of DDL coordinator hang indefinetly if executed on secondary node (Closed)