DDL coordinators are created through the ShardingDDLCoordinator::getOrCreate function.
This function internally calls ShardingDDLCoordinator::waitForRecoveryCompletion to wait for the service to complete recovery and reach a stable state before creating a new coordinator. This prevents a new coordinator from acquiring a DDL lock (each DDL coordinator instance acquires its own) before all previously spawned coordinators have been recovered and have acquired their respective DDL locks.
The waitForRecoveryCompletion function waits until the service reaches _state == kRecovered.
If this function is called while the node is a secondary, the state will be kPaused and will not become kRecovered until the node is elected primary again, so the caller stays blocked until then.
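The blocking behaviour described above can be sketched with a small stand-alone model. This is only an illustration, not the actual service code: the wait_sketch namespace, the Service class, setState and the timeout parameter are all hypothetical names introduced here so that the wait can be observed without hanging. Only the state names and waitForRecoveryCompletion come from the description.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

namespace wait_sketch {

// State names taken from the description; everything else is hypothetical.
enum class State { kPaused, kRecovering, kRecovered };

class Service {
public:
    void setState(State s) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _state = s;
        }
        _cv.notify_all();
    }

    // Returns true once _state == kRecovered, false if the timeout expires first.
    // The timeout exists only in this sketch, to make the blocking visible.
    bool waitForRecoveryCompletion(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        return _cv.wait_for(lk, timeout,
                            [this] { return _state == State::kRecovered; });
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    State _state = State::kPaused;  // a secondary sits in kPaused
};

// Usage: on a "secondary" the wait never succeeds; after recovery it does.
inline void demo() {
    Service svc;
    assert(!svc.waitForRecoveryCompletion(std::chrono::milliseconds(10)));
    svc.setState(State::kRecovered);
    assert(svc.waitForRecoveryCompletion(std::chrono::milliseconds(10)));
}

}  // namespace wait_sketch
```

Without a timeout or an interruption mechanism, the first call in demo() would block for as long as the node remains a secondary.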
Looking closely at this code, I spot another issue. Since we are not holding the _state lock, there is no guarantee that the _state of the service will not change back to kRecovering in between:
1. The call to waitForRecoveryCompletion()
2. The actual creation of the coordinator
In fact, after 1. the node could step down (kRecovered -> kPaused) and then step up again (kPaused -> kRecovering) before executing 2.
This second issue is highly improbable because it would require a full stepdown and stepup cycle to complete within a few milliseconds.
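The window between the recovery check and the coordinator creation can be reproduced deterministically in a single-threaded model. The sketch below is an assumption-laden illustration, not the real implementation or the actual fix: the race_sketch namespace, the Service class and all of its methods are hypothetical. It contrasts a creation path that relies on an earlier check with one that re-checks the state under the same lock that guards creation.

```cpp
#include <cassert>
#include <mutex>

namespace race_sketch {

// State names taken from the description; everything else is hypothetical.
enum class State { kPaused, kRecovering, kRecovered };

class Service {
public:
    void stepDown() { std::lock_guard<std::mutex> lk(_mutex); _state = State::kPaused; }
    void stepUp() { std::lock_guard<std::mutex> lk(_mutex); _state = State::kRecovering; }
    void finishRecovery() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_state == State::kRecovering) _state = State::kRecovered;
    }

    // Step 1 of the racy pattern: the lock is released as soon as this returns.
    bool isRecovered() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _state == State::kRecovered;
    }

    // Step 2 of the racy pattern: creates without re-checking the state and
    // counts how often that happens while recovery is not complete.
    void createUnchecked() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_state != State::kRecovered) ++_violations;
    }

    // Alternative: check and create under one lock acquisition, so no state
    // change can slip in between the two.
    bool createIfRecovered() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _state == State::kRecovered;
    }

    int violations() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _violations;
    }

private:
    std::mutex _mutex;
    State _state = State::kPaused;
    int _violations = 0;
};

}  // namespace race_sketch
```

In this model, running isRecovered(), then stepDown() and stepUp(), then createUnchecked() records one creation while the state is kRecovering, whereas createIfRecovered() refuses to create until finishRecovery() has run.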
- causes: SERVER-91247 Ensure that DDLCoordinator creation does not survive node stepDown-stepUp (Closed)
- is duplicated by: SERVER-90628 _shardsvrReshardCollection command doesn't always get interrupted on stepdown (Closed)