For a non-force stepDown, the TopologyCoordinator::tryToStartStepDown() loop in the stepDown code waits for two things:
1. the primary's lastApplied is majority committed, and
2. at least one of the caught-up nodes is electable.
If either of these conditions is not met, we go into the loop body and wait only for (1), lastApplied being majority committed, using the _replicationWaiterList. Waiters in that list are checked only when the optime has advanced for at least one member. I guess the intention of the code is that the majority wait unblocks again whenever the optime of at least one member changes, so we don't need to busy-loop on TopologyCoordinator::tryToStartStepDown() to check condition 2. But this is problematic when all members have caught up (i.e. condition 1 is fully satisfied and no member's optime can advance any further) and we still have to wait for condition 2: in that case the waiter is never rechecked.
We could add a _doneWaitingForReplication_inlock check before adding to the waiter list. This should work because I think it is part of the contract of the _replicationWaiterList that we always check whether the replication wait is already done before adding to the waiter list. Note, though, that this turns the wait for condition 2 into a busy-wait whenever condition 1 is satisfied before condition 2. I think this is probably fine in practice. To make things a little better, before doing the continue we can make the stepdown thread sleep for 10 milliseconds on an interruptible opCtx while not holding the mutex.
Ideally, we should have a different mechanism to wait for nodes to be electable. But it is probably not worth the complexity.
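The proposed loop shape can be sketched in standalone C++ as below. This is only an illustration under stated assumptions, not the server code: StepDownWaitSketch, waitUntilReady, optimeAdvanced and pollCount are hypothetical names, a plain condition variable stands in for the _replicationWaiterList, two booleans stand in for conditions 1 and 2, and a deadline stands in for interruption of the opCtx.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for the stepdown wait loop; none of these names
// exist in the server code base.
struct StepDownWaitSketch {
    std::mutex mutex;
    std::condition_variable optimeAdvanced;  // stand-in for the waiter list being signaled
    bool majorityCaughtUp = false;           // condition 1
    bool electableCaughtUp = false;          // condition 2
    int pollCount = 0;                       // number of 10 ms polls taken

    // Returns true once both conditions hold, false if the deadline passes first.
    bool waitUntilReady(Clock::time_point deadline) {
        std::unique_lock<std::mutex> lk(mutex);
        while (!(majorityCaughtUp && electableCaughtUp)) {
            if (Clock::now() >= deadline) {
                return false;
            }
            if (majorityCaughtUp) {
                // Condition 1 is already done, so no optime advance would ever
                // wake a waiter. Do not block on it; sleep briefly without the
                // mutex and re-check condition 2.
                ++pollCount;
                lk.unlock();
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
                lk.lock();
                continue;
            }
            // Condition 1 is not done yet: block until an optime advances
            // (or the deadline passes), then re-evaluate both conditions.
            assert(lk.owns_lock());
            optimeAdvanced.wait_until(lk, deadline);
        }
        return true;
    }
};
```

The trade-off is the one described above: once condition 1 holds, the thread polls every 10 milliseconds instead of blocking, which costs a few wakeups but avoids waiting on a signal that can no longer arrive.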
Related to: SERVER-35058 Don't only rely on heartbeat to signal secondary positions in stepdown command (Closed)