- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- Fully Compatible
- ALL
- v4.4, v4.2
- Repl 2020-03-23
Both an unconditional stepdown on learning of a higher term and relinquishing primary due to the liveness check can change _leaderMode to kSteppingDown and then unlock the replCoord mutex before continuing the stepdown. A concurrent reconfig may acquire the mutex in that window and call _updateMemberStateFromTopologyCoordinator, which sets canAcceptNonLocalWrites to the result of the topology coordinator's canAcceptWrites():
bool TopologyCoordinator::canAcceptWrites() const { return _leaderMode == LeaderMode::kMaster; }
Since _leaderMode has already been changed, the reconfig thread picks up the half-finished work of the stepdown and sets canAcceptNonLocalWrites to false without holding the RSTL in X mode.
The contract is that canAcceptNonLocalWrites may only be updated while the RSTL is held in X mode. That contract is violated here, which fails an invariant.
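The interleaving described above can be replayed in a small single-threaded model. This is only a sketch: ToyTopologyCoordinator, ToyReplCoordinator, rstlHeldInX, contractViolations and replayRace are hypothetical names invented for illustration, not the server's classes, and the model counts a violation where the real server would fail an invariant.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the real server classes.
enum class LeaderMode { kNotLeader, kMaster, kSteppingDown };

struct ToyTopologyCoordinator {
    LeaderMode leaderMode = LeaderMode::kMaster;
    // Same predicate as the canAcceptWrites() quoted in the ticket.
    bool canAcceptWrites() const { return leaderMode == LeaderMode::kMaster; }
};

struct ToyReplCoordinator {
    ToyTopologyCoordinator topo;
    bool canAcceptNonLocalWrites = true;
    bool rstlHeldInX = false;    // models whether the calling thread holds RSTL X
    int contractViolations = 0;  // the real server fails an invariant instead

    // Models _updateMemberStateFromTopologyCoordinator (write-ability part only).
    void updateMemberStateFromTopology() {
        const bool wanted = topo.canAcceptWrites();
        if (wanted != canAcceptNonLocalWrites) {
            if (!rstlHeldInX) {
                ++contractViolations;  // flag changed outside RSTL X
            }
            canAcceptNonLocalWrites = wanted;
        }
    }
};

// Replays the two steps of the race in program order.
inline ToyReplCoordinator replayRace() {
    ToyReplCoordinator coord;

    // Step 1, stepdown thread (holding the mutex): flip the leader mode, then
    // release the mutex before the RSTL has been acquired in X mode.
    coord.topo.leaderMode = LeaderMode::kSteppingDown;

    // Step 2, reconfig thread (holding the mutex, but not RSTL X): refresh the
    // member state, which now sees canAcceptWrites() == false.
    coord.rstlHeldInX = false;
    coord.updateMemberStateFromTopology();

    return coord;
}
```

Calling replayRace() leaves canAcceptNonLocalWrites false and records one contract violation, which corresponds to the invariant failure in this ticket.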
SERVER-45081 works around this by updating canAcceptNonLocalWrites only when the RSTL is held in X mode, which leaves the update to the stepdown thread.
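The shape of that workaround might look roughly like the following. This is a guess at the idea, not the actual SERVER-45081 patch: GuardedCoordinator, Mode and the callerHoldsRstlX parameter are hypothetical, and the real code would determine lock ownership from the operation's lock state rather than from a boolean argument.

```cpp
#include <cassert>

enum class Mode { kNotLeader, kMaster, kSteppingDown };

// Hypothetical stand-in for the replication coordinator.
struct GuardedCoordinator {
    Mode leaderMode = Mode::kMaster;
    bool canAcceptNonLocalWrites = true;

    bool topologyCanAcceptWrites() const { return leaderMode == Mode::kMaster; }

    // Returns true if the flag was actually changed by this call.
    // callerHoldsRstlX is an assumption of this sketch.
    bool updateMemberState(bool callerHoldsRstlX) {
        const bool wanted = topologyCanAcceptWrites();
        if (wanted == canAcceptNonLocalWrites) {
            return false;  // nothing to do
        }
        if (!callerHoldsRstlX) {
            return false;  // leave the update to the thread that holds RSTL X
        }
        canAcceptNonLocalWrites = wanted;
        return true;
    }
};
```

With this guard a reconfig that runs in the window skips the flag, and the stepdown thread applies the change once it holds the RSTL in X mode. The flag is stale in between, which is why the ticket treats this as a workaround rather than a fix.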
There are several ways to fix the issue more holistically:
- Move the update of readWriteAbility out of _updateMemberStateFromTopologyCoordinator, so that the update is only performed when the ability actually changes.
- Don't change _leaderMode to kSteppingDown before acquiring the RSTL. This would require rethinking the concurrency of stepdown.
The concurrency rule for _updateMemberStateFromTopologyCoordinator is that whenever topology coordinator state that the function depends on changes, the function should be called within the same lock scope. This issue violates that rule.
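A minimal sketch of that rule, combined with the idea of not changing the leader mode before the RSTL is held: the state change and the refresh of the derived flag happen in one lock scope, so no other thread can observe the half-finished transition. Everything here is hypothetical (ScopedCoordinator, LeaderState, a plain std::mutex standing in for the RSTL in X mode, and the lock order), and it does not reflect how stepdown is actually structured in the server.

```cpp
#include <cassert>
#include <mutex>

enum class LeaderState { kNotLeader, kMaster, kSteppingDown };

class ScopedCoordinator {
public:
    // Stepdown takes the RSTL stand-in first, then the mutex, and only then
    // changes the leader state. Each change is followed by a refresh in the
    // same lock scope.
    void stepDown() {
        std::lock_guard<std::mutex> rstlX(_rstl);  // stand-in for RSTL in X mode
        std::lock_guard<std::mutex> lk(_mutex);    // stand-in for the replCoord mutex
        _rstlHeldInX = true;
        _leaderState = LeaderState::kSteppingDown;
        _updateMemberState(lk);
        _leaderState = LeaderState::kNotLeader;
        _updateMemberState(lk);
        _rstlHeldInX = false;
    }

    // Reconfig holds only the mutex. Because state and derived flag always
    // change together, its refresh never has to touch the flag.
    void reconfig() {
        std::lock_guard<std::mutex> lk(_mutex);
        ++_configVersion;
        _updateMemberState(lk);
    }

    bool canAcceptNonLocalWrites() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _canAcceptNonLocalWrites;
    }

    int contractViolations() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _contractViolations;
    }

private:
    // The lock_guard parameter documents that the caller must hold _mutex.
    void _updateMemberState(const std::lock_guard<std::mutex>&) {
        const bool wanted = (_leaderState == LeaderState::kMaster);
        if (wanted == _canAcceptNonLocalWrites) {
            return;
        }
        if (!_rstlHeldInX) {
            ++_contractViolations;  // would be an invariant failure in the server
        }
        _canAcceptNonLocalWrites = wanted;
    }

    std::mutex _rstl;
    std::mutex _mutex;
    LeaderState _leaderState = LeaderState::kMaster;
    bool _canAcceptNonLocalWrites = true;
    bool _rstlHeldInX = false;
    int _configVersion = 1;
    int _contractViolations = 0;
};
```

In this model a reconfig can run before or after stepDown() without ever changing the flag itself, so contractViolations() stays at zero. The cost is that stepdown must wait for the RSTL before it can mark itself as stepping down, which is the concurrency question the ticket says would need rethinking.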