in progress...
Summary
The SDAM spec specifies that the RSM uses the tuple { setVersion, electionId }, compared in that order, to detect stale primaries. The motivation is that if the protocol version changes (as happened in 3.2.0), electionId values might not be directly comparable, whereas setVersion is guaranteed to increment. Details: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#using-setversion-and-electionid-to-detect-stale-primaries
The problem is that if a failover happens before the former primary has reached consensus on a setVersion increment, the new primary reports a lower setVersion together with a higher electionId. The existing SDAM logic treats this new primary as stale, which leads to a full cluster outage and requires manual intervention. Details in SERVER-59409.
Drawback: if a future protocol version change is not backward compatible and makes electionId non-monotonic, an additional contingency plan will be required.
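The failure mode can be illustrated with a minimal sketch. It is not the spec's pseudocode or any driver's implementation: the names `PrimaryReport`, `is_stale_set_version_first`, and `is_stale_election_id_first` are hypothetical, electionId is modeled as a plain integer rather than an ObjectId, and handling of missing fields or server-version gating is left out. It only contrasts the existing comparison order with the electionId-first order named in the linked tickets.

```python
from typing import NamedTuple


class PrimaryReport(NamedTuple):
    """Hypothetical stand-in for the fields a primary reports."""
    set_version: int
    election_id: int  # assumption: an int stands in for the ObjectId; only ordering matters


def is_stale_set_version_first(known: PrimaryReport, reported: PrimaryReport) -> bool:
    """Existing order: compare (setVersion, electionId) lexicographically."""
    return (reported.set_version, reported.election_id) < (
        known.set_version,
        known.election_id,
    )


def is_stale_election_id_first(known: PrimaryReport, reported: PrimaryReport) -> bool:
    """Proposed order: compare (electionId, setVersion) lexicographically."""
    return (reported.election_id, reported.set_version) < (
        known.election_id,
        known.set_version,
    )


# The former primary bumped setVersion to 2 but failed over before the
# reconfig reached consensus; the new primary still has setVersion 1
# and a newer electionId.
known_max = PrimaryReport(set_version=2, election_id=1)
new_primary = PrimaryReport(set_version=1, election_id=2)

print(is_stale_set_version_first(known_max, new_primary))  # True: new primary rejected
print(is_stale_election_id_first(known_max, new_primary))  # False: new primary accepted
```

With setVersion compared first, the legitimately elected primary is discarded as stale and the monitor is left without a primary. With electionId compared first, it is accepted.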
Tests: the SDAM tests in the server repository have been updated at head to match the new behavior: https://github.com/mongodb/mongo/tree/master/src/mongo/client/sdam/json_tests/sdam_tests
Motivation
Who is the affected end user?
Who are the stakeholders? Drivers team, server teams.
How does this affect the end user?
Full cluster outage is possible.
How likely is it that this problem or use case will occur?
It happens in tests all the time.
If the problem does occur, what are the consequences and how severe are they?
Outage.
Is this issue urgent?
Not urgent but high priority.
Is this ticket required by a downstream team?
TBD, might be just normal upgrade path.
Is this ticket only for tests?
No.
- depends on
  - SERVER-59409 Race between reconfig replication and stepup can cause RSM to be stuck in reporting ReplicaSetNoPrimary (Closed)
  - DRIVERS-2196 Sync SDAM tests from mongo server repository (Closed)
- duplicates
  - DRIVERS-2412 SDAM should prioritize electionId over setVersion only on >=6.0 servers (Implementing)
- is related to
  - DRIVERS-2412 SDAM should prioritize electionId over setVersion only on >=6.0 servers (Implementing)
- split to
  - CDRIVER-4203 SDAM should give priority to electionId over setVersion when updating topology (Closed)
  - CSHARP-3934 SDAM should give priority to electionId over setVersion when updating topology (Closed)
  - CXX-2404 SDAM should give priority to electionId over setVersion when updating topology (Closed)
  - GODRIVER-2207 SDAM should give priority to electionId over setVersion when updating topology (Closed)
  - MOTOR-847 SDAM should give priority to electionId over setVersion when updating topology (Closed)
  - NODE-3712 SDAM should give priority to electionId over setVersion when updating topology (Closed)
  - PHPC-2068 SDAM should give priority to electionId over setVersion when updating topology (Closed)
  - PYTHON-2970 SDAM should give priority to electionId over setVersion when updating topology (Closed)
  - RUBY-2829 SDAM should give priority to electionId over setVersion when updating topology (Closed)
  - RUST-1081 SDAM should give priority to electionId over setVersion when updating topology (Closed)
  - JAVA-4375 SDAM should give priority to electionId over setVersion when updating topology (Closed)