-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
Replication
During testing we learned that a forced replica set reconfig takes substantially longer to complete than anticipated when some nodes are down. Two factors contribute:
- the replSetReconfig command connects to every node in series (described in SERVER-16824)
- as a result of the timeout added in SERVER-16818, isSelf can take up to 30 seconds to give up connecting to a node that is unreachable (https://github.com/mongodb/mongo/blob/1b3b0073a0b436a8a502b612f24fb2bd572772e5/src/mongo/db/repl/isself.cpp#L230)
We'd like to tune that 30-second value by making it configurable through the replSetReconfig command.
In our use case, this command is run when we are certain the affected nodes are in a network partition, so timing out after 2-5 seconds is more reasonable. In large clusters, the 30-second timeout can add up to several minutes (perhaps several tens of minutes for clusters with many non-voting secondaries).
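To make the estimate concrete, here is a rough back-of-the-envelope sketch. It assumes the simplest possible model: each unreachable node is tried one after another and each attempt waits for the full connect timeout. The function name and the node count of 10 are illustrative only and are not taken from the server code.

```python
def worst_case_serial_seconds(unreachable_nodes: int, connect_timeout_s: float) -> float:
    """Upper bound on time spent waiting when unreachable nodes are tried in series.

    Assumption: every unreachable node costs one full connect timeout and
    reachable nodes cost nothing. Real behaviour may differ.
    """
    if unreachable_nodes < 0 or connect_timeout_s < 0:
        raise ValueError("arguments must be non-negative")
    return unreachable_nodes * connect_timeout_s


if __name__ == "__main__":
    # Compare the current 30 s value with the 2-5 s range suggested above,
    # for a hypothetical partition that hides 10 nodes.
    for timeout_s in (30, 5, 2):
        total = worst_case_serial_seconds(10, timeout_s)
        print(f"{timeout_s}s timeout, 10 unreachable nodes: {total / 60:.1f} min")
```

Under this model the delay grows linearly with the number of unreachable nodes, which is why both a shorter timeout and the concurrent approach in SERVER-16824 would help.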
Perhaps alongside the maxTimeMS parameter, there could be a connectTimeoutMs parameter?
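As a sketch of what the request might look like from a client, the snippet below builds a command document with the proposed option next to the existing ones. Note that connectTimeoutMs is only the name suggested in this ticket; it does not exist in the server, and the helper name, replica set name, and hostnames are made up for illustration. The snippet only builds a dictionary and does not contact a server.

```python
from typing import Any, Dict, Optional


def build_force_reconfig_command(
    config: Dict[str, Any],
    max_time_ms: Optional[int] = None,
    connect_timeout_ms: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a forced replSetReconfig command document.

    The command name is inserted first because the first key of a command
    document identifies the command (Python dicts keep insertion order).
    """
    command: Dict[str, Any] = {"replSetReconfig": config, "force": True}
    if max_time_ms is not None:
        command["maxTimeMS"] = max_time_ms
    if connect_timeout_ms is not None:
        # HYPOTHETICAL: proposed in this ticket, not an existing server option.
        command["connectTimeoutMs"] = connect_timeout_ms
    return command


if __name__ == "__main__":
    # Illustrative config; the set name and hostnames are placeholders.
    new_config = {
        "_id": "rs0",
        "version": 2,
        "members": [
            {"_id": 0, "host": "node0.example.net:27017"},
            {"_id": 1, "host": "node1.example.net:27017"},
        ],
    }
    print(build_force_reconfig_command(new_config, max_time_ms=60000, connect_timeout_ms=2000))
```

Keeping the new option separate from maxTimeMS would let callers bound the overall command time and the per-node connection wait independently.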
- related to
-
SERVER-66050 findSelfInConfig should attempt fast path for every HostAndPort before trying slow path
- Closed
-
SERVER-16824 Run isSelf concurrently for all members
- Backlog