Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Replication
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

During testing, we learned that a forced replica set reconfig takes substantially longer to complete than anticipated when nodes are down.

The replSetReconfig command connects to every node in series (described in SERVER-16824).
As a result of the timeout added in ~~SERVER-16818~~, isSelf can take up to 30 seconds to give up connecting to an unreachable node (https://github.com/mongodb/mongo/blob/1b3b0073a0b436a8a502b612f24fb2bd572772e5/src/mongo/db/repl/isself.cpp#L230).

We'd like to tune that 30-second value by making it configurable through the replSetReconfig command.

In our use, this command is run when we are certain the affected nodes are in a network partition, so a timeout closer to 2-5 seconds is more reasonable. In large clusters, the 30-second timeout per unreachable node could add up to several minutes (perhaps several tens of minutes for clusters with many non-voting secondaries).
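
To make the arithmetic concrete, here is a rough sketch of the worst-case wait, assuming each unreachable node is probed one after another and each probe runs to the full timeout. The node counts are illustrative, not measurements from a real cluster, and the function name is made up for this example.

```python
def worst_case_reconfig_seconds(unreachable_nodes: int, connect_timeout_s: float) -> float:
    """Upper bound on time spent waiting when unreachable nodes are probed serially.

    Assumes every probe of an unreachable node runs to the full timeout.
    """
    if unreachable_nodes < 0 or connect_timeout_s < 0:
        raise ValueError("inputs must be non-negative")
    return unreachable_nodes * connect_timeout_s


if __name__ == "__main__":
    # Illustrative node counts only: a small and a large number of unreachable members.
    for timeout_s in (30, 5, 2):
        for down in (3, 20):
            total = worst_case_reconfig_seconds(down, timeout_s)
            print(f"{down} unreachable x {timeout_s}s timeout = {total:.0f}s ({total / 60:.1f} min)")
```

Under these assumptions, the wait scales linearly with both the timeout and the number of unreachable members, which is why lowering the timeout helps even if the probes stay serial.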

Perhaps, alongside the maxTimeMS parameter, there could be a connectTimeoutMs parameter?
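
As a sketch of what the request might look like from a client, the helper below builds the command document for a forced reconfig. The `replSetReconfig`, `force`, and `maxTimeMS` fields exist today; `connectTimeoutMs` is only the parameter proposed in this ticket and is not an existing server option. The helper name, replica set name, and host are hypothetical.

```python
from typing import Any, Dict, Optional


def build_force_reconfig_command(
    new_config: Dict[str, Any],
    max_time_ms: Optional[int] = None,
    connect_timeout_ms: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a forced replSetReconfig command document.

    The command name must be the first key; Python dicts keep insertion order.
    """
    cmd: Dict[str, Any] = {"replSetReconfig": new_config, "force": True}
    if max_time_ms is not None:
        cmd["maxTimeMS"] = max_time_ms
    if connect_timeout_ms is not None:
        # HYPOTHETICAL: proposed in this ticket, not accepted by any released server.
        cmd["connectTimeoutMs"] = connect_timeout_ms
    return cmd


if __name__ == "__main__":
    # Hypothetical config with a single surviving member.
    config = {
        "_id": "rs0",
        "version": 2,
        "members": [{"_id": 0, "host": "node0.example.net:27017"}],
    }
    print(build_force_reconfig_command(config, max_time_ms=60000, connect_timeout_ms=3000))
```

With a driver such as PyMongo, a document like this would typically be passed to `client.admin.command(cmd)`; a server without the proposed option would presumably reject or ignore the extra field.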

related to

SERVER-66050 findSelfInConfig should attempt fast path for every HostAndPort before trying slow path (Closed)
SERVER-16824 Run isSelf concurrently for all members (Backlog)

Assignee: [DO NOT USE] Backlog - Replication Team
Reporter: Jack Wearden
Participants: [DO NOT USE] Backlog - Replication Team, Jack Wearden, Opal Hoyt, Xuerui Fa
Votes: 0
Watchers: 7

Created: Jan 18 2023 10:58:44 AM UTC
Updated: Jan 30 2023 07:12:34 PM UTC
