- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Testing Infrastructure
- Fully Compatible
- v4.0, v3.6
- TIG 2018-07-02, TIG 2018-07-16
- 27
- 2
The electionTimeoutMillis parameter for the ContinuousStepdown hook, used in the concurrency stepdown suites, is set to 5000 (5 seconds). We should increase this, per the discussion captured below:
> > On 2018/05/30 22:09:12, maxh wrote:
> > > [note] As mentioned in SERVER-34666, I don't think we should shorten the
> > > election timeout as it can lead to an election happening that isn't
> > > initiated by the StepdownThread due to heartbeats being delayed. I'm okay
> > > with keeping it as-is for now because it is consistent with the replica set
> > > configuration the JavaScript version would have used; however, I'd like for
> > > there to be a follow-up SERVER ticket to change it.
> > >
> > > https://jira.mongodb.org/browse/SERVER-34666?focusedCommentId=1873407&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1873407
> >
> > For the followup ticket, do we just want to remove this value and use the
> > default, or set it to a higher timeout?
>
> I'm not sure - I'd like to get some input from Judah on it. I'm currently
> wondering if we really need to avoid setting the election timeout to 24 hours
> when all_nodes_electable=true. We're going to use the replSetStepUp command in
> the Python version of the StepdownThread to cause one of the secondaries to
> run for election anyway. If for some reason the replSetStepUp command fails,
> then the former primary will try and step back up after 10 seconds on its own
> anyway.
>
> https://github.com/mongodb/mongo/blob/r4.1.0/buildscripts/resmokelib/testing/fixtures/replicaset.py#L149-L154

If you only want elections to come from the StepdownThread, then I'd recommend
setting the election timeout to 24 hours. The replSetStepUp command should still
work, and if it fails for some reason, then no other node will try to run for
election. There's no real difference between the default 10 seconds and the
current 5 seconds except for the amount of flakiness you'd expect (not the
existence of flakiness that we're trying to remove completely).
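
To make the recommendation above concrete, here is a minimal Python sketch, not the actual resmoke change. It assumes the replica set config is a plain dict shaped like the replSetGetConfig document and that elections are requested through a pymongo-style client connected directly to a secondary. The helper names (with_election_timeout, step_up) and the constant ONE_DAY_MILLIS are hypothetical and exist only for illustration.

```python
# Hypothetical helpers for illustration; not the resmoke implementation.

# 24 hours expressed in milliseconds (86,400,000 ms).
ONE_DAY_MILLIS = 24 * 60 * 60 * 1000


def with_election_timeout(repl_config, timeout_millis=ONE_DAY_MILLIS):
    """Return a copy of a replica set config dict with settings.electionTimeoutMillis set.

    The input dict is left unmodified.
    """
    new_config = dict(repl_config)
    settings = dict(new_config.get("settings", {}))
    settings["electionTimeoutMillis"] = timeout_millis
    new_config["settings"] = settings
    return new_config


def step_up(secondary_client):
    """Ask one secondary to run for election via the replSetStepUp command.

    Assumption: `secondary_client` behaves like a pymongo.MongoClient that is
    connected directly to the chosen secondary.
    """
    return secondary_client.admin.command("replSetStepUp")
```

With a 24-hour timeout, no node should call an election on its own because of delayed heartbeats, so under these assumptions the only elections come from explicit replSetStepUp calls like the one in step_up.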
- causes
  - SERVER-36817 replSetFreeze command run by stepdown thread may fail when server is already primary (Closed)
- is related to
  - SERVER-30642 Raise election timeouts as a way to provide more stable replica set test topologies (Closed)
- related to
  - SERVER-36448 Disable election handoff in suites that use the ContinuousStepdown hook (Closed)
  - SERVER-36451 ContinuousStepdown with killing nodes can hang due to not being able to start the primary (Closed)