-
Type: Bug
-
Resolution: Done
-
Priority: Major - P3
-
Affects Version/s: 2.4.4
-
Component/s: Replication
-
None
-
Environment: Tested on Linux amd64, Ubuntu (10.04 Lucid and 13.04 Raring)
-
Fully Compatible
-
ALL
-
With a healthy replset, failovers via rs.stepDown() take 10+s,
even when the election completes in 1-2s.
After an rs.stepDown(), a healthy, moderately loaded replset on a
good network will normally complete an election for a new
primary within 1-2s. However, before the new primary can
step up, it has to stop the background sync thread.
Suppose (as is often the case) that the new primary was
syncing from the stepped-down node. When that node stepped down,
it closed all connections, including the connection the new
primary was using to read the oplog. This triggered a DBException
inside _producerThread, which runs this code:
sethbmsg(str::stream() << "db exception in producer: " << e.toString());
sleepsecs(10);
When the new primary goes to stop the rsBackgroundSync thread,
the thread is caught in the middle of that sleep, so we end
up literally waiting on a sleepsecs(10) before the cluster has a
new primary.
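For illustration only, here is a minimal standalone sketch of one way to make such a back-off interruptible, so that stopping the thread does not have to wait out the full 10 seconds. It is not the fix applied in the server, and the names InterruptibleBackoff, waitFor, stop and stopLatencyMillis are hypothetical, not taken from the MongoDB source. It assumes a C++11 compiler and uses only the standard library.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper (not MongoDB code): a back-off wait that another
// thread can cut short, unlike a plain sleep.
class InterruptibleBackoff {
public:
    // Waits up to `timeout`. Returns true if stop() was called before the
    // timeout expired, false if the full back-off elapsed.
    bool waitFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(_mutex);
        return _cv.wait_for(lock, timeout, [this] { return _stopRequested; });
    }

    // Requests a stop and wakes any thread currently inside waitFor().
    void stop() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _stopRequested = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _stopRequested = false;
};

// Simulates the scenario in this ticket: a producer thread hits an error and
// starts a 10 s back-off, then another thread asks it to stop. Returns how
// long the stop took, in milliseconds.
long long stopLatencyMillis() {
    using namespace std::chrono;
    InterruptibleBackoff backoff;
    bool interrupted = false;

    std::thread producer([&] {
        // Stand-in for the sleep in the exception handler.
        interrupted = backoff.waitFor(seconds(10));
    });

    // Give the producer time to enter its back-off.
    std::this_thread::sleep_for(milliseconds(100));

    const auto start = steady_clock::now();
    backoff.stop();
    producer.join();
    const auto elapsed = duration_cast<milliseconds>(steady_clock::now() - start);

    assert(interrupted);  // the wait ended because of stop(), not the timeout
    return elapsed.count();
}
```

In this sketch, calling stopLatencyMillis() should return well under one second, because the waiting thread is woken as soon as stop() is called. With a plain sleep in the same place, the caller would block until the back-off expired.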
- is duplicated by
-
SERVER-9944 assumePrimary() waits 10 seconds (for nothing?)
- Closed
-
SERVER-9464 On election, relax priority restriction when another member is fresher
- Closed
- related to
-
SERVER-10225 Replica set failover speed improvement
- Closed