- Type: Improvement
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 2.6.7, 3.0.4
- Component/s: Replication
- Fully Compatible
- v3.4
- Repl 2017-10-02
We just encountered a situation where all secondaries in two of our replica sets had stopped replicating and were 1-2 days behind the primary. This appears to have been caused in part by the initial oplog query from secondary to primary timing out after 30 seconds, while the query itself takes more than 5 minutes to run. Some searching led me to SERVER-6733, where the timeout was reduced from 10 minutes to 30 seconds.
As a workaround, we are building a custom binary with an increased oplog timeout so that the initial oplog query can complete and our secondaries have a chance to catch up.
Ideally, this value would be configurable with a flag or configuration option to avoid the need to recompile, and to allow users to customize the timeout for their particular situation.
We have a fairly large oplog:
> db.printReplicationInfo()
configured oplog size: 143477.3826171875MB
log length start to end: 1620689secs (450.19hrs)
oplog first event time: Wed Jul 08 2015 23:11:24 GMT+0000 (UTC)
oplog last event time: Mon Jul 27 2015 17:22:53 GMT+0000 (UTC)
now: Mon Jul 27 2015 17:22:53 GMT+0000 (UTC)
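As a sanity check on the figures above, the following Python sketch recomputes the oplog window from the first and last event times and estimates how far back the secondaries' query timestamp sits. The query timestamp is taken from the sample log lines in this ticket, on the assumption that the logged `Timestamp 1437813467000|94` value is in milliseconds; all variable names are illustrative.

```python
from datetime import datetime, timezone

# Event times copied from the db.printReplicationInfo() output above (UTC).
first_event = datetime(2015, 7, 8, 23, 11, 24, tzinfo=timezone.utc)
last_event = datetime(2015, 7, 27, 17, 22, 53, tzinfo=timezone.utc)

# Oplog window: time span covered between the oldest and newest entries.
window_secs = int((last_event - first_event).total_seconds())
window_hours = round(window_secs / 3600, 2)

# Assumption: the timestamp in the sample log lines is logged in milliseconds.
query_ts = datetime.fromtimestamp(1437813467000 / 1000, tz=timezone.utc)

# How far behind the newest oplog entry the secondaries' query starts.
lag_hours = (last_event - query_ts).total_seconds() / 3600

print(f"oplog window: {window_secs} secs ({window_hours} hrs)")
print(f"query starts at {query_ts.isoformat()}, {lag_hours:.1f} hrs behind")
```

Under these assumptions the window matches the reported 1620689 seconds (450.19 hours), and the query start point is about 56.8 hours behind the newest oplog entry.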
Here are some sample queries issued by the secondaries that are timing out:
Mon Jul 27 16:32:44.469 [conn5987144] query local.oplog.rs query: { ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368021807027379 ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 numYields:33130 locks(micros) r:38390680 nreturned:101 reslen:25310 1361497ms
Mon Jul 27 16:32:45.037 [conn5987146] query local.oplog.rs query: { ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368020207769978 ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 numYields:33131 locks(micros) r:38186447 nreturned:101 reslen:25310 1362020ms
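To put those log lines in perspective, the Python sketch below extracts the trailing duration from each slow-query entry and compares it with the 30-second timeout described in this ticket. The regular expression, the `parse_duration_ms` and `timeout_ratio` helpers, and the `OPLOG_TIMEOUT_MS` constant are illustrative names, not part of MongoDB, and the parser assumes the log format shown here, where each entry ends in `<n>ms`.

```python
import re

# Timeout for the initial oplog query, per the ticket description (30 seconds).
OPLOG_TIMEOUT_MS = 30 * 1000

# Assumes each slow-query entry ends with its duration, e.g. "1361497ms".
DURATION_RE = re.compile(r"(\d+)ms\s*$")

# The two sample entries quoted in this ticket.
LOG_LINES = [
    "Mon Jul 27 16:32:44.469 [conn5987144] query local.oplog.rs query: "
    "{ ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368021807027379 "
    "ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 "
    "keyUpdates:0 numYields:33130 locks(micros) r:38390680 nreturned:101 "
    "reslen:25310 1361497ms",
    "Mon Jul 27 16:32:45.037 [conn5987146] query local.oplog.rs query: "
    "{ ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368020207769978 "
    "ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 "
    "keyUpdates:0 numYields:33131 locks(micros) r:38186447 nreturned:101 "
    "reslen:25310 1362020ms",
]


def parse_duration_ms(log_line):
    """Return the trailing duration in milliseconds, or None if absent."""
    match = DURATION_RE.search(log_line)
    return int(match.group(1)) if match else None


def timeout_ratio(duration_ms, timeout_ms=OPLOG_TIMEOUT_MS):
    """Return how many times longer than the timeout the query ran."""
    return duration_ms / timeout_ms


for line in LOG_LINES:
    duration = parse_duration_ms(line)
    if duration is not None and duration > OPLOG_TIMEOUT_MS:
        minutes = duration / 60000
        ratio = timeout_ratio(duration)
        print(f"{minutes:.1f} min, {ratio:.1f}x the timeout")
```

Both sample queries ran for roughly 22.7 minutes, about 45 times the 30-second timeout, so a secondary issuing this query could not receive a result before timing out.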
- is duplicated by
  - SERVER-27952 Replication fails under heavy load - Oplog timeout should be configurable (Closed)
- is related to
  - SERVER-6733 Make oplog timeout shorter (Closed)
  - SERVER-26106 Raise oplog socket timeout for rollback (Closed)
- related to
  - SERVER-28005 Oplog query network timeout is less than the maxTimeMs (Closed)
  - SERVER-38973 Allow configuration of timeouts for getMores on oplog for replication (Backlog)