- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- None
- Fully Compatible
- ALL
- v4.0, v3.6
- Storage NYC 2018-07-16, Storage NYC 2018-07-30
- (copied to CRM)
- 39
During oplog replay in initial sync, the oldest timestamp is only advanced at the end of each batch. This means that all dirty content generated by applying the operations in a single batch is pinned in cache until the batch completes. If the batch is large enough and the operations are heavy enough, this dirty content can exceed eviction_dirty_trigger (by default 20% of cache), and the rate of applying operations becomes dramatically slower because application has to wait for the dirty data to drop back below the threshold.
In extreme cases the node can become completely stuck: the full cache prevents the batch from completing, and the batch cannot complete to unpin the very data that is keeping the cache full (although I'm not sure whether that is a necessary consequence of this issue or a failure of the lookaside mechanism to keep the node from getting completely stuck).
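For illustration, here is a minimal, self-contained C++ sketch of the pattern described above. It is not MongoDB or WiredTiger code; the types (OplogEntry, CacheSim, applyOplogBatch) and the sizes are hypothetical. It only shows the ordering that causes the problem: because the oldest timestamp is advanced only after the whole batch is applied, the dirty content from every operation in the batch stays pinned, and a large batch can push pinned dirty data past the 20% eviction_dirty_trigger.

```cpp
// Illustrative sketch only -- these types and functions are hypothetical
// stand-ins, not MongoDB or WiredTiger APIs. Every operation in the batch is
// applied (dirtying cache) before the oldest timestamp advances, so none of
// that dirty content can be evicted until the whole batch finishes.
#include <cstdint>
#include <iostream>
#include <vector>

struct OplogEntry {
    uint64_t timestamp;
    size_t dirtyBytes;  // how much cache the operation dirties (assumed)
};

struct CacheSim {
    size_t cacheSize;
    size_t pinnedDirty = 0;   // dirty bytes pinned behind the oldest timestamp
    bool overTrigger = false;

    void applyOperation(const OplogEntry& op) {
        pinnedDirty += op.dirtyBytes;
        // Mirrors eviction_dirty_trigger (default 20% of the cache): once the
        // pinned dirty content crosses it, appliers stall waiting on eviction,
        // but eviction cannot release anything pinned behind the timestamp.
        if (!overTrigger && pinnedDirty > cacheSize / 5) {
            overTrigger = true;
            std::cout << "pinned dirty content (" << pinnedDirty
                      << " bytes) exceeds 20% of cache; apply rate collapses\n";
        }
    }

    void setOldestTimestamp(uint64_t /*ts*/) {
        // Advancing the oldest timestamp unpins the batch's dirty content,
        // letting eviction write it out.
        pinnedDirty = 0;
        overTrigger = false;
    }
};

// Initial-sync style replay: the oldest timestamp moves only at batch end.
void applyOplogBatch(CacheSim& cache, const std::vector<OplogEntry>& batch) {
    for (const auto& op : batch) {
        cache.applyOperation(op);  // dirty content accumulates, all pinned
    }
    if (!batch.empty()) {
        cache.setOldestTimestamp(batch.back().timestamp);  // only now unpinned
    }
}

int main() {
    CacheSim cache{/*cacheSize=*/100 * 1024 * 1024};  // 100 MiB cache
    // 5000 ops dirtying 16 KiB each is ~78 MiB pinned, far past the 20 MiB trigger.
    std::vector<OplogEntry> bigBatch(5000, OplogEntry{1, 16 * 1024});
    applyOplogBatch(cache, bigBatch);
    return 0;
}
```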
This is similar to SERVER-34938, but I believe oplog application during initial sync is a different codepath from normal replication. If not, feel free to close as a dup.
- is related to
  - SERVER-34900 initial sync uses different batch limits from steady state replication (Closed)
  - SERVER-34938 Secondary slowdown or hang due to content pinned in cache by single oplog batch (Closed)
  - SERVER-36496 Cache pressure issues during oplog replay in initial sync (Closed)
- related to
  - SERVER-33191 Cache-full hangs on 3.6 (Closed)
  - SERVER-36238 replica set startup fails in wt_cache_full.js, initial_sync_wt_cache_full.js, recovery_wt_cache_full.js when journaling is disabled (Closed)