In diagnosing the root cause for WT-6681, we observed very high cache usage coincident with running checkpoints. In some instances, cache usage spiked to ~433% of the configured cache size. Our initial analysis shows that checkpointing pages that do not belong to the history store (HS) can generate considerable HS content. Because the HS file is only reconciled at the end of the checkpoint, and there is no cache size check when inserting new HS content, cache usage can spike during a checkpoint. A few points to be worked on for this ticket:
1 - What is the role of the WT_SESSION_IGNORE_CACHE_SIZE flag in this scenario?
2 - WT-6681 described a heuristic that prioritises HS pages for eviction and helped bring cache usage down to ~135%. A valid question is why the existing heuristics, which were designed to prioritise eviction for cache-dominating files, didn't help.
3 - As of now, we never fail a checkpoint. But how do we manage cases where a checkpoint cannot continue because the cache is full?
4 - Can we evict HS pages while checkpoint is running? If so, what are the restrictions (e.g., write gen)?
5 - Can we improve the urgent eviction mechanism for this scenario?
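
To make the failure mode described above concrete, here is a toy byte-accounting model. It is only a sketch and not WiredTiger code: `ToyCache`, `run_checkpoint`, `peak_percent`, and all sizes are hypothetical names and made-up numbers chosen for illustration. It assumes each checkpointed page adds a fixed amount of HS content and that HS content leaves the cache only when it is reconciled at the end of the checkpoint.

```python
from dataclasses import dataclass


@dataclass
class ToyCache:
    """Toy byte-accounting model; NOT WiredTiger's actual cache."""
    configured_bytes: int
    data_bytes: int = 0   # bytes held by non-HS pages
    hs_bytes: int = 0     # bytes held by HS content
    peak_bytes: int = 0   # highest usage seen so far

    def __post_init__(self) -> None:
        self.peak_bytes = self.used()

    def used(self) -> int:
        return self.data_bytes + self.hs_bytes

    def insert_hs(self, nbytes: int, enforce_limit: bool) -> bool:
        """Add HS content; optionally refuse if it would exceed the limit."""
        if enforce_limit and self.used() + nbytes > self.configured_bytes:
            return False
        self.hs_bytes += nbytes
        self.peak_bytes = max(self.peak_bytes, self.used())
        return True

    def reconcile_hs(self) -> None:
        """Assumption: reconciling the HS file frees all HS bytes."""
        self.hs_bytes = 0


def run_checkpoint(cache: ToyCache, dirty_pages: int,
                   hs_bytes_per_page: int, enforce_limit: bool) -> int:
    """Walk the dirty pages and return how many HS inserts were refused."""
    refused = 0
    for _ in range(dirty_pages):
        if not cache.insert_hs(hs_bytes_per_page, enforce_limit):
            refused += 1
    # HS content is only reconciled once, at the end of the checkpoint.
    cache.reconcile_hs()
    return refused


def peak_percent(cache: ToyCache) -> float:
    """Peak usage as a percentage of the configured cache size."""
    return 100.0 * cache.peak_bytes / cache.configured_bytes
```

With a 1000-byte cache holding 800 bytes of data, checkpointing 10 pages at 100 HS bytes each without a size check peaks at 180% of the configured size. With the check enabled, the peak stays at 100% but 8 of the 10 inserts are refused, which is the situation point 3 asks about: the checkpoint cannot make progress unless something is evicted.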
depends on
- SERVER-53708 Excess memory usage during shutdown in durable history tests (Closed)
is related to
- WT-6681 Rapid history store growth compared with lookaside in 4.2 (Closed)
related to
- WT-7106 Increase how often delta encoding is used for history store records (Closed)
- WT-7168 History store ignores cache size during update heavy workload (Closed)
- WT-7190 Limit eviction of non-history store pages when checkpoint is operating on history store (Closed)
- WT-7096 Improve the mechanism that collects cache usage stats for the history store (Backlog)