- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Storage Engines
This ticket came out of an issue we saw with logkeeper, where checkpoint cleanup was causing long checkpoints and stalls. agorrod, haribabu.kommi and I identified the following areas for improvement:
- Limit the amount of I/O that history cleanup generates during a checkpoint. This could be done by spreading the work across several checkpoints instead of walking the whole tree at each one. Potentially we could save the position of the walk and resume from it at the next checkpoint.
- The internal pages that the checkpoint reads for cleanup are also queued for urgent eviction. Since a checkpoint can load all the internal pages in a tree, the intention is to avoid thrashing the cache with pages outside the working set. On the other hand, these pages have to be read again for the next checkpoint. We can evaluate whether it would make sense to keep these pages in cache instead of evicting them.
- We could also add heuristics to reduce the amount of work needed to find obsolete content. Instead of revisiting each internal page at every checkpoint, these heuristics would guide us where to look.
agorrod and haribabu.kommi, please feel free to add to or edit this list.
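
To make the first and third ideas more concrete, here is a minimal sketch of a budgeted, resumable cleanup walk. It is not WiredTiger code: `page_t`, `cleanup_state_t`, `cleanup_walk`, the `may_have_obsolete` hint and the flat array standing in for the tree's internal pages are all hypothetical names and simplifications chosen for illustration. The sketch assumes a per-checkpoint budget of pages to examine and a saved position that survives between checkpoints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an internal page; not a WiredTiger structure. */
typedef struct {
    bool may_have_obsolete; /* heuristic hint: page may hold obsolete content */
    int cleaned_count;      /* how many times cleanup did work on this page */
} page_t;

/* Hypothetical cleanup state that is kept between checkpoints. */
typedef struct {
    size_t resume_pos; /* index of the next page to examine */
} cleanup_state_t;

/*
 * Examine at most `budget` pages, starting where the previous call stopped
 * and wrapping around at the end of the array. Pages without the hint set
 * are examined but not cleaned. Returns the number of pages cleaned.
 */
size_t cleanup_walk(page_t *pages, size_t npages, cleanup_state_t *st, size_t budget)
{
    size_t cleaned = 0;

    if (npages == 0)
        return 0;
    assert(st->resume_pos < npages);

    /* Never examine the same page twice within one call. */
    size_t limit = budget < npages ? budget : npages;

    for (size_t examined = 0; examined < limit; examined++) {
        page_t *p = &pages[st->resume_pos];

        if (p->may_have_obsolete) {
            p->cleaned_count++;           /* placeholder for the real cleanup */
            p->may_have_obsolete = false; /* nothing left to clean for now */
            cleaned++;
        }

        /* Save the position so the next checkpoint resumes from here. */
        st->resume_pos = (st->resume_pos + 1) % npages;
    }
    return cleaned;
}
```

Each checkpoint would call `cleanup_walk` once with a fixed budget, so the whole tree is covered over several checkpoints instead of in a single one. How the hint would be maintained, and whether the budget should count pages or bytes read, are open questions this sketch does not try to answer.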