https://docs.google.com/document/d/1zPurErldtRGkl9COOM8jv_R_SAkfn4KW2Zd0temb71I/edit
- Our streams memory tracking undercounts by 33%. This undercounting can lead to pod/OS “out of memory” errors if we’re not careful.
- As part of this work we should extend the testing to other common memory-intensive pipelines. We might be missing other important allocation sites. Our “above the allocator” approach to memory tracking is hard to make fully accurate. We can also consider multiplying the MemoryTracker numbers by 1.X to account for undercounting.
- One important aspect of this issue: for an $unwind that duplicates strings, we save memory by reference counting. However, when restoring from a checkpoint we duplicate the string memory. This scenario is (in my opinion) not worth optimizing in checkpointing… but I’ve been able to cause a pod/OS OOM with this sort of pipeline during checkpoint restore. We should identify why the MemoryTracker is not catching this scenario.
See the attached results.json[0]["heapProfileBeforeCheckpoint"] for the stacks reported by the heap profiler. Guessing a bit, but we might be undercounting in the string allocation stack here:
"0": "tcmalloc::tcmalloc_internal::SampleifyAllocation<>()",
"1": "slow_alloc<>()",
"2": "mongo::ValueStorage::putString()",
"3": "mongo::ExpressionConcat::evaluate()",
"4": "mongo::projection_executor::ProjectionNode::applyExpressions()",
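To make the $unwind / checkpoint-restore gap concrete, here is a toy model of the scenario described above. It is only a sketch: `Field`, `unwindShared`, `restoreFromCheckpoint`, and `distinctStringBytes` are hypothetical names, and `std::shared_ptr` stands in for whatever reference counting the real value storage uses. It does not reflect the actual MemoryTracker or checkpoint code.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a document field that holds a string value.
struct Field {
    std::shared_ptr<const std::string> str;
};

// Models $unwind over an n-element array: every output document points at
// the same reference-counted string buffer, so the payload exists once.
std::vector<Field> unwindShared(const std::string& value, std::size_t n) {
    auto shared = std::make_shared<const std::string>(value);
    return std::vector<Field>(n, Field{shared});
}

// Models a checkpoint restore (assumption: each document is rebuilt
// independently), so every document gets its own copy of the string.
std::vector<Field> restoreFromCheckpoint(const std::vector<Field>& docs) {
    std::vector<Field> out;
    out.reserve(docs.size());
    for (const auto& d : docs) {
        out.push_back(Field{std::make_shared<const std::string>(*d.str)});
    }
    return out;
}

// Sums string payload bytes, counting each distinct buffer only once.
std::size_t distinctStringBytes(const std::vector<Field>& docs) {
    std::unordered_set<const std::string*> seen;
    std::size_t total = 0;
    for (const auto& d : docs) {
        if (seen.insert(d.str.get()).second) {
            total += d.str->size();
        }
    }
    return total;
}
```

In this model, a 1 KiB string unwound into 100 documents holds 1024 payload bytes before the checkpoint and 102400 bytes after the restore. A tracker that charges the restored state at the pre-checkpoint figure would miss almost all of that memory.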