- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- StorEng - Refinement Pipeline
It would be useful for some server testing to be able to have reproducible behavior from random cursors, i.e., to run the same test on the same data and have a random cursor return the same results.
One particular use case is cardinality estimation. As I understand it, query planning sometimes wants to estimate the number (or fraction) of entries in a collection that match some predicate. It does this by sampling the collection with a random cursor and estimating based on the fraction of the sample that matches the predicate.
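As a rough sketch of that use case, the snippet below draws a fixed number of records from a WiredTiger random cursor and reports the fraction matching a predicate. The table name ("table:sample"), string value format, predicate, and sample size are assumptions for illustration, not anything specified in this ticket.

{code:c}
/*
 * Illustrative sketch: estimate selectivity by sampling SAMPLE_SIZE records
 * from a random cursor. Table name, value format, and predicate are
 * assumptions; error handling is minimal.
 */
#include <string.h>
#include <wiredtiger.h>

#define SAMPLE_SIZE 1000

static double
estimate_selectivity(WT_SESSION *session)
{
    WT_CURSOR *cursor;
    const char *value;
    int i, matched;

    /* next_random=true asks the cursor to return records in random order. */
    if (session->open_cursor(
          session, "table:sample", NULL, "next_random=true", &cursor) != 0)
        return (-1.0);

    matched = 0;
    for (i = 0; i < SAMPLE_SIZE; ++i) {
        if (cursor->next(cursor) != 0)
            break;
        cursor->get_value(cursor, &value);
        if (strncmp(value, "match-me", 8) == 0) /* hypothetical predicate */
            ++matched;
    }
    cursor->close(cursor);

    /* Estimated fraction of the collection satisfying the predicate. */
    return (i == 0 ? 0.0 : (double)matched / (double)i);
}
{code}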
Implementing this is harder than just providing the same random seed to different random cursors. The behavior of a random cursor depends on both the random numbers and the state of the tree – the number of inserted values (which changes dynamically), the shape of the tree, and which pages are in memory and which are on disk.
For the proposed use case, it sounds like it would be OK to assume the tree isn't changing. When using a repeatable random cursor, we could ensure that we make the same decisions regardless of which pages are cached.
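To make the shape of the request concrete, here is what a repeatable random cursor might look like from the caller's side. The "next_random_seed" configuration string is purely an assumption standing in for whatever knob this ticket would add; it is not claimed to be existing API.

{code:c}
/*
 * Illustrative only: two cursors opened with the same (hypothetical) seed
 * against an unchanging table would be expected to return the same sequence
 * of records, regardless of which pages happen to be cached.
 */
#include <wiredtiger.h>

static int
open_seeded_random_cursors(WT_SESSION *session, WT_CURSOR **c1p, WT_CURSOR **c2p)
{
    int ret;

    if ((ret = session->open_cursor(session, "table:sample", NULL,
           "next_random=true,next_random_seed=42", c1p)) != 0)
        return (ret);
    return (session->open_cursor(session, "table:sample", NULL,
        "next_random=true,next_random_seed=42", c2p));
}
{code}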
The hard part would be ensuring that we have the same tree shape. I.e., if a test inserts N keys and then uses a random cursor to sample those keys, how do we make sure that a different test run that inserts the same N keys into a table gets the same behavior? The resulting tree contents will be the same, but the tree shape is likely to differ because it depends on non-deterministic things such as when checkpoints run and when pages split. Possibly we could handle this by further restricting the usage to tables where all of the inserts happened via bulk load, but I don't know whether that would be too constraining.
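For reference, a test could build its table through a bulk cursor so that every run produces the same tree shape, roughly as below. Bulk cursors require an empty table and keys appended in sorted order; the table name, formats, and key/value generation are assumptions for illustration.

{code:c}
/*
 * Sketch: populate the test table with a bulk cursor so the resulting tree
 * shape is the same on every run. Table name, formats, and key/value
 * generation are assumptions.
 */
#include <stdio.h>
#include <wiredtiger.h>

static int
bulk_load(WT_SESSION *session, int nkeys)
{
    WT_CURSOR *cursor;
    char key[32], value[32];
    int i, ret;

    if ((ret = session->create(session, "table:sample",
           "key_format=S,value_format=S")) != 0)
        return (ret);
    if ((ret = session->open_cursor(
           session, "table:sample", NULL, "bulk", &cursor)) != 0)
        return (ret);

    /* Keys must be appended in sorted order for a bulk cursor. */
    for (i = 0; i < nkeys; ++i) {
        snprintf(key, sizeof(key), "key-%010d", i);
        snprintf(value, sizeof(value), "value-%d", i);
        cursor->set_key(cursor, key);
        cursor->set_value(cursor, value);
        if ((ret = cursor->insert(cursor)) != 0)
            break;
    }
    return (cursor->close(cursor) != 0 ? -1 : ret);
}
{code}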
This request came from a discussion with anton.korshunov@mongodb.com and the EMEA query team.
- is depended on by: SERVER-78764 Revisit sampling CE scan-from-start in performance variants (Closed)