  Core Server / SERVER-22815

Investigate if there is a better cutoff for an optimized $sample than 5% of the collection

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: None
    • Component/s: Aggregation Framework, Query Optimization
    • Backwards Compatibility: Fully Compatible

      When SERVER-19182 was implemented, we chose 5% as the cutoff for when we will switch from the optimized $sampleFromRandomCursor to the normal $sample implementation.
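
      For reference, the stage in question is invoked like this (a minimal sketch using PyMongo; the connection string and namespace are placeholders, not part of this ticket):

{code:python}
from pymongo import MongoClient

# Placeholder connection and namespace; substitute your own.
client = MongoClient("mongodb://localhost:27017")
coll = client["test"]["docs"]

# Request 100 pseudo-random documents. Whether this runs as the
# optimized $sampleFromRandomCursor or the scan-and-sort fallback
# depends on whether 100 exceeds 5% of the collection's document count.
docs = list(coll.aggregate([{"$sample": {"size": 100}}]))
{code}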

      The $sampleFromRandomCursor implementation performs repeated random walks over a tree (in currently supported storage engines), whereas the $sample implementation performs a full collection scan followed by a top-k sort on an injected random value.
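
      Conceptually, the fallback behaves like the following sketch (plain Python over an arbitrary iterable of documents, not the actual server code): tag each document with a random key during the scan and keep only the k largest keys in a bounded heap.

{code:python}
import heapq
import random

def top_k_random_sample(docs, k):
    """Scan docs once, tagging each with a random key, and keep the
    k documents with the largest keys -- a top-k sort on an injected
    random value, which yields a uniform random sample of size k."""
    heap = []  # min-heap of (random_key, index, doc); index breaks ties
    for i, doc in enumerate(docs):
        key = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (key, i, doc))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i, doc))
    return [doc for _, _, doc in heap]

print(top_k_random_sample(({"_id": n} for n in range(1000)), 5))
{code}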

      It is thought that the $sample implementation will be faster past a certain threshold percentage. This is because a collection scan likely has a data access pattern of large sequential reads, whereas the random tree walks perform many random point accesses. Especially on spinning disks, the former becomes more appealing as you sample a larger and larger percentage of the collection.
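
      To make that intuition concrete, here is a back-of-envelope crossover calculation. Every device and collection number below is an assumption for illustration (a commodity spinning disk), not a measurement: the scan is modeled as pure sequential throughput, and each randomly sampled document is charged one random read.

{code:python}
# Hypothetical device and collection parameters -- illustration only.
SEQ_THROUGHPUT = 150e6      # bytes/sec sequential read (spinning disk)
RANDOM_READ_LATENCY = 8e-3  # sec per random point read (seek + rotate)
NUM_DOCS = 10_000_000
AVG_DOC_SIZE = 1024         # bytes

scan_secs = NUM_DOCS * AVG_DOC_SIZE / SEQ_THROUGHPUT

def random_walk_secs(sample_fraction):
    # Assumes one random read per sampled document, ignoring caching.
    return NUM_DOCS * sample_fraction * RANDOM_READ_LATENCY

# Fraction at which the two strategies cost the same:
crossover = scan_secs / (NUM_DOCS * RANDOM_READ_LATENCY)
print(f"full scan: {scan_secs:.1f}s")
print(f"random walk at 5%: {random_walk_secs(0.05):.1f}s")
print(f"crossover fraction: {crossover:.4%}")
{code}

      Under these (crude) assumptions the crossover sits well below 5%; with SSDs or a warm cache the picture changes substantially, which is exactly the sensitivity the benchmarking should probe.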

      We should do some benchmarking to see whether 5% is a good cutoff for a variety of setups. It will likely depend on at least the following factors (a rough harness sketch follows the list):

      • Storage engine
      • Type of disk
      • Amount of memory
      • Number of documents in the collection
      • Size of documents
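
      A harness along these lines could sweep the sample fraction for a given collection; this is a rough sketch assuming a locally running mongod and a hypothetical benchmark namespace, not an existing test suite. Document count and size are parameters here; storage engine, disk type, and memory would be varied externally.

{code:python}
import time
from pymongo import MongoClient

# Assumed local server and hypothetical namespace; adjust as needed.
coll = MongoClient("mongodb://localhost:27017")["bench"]["sample_test"]

def time_sample(fraction, num_docs):
    """Time one $sample aggregation for the given fraction of num_docs."""
    size = max(1, int(num_docs * fraction))
    start = time.monotonic()
    for _ in coll.aggregate([{"$sample": {"size": size}}]):
        pass  # drain the cursor so the work actually happens
    return time.monotonic() - start

NUM_DOCS = 1_000_000
coll.drop()
coll.insert_many({"_id": i, "pad": "x" * 512} for i in range(NUM_DOCS))

for fraction in (0.01, 0.02, 0.05, 0.10, 0.20):
    print(f"{fraction:5.0%}: {time_sample(fraction, NUM_DOCS):7.3f}s")
{code}

      Repeating such a sweep across storage engines, disks, memory sizes, and document shapes (and with the cache cold versus warm) would map out where the crossover actually lands.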

      It may be very hard to find a single number suited to all combinations, but there may well be a better choice than 5%.

            Assignee: backlog-query-optimization ([DO NOT USE] Backlog - Query Optimization)
            Reporter: Charlie Swanson (charlie.swanson@mongodb.com)
            Votes: 0
            Watchers: 8
