Core Server / SERVER-27637

Merging phase of a distributed $sample should recognize input streams are already sorted

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Aggregation Framework
    • Fully Compatible
    • ALL
    • Query 2017-08-21

      When issuing an aggregation that starts with a $sample of size N on a sharded collection, we split the $sample into two parts: first, gather a sample of size N on each shard; second, merge those samples into a final sample that potentially includes documents from each shard. The first part can be achieved by either (1) doing a full sort of all documents based on injected random values, then taking the top N, or (2) doing repeated random cursor walks over an index until we get N unique documents. In approach (2) we inject random values into the documents after the fact, in such a way that the output documents are still in decreasing order of random value.
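
      Approach (1) can be illustrated with a minimal Python sketch. It is not the server's implementation: the function name sample_shard, the (random value, document) tuple layout, and the seeded generator are assumptions made for this example only.

```python
import heapq
import random


def sample_shard(docs, n, rng):
    """Hypothetical model of approach (1): tag every document with a
    random value, then keep the n documents with the largest values."""
    tagged = [(rng.random(), doc) for doc in docs]
    # heapq.nlargest returns its result ordered from largest to smallest
    # key, so the shard's output is already in decreasing random order.
    return heapq.nlargest(n, tagged, key=lambda pair: pair[0])


rng = random.Random(0)  # seeded only to make the example repeatable
shard_output = sample_shard([{"_id": i} for i in range(10)], 3, rng)
rand_vals = [rand_val for rand_val, _ in shard_output]
```

      The property that matters for the merging phase is that rand_vals is non-increasing, whichever approach produced the shard's sample.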

      In either case, each shard outputs its documents in order of a random metadata field, so the results only need to be combined in a "merge sorted streams" style. This makes the merging half of the $sample stage equivalent to the merging half of a $sort stage, so when splitting a $sample stage we generate a {$sample: {size: N}} to run on all the shards, and a {$sort: {sortKey: {$computed0: {$meta: "randVal"}}}} to run on the merging shard. This $sort stage should also include the "mergingPresorted" option, so that it can take advantage of the fact that the inputs are already sorted and avoid the need to spill to disk.
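
      The "merge sorted streams" idea can be sketched in Python as follows. This illustrates the technique only, not the server code: merge_presorted, the plain randVal field on each document, and the sample data are assumptions for the example. Because each input is already in decreasing randVal order, the merge holds one pending document per stream and never needs a full sort.

```python
import heapq
from itertools import islice


def merge_presorted(shard_streams, n):
    """Hypothetical merging phase: lazily merge streams that are each
    sorted by decreasing randVal and keep only the first n documents."""
    merged = heapq.merge(
        *shard_streams,
        key=lambda doc: doc["randVal"],
        reverse=True,  # inputs are sorted from largest to smallest
    )
    return list(islice(merged, n))


shard_a = [{"_id": 1, "randVal": 0.9}, {"_id": 2, "randVal": 0.4}]
shard_b = [{"_id": 3, "randVal": 0.7}, {"_id": 4, "randVal": 0.1}]
result = merge_presorted([shard_a, shard_b], 3)
```

      Here result contains the documents with _id 1, 3 and 2, in that order. A merge that ignored the existing order would have to buffer and sort all of its input, which is what can trigger the disk-use requirement.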

      The net impact is that today a $sample issued against a sharded collection can fail with an error telling the user to pass 'allowDiskUse: true' to perform the sort on the merging shard, even though no disk use is required. Taking advantage of the already-sorted streams should also speed up all distributed $sample operations.

            Assignee: Bernard Gorman (bernard.gorman@mongodb.com)
            Reporter: Charlie Swanson (charlie.swanson@mongodb.com)
            Votes: 1
            Watchers: 8
