Currently hardcoded to 10,000 for MongoDB < 3.2 but for newer versions it can sample the whole collection. For large collections this is slow and inefficient. Allow a limit to be set before sampling the data and make it configurable so users can further reduce the cost of schema inference.
Behavior
$sample uses one of two methods to obtain N random documents, depending on the size of the collection, the size of N, and $sample’s position in the pipeline.If all the following conditions are met, $sample uses a pseudo-random cursor to select documents:
- $sample is the first stage of the pipeline
- N is less than 5% of the total documents in the collection
- The collection contains more than 100 documents
If any of the above conditions are NOT met, $sample performs a collection scan followed by a random sort to select N documents. In this case, the $sample stage is subject to the sort memory restrictions.