Core Server / SERVER-41891

$group on sorted cursor forces full collection scan

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 4.0.6
    • Component/s: Aggregation Framework
    • Labels: None
    • Operating System: ALL
    • Environment: mongodb-osx-x86_64-4.0.6

      I have a collection with ~2 million documents. Each logical document is uniquely identified by a 'date' field (a business date, not the upload date). I implemented a historicization scheme by storing multiple documents with the same 'date' field and, within a date, picking the one with the highest _id (which is always auto-generated on insert).

      Collection unique index: [(date, -1), (_id, -1)]
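
      To make the setup concrete, here is a minimal pymongo sketch of the index and of two versions of one date; the database and collection names ('test', 'versions') and the 'payload' field are illustrative, not part of the report:

      from datetime import datetime

      from pymongo import MongoClient, DESCENDING

      client = MongoClient()
      coll = client.test.versions  # illustrative names

      # The compound index described above: newest date first, then newest _id.
      coll.create_index([("date", DESCENDING), ("_id", DESCENDING)], unique=True)

      # Two versions of the same business date; the second insert gets the
      # higher auto-generated _id, so it is the "latest version" of that date.
      coll.insert_one({"date": datetime(2019, 6, 10), "payload": "v1"})
      coll.insert_one({"date": datetime(2019, 6, 10), "payload": "v2"})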

      A typical aggregation pipeline where I fetch the latest version of any number of documents:

      [
          {"$match": {"date": DATE_MATCH}},
          {"$sort": SON([("date", -1), ("_id", -1)])},
          {"$group": {"_id": {"date": "$date"}, "doc": {"$first": "$$ROOT"}}},
          {"$replaceRoot": {"newRoot": "$doc"}},
      ]
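
      For reference, a minimal way to run this pipeline from pymongo, reusing coll from the sketch above; the DATE_MATCH value is illustrative:

      from datetime import datetime

      from bson import SON

      DATE_MATCH = datetime(2019, 6, 10)  # illustrative value

      pipeline = [
          {"$match": {"date": DATE_MATCH}},
          {"$sort": SON([("date", -1), ("_id", -1)])},
          {"$group": {"_id": {"date": "$date"}, "doc": {"$first": "$$ROOT"}}},
          {"$replaceRoot": {"newRoot": "$doc"}},
      ]
      latest = list(coll.aggregate(pipeline))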

      The above works very well, as long as only a handful of documents survive the initial $match step.

      But then the query becomes "get the latest version (highest _id) of the 5 newest documents (at most 5 distinct dates)" older than DATE_MAX, with no constraint on the minimum date:

      [
          {"$match": {"date": {"$lt": DATE_MAX}}},
          {"$sort": SON([("date", -1), ("_id", -1)])},
          {"$group": {"_id": {"date": "$date"}, "doc": {"$first": "$$ROOT"}}},
          {"$replaceRoot": {"newRoot": "$doc"}},
          {"$sort": SON([("date", -1)])},
          {"$limit": 5},
      ]
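
      To see what plan feeds $group, one can ask the server via the explain command; a sketch, with client, db, and DATE_MAX illustrative as in the sketches above:

      from datetime import datetime

      from bson import SON

      DATE_MAX = datetime(2019, 6, 15)  # illustrative value

      slow_pipeline = [
          {"$match": {"date": {"$lt": DATE_MAX}}},
          {"$sort": SON([("date", -1), ("_id", -1)])},
          {"$group": {"_id": {"date": "$date"}, "doc": {"$first": "$$ROOT"}}},
          {"$replaceRoot": {"newRoot": "$doc"}},
          {"$sort": SON([("date", -1)])},
          {"$limit": 5},
      ]

      db = client.test  # illustrative, as in the first sketch
      explain = db.command(SON([
          ("explain", SON([
              ("aggregate", "versions"),
              ("pipeline", slow_pipeline),
              ("cursor", {}),
          ])),
      ]))
      # The winning plan of the $cursor stage shows the index scan feeding
      # $group; nothing in the plan bounds how much of it gets consumed.
      print(explain)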

      The above functionally does the job, but it is 100x slower than just

      [
          {"$match": {"date": {"$lt": DATE_MAX}}},
          {"$sort": SON([("date", -1), ("_id", -1)])},
          {"$limit": 5},
      ]

      which, however, produces incorrect results, as it can return multiple versions of the same document (with two versions of the newest date, for example, both fall within the first 5 sorted entries).

      The problem is that the initial $match can let thousands, if not millions, of documents through to the cursor - literally everything from the 1980s onwards, which I do want picked up if nothing newer is available.

      One would assume that, just like $sort, $group automatically notices that its input is already sorted exactly the way it needs it, grouping the results as they arrive and releasing RAM as soon as the aggregation key changes. That would mean that as soon as 5 unique dates reach the $limit step, the whole pipeline can be terminated early (the client-side sketch below illustrates the idea).
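
      For illustration, here is that streaming behaviour expressed client-side with itertools.groupby over the already-sorted cursor; this is a sketch of the idea (and a possible workaround), not of what the server actually does. coll and DATE_MAX are illustrative, as in the sketches above:

      from datetime import datetime
      from itertools import groupby, islice

      DATE_MAX = datetime(2019, 6, 15)  # illustrative value

      # The cursor yields documents sorted by (date desc, _id desc), so all
      # versions of a given date arrive consecutively and only one group
      # needs to be held in memory at a time.
      cursor = coll.find({"date": {"$lt": DATE_MAX}}).sort(
          [("date", -1), ("_id", -1)]
      )

      def latest_versions(docs):
          for _date, versions in groupby(docs, key=lambda d: d["date"]):
              yield next(versions)  # highest _id within the date comes first

      # Stop as soon as 5 distinct dates have been seen; the rest of the
      # cursor is never pulled, which bounds the work.
      newest_five = list(islice(latest_versions(cursor), 5))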

      That is not the case - the timings clearly show that $group is doing a full scan of the whole collection. Note the second $sort; it is necessary because $group emits dates in ascending order, even though the $sort step before it yields them in descending order.

       

      Assignee: Eric Sedor (eric.sedor@mongodb.com)
      Reporter: Guido Imperiale (gimperiale)
      Votes: 0
      Watchers: 5
