Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-19402

Change semantics of sorting by array fields in find and aggregate

    • Type: Icon: Improvement Improvement
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 3.5.11
    • Affects Version/s: None
    • Component/s: Querying
    • None
    • Minor Change
    • Query 2017-06-19, Query 2017-07-10, Query 2017-07-31

      Issue Status as of July 21, 2017

      ISSUE SUMMARY
      In versions of MongoDB prior to 3.5.11, sorting by a field containing an array suffered from multiple issues:

      • The sort orders for aggregate and find differed in the presence of arrays.
      • The sort order for find depended on the query predicate.
      • The sort order for find depended on the implementation details of the query planner.
      • The sort order for aggregate depended on the set of indexes available.

      For more information, see the Pre-existing Behavior sections below.

      In versions 3.5.11 and newer, the array sort semantics for find and aggregate have been changed to correct these semantic deficiencies, and to be consistent with one another. The new behavior is to sort documents according to the lowest-valued element of the array for ascending sorts, and according to the highest-valued element of the array for descending sorts. For a more detailed description of this behavior, see the Technical Details section below.

      USER IMPACT
      This is a minor breaking change. Applications that currently sort by an array field will experience a different sort order.

      • As a consequence of this change, it is no longer correct for a multikey index to be used to provide a sort on an array field.
      • In such cases, the query plan will instead include a blocking SORT stage.
      • For indexes that have path-level multikey metadata, as described in SERVER-15086, a multikey index will still be able to provide a sort as long as the sorted-on fields never contain an array.
      • Indexes built on the WiredTiger or In-memory storage engines on versions greater than or equal to 3.4.0 will always have path-level multikey metadata. Indexes built prior to 3.4 can be rebuilt on 3.4 or newer WiredTiger/In-memory in order to obtain multikey metadata.

      TECHNICAL DETAILS
      Arrays are now sorted according to their lowest-valued element. For example, consider sorting the following documents by {timestamps: 1}:

      { "_id" : 0, "timestamps" : [ ISODate("2017-07-21T15:31:01Z"), ISODate("2017-07-21T13:31:01Z") ] }
      { "_id" : 1, "timestamps" : [ ISODate("2017-07-15T15:31:01Z"), ISODate("2017-07-21T18:31:01Z") ] }
      

      Document 0 has two timestamps: July 21 2017, 3:31 PM UTC and July 21 2017, 1:31 PM UTC. Document 1 also has two timestamps: July 15 2017, 3:31 PM UTC and July 21 2017 5:31 PM UTC. Document 0 will sort according to its earliest time, 1:31 PM on July 21. Similarly, document 1 will sort according to its July 15 datetime. Therefore document 1 will sort before document 0 in both find and aggregate:

      > db.c.find().sort({timestamps: 1});
      { "_id" : 1, "timestamps" : [ ISODate("2017-07-15T15:31:01Z"), ISODate("2017-07-21T18:31:01Z") ] }
      { "_id" : 0, "timestamps" : [ ISODate("2017-07-21T15:31:01Z"), ISODate("2017-07-21T13:31:01Z") ] }
      
      > db.c.aggregate([{$sort: {timestamps: 1}}]);
      { "_id" : 1, "timestamps" : [ ISODate("2017-07-15T15:31:01Z"), ISODate("2017-07-21T18:31:01Z") ] }
      { "_id" : 0, "timestamps" : [ ISODate("2017-07-21T15:31:01Z"), ISODate("2017-07-21T13:31:01Z") ] }
      

      For a descending sort, the most recent time in the array will be used as the sort key: 3:31 PM on July 21 for document 0 and 5:31 PM on July 21 for document 1. Since the sort is descending these keys are then ordered from most recent to least recent, resulting in document 1 sorting before document 0:

      > db.c.find().sort({timestamps: -1});
      { "_id" : 1, "timestamps" : [ ISODate("2017-07-15T15:31:01Z"), ISODate("2017-07-21T18:31:01Z") ] }
      { "_id" : 0, "timestamps" : [ ISODate("2017-07-21T15:31:01Z"), ISODate("2017-07-21T13:31:01Z") ] }
      
      > db.c.aggregate([{$sort: {timestamps: -1}}])
      { "_id" : 1, "timestamps" : [ ISODate("2017-07-15T15:31:01Z"), ISODate("2017-07-21T18:31:01Z") ] }
      { "_id" : 0, "timestamps" : [ ISODate("2017-07-21T15:31:01Z"), ISODate("2017-07-21T13:31:01Z") ] }
      

      The semantics defining "lowest-valued array element" are identical to those used for index key generation. To determine the sort key for a document d given sort pattern p, we generate the set of index keys K for d by treating p as an index key pattern. The sort key is min(K), where min is determined by the index ordering. Consider the example of sorting the following documents by {"customers.id": -1, "customers.code": 1}:

      {_id: 0, customers:  [{id: 1, code: "X"}, {id: 2, code: "Y"}]}
      {_id: 1, customers:  [{id: 3, code: "W"}, {id: 3, code: "Z"}]}
      

      The index keys for document 0 are (1, "X") and (2, "Y"). The smallest key in this set according to the sort pattern is (2, "Y"), since we are sorting by customers.id descending. Therefore, (2, "Y") is the sort key for document 0. By similar logic, (3, "W") is the sort key for document 1. Comparing these keys, (3, "W") sorts before (2, "Y"), meaning that document 1 sorts first.

      PRE-EXISTING BEHAVIOR FOR FIND
      Prior to this change, the find command's behavior was similar to the above, except the smallest in-bounds element was selected as the sort key. A key was considered in-bounds based on the query predicate. Consider this example:

      // Query.
      db.coll.find({a: {$gte: 0}}).sort({a: 1});
      
      // Document.
      {_id: 0, a: [-3, -2, 2, 3]}
      

      The sort key for this document would be 2, because it is the smallest key that falls within the bounds of the query predicate.

      PRE-EXISTING BEHAVIOR FOR AGGREGATE
      For aggregation sorts prior to this change, the entire array was used as the sort key. The array sort keys were compared element-wise to determine the sort order of the result set. For instance, when sorting these documents in ascending order by field a:

      {_id: 0, a: [3, 1, 5]}
      {_id: 1, a: [3, 4, 0]}
      

      The sort keys would be [3, 1, 5] and [3, 4, 0] respectively. Document 0 sorts first: although the first array elements are equal, the second array element breaks the tie.

      Original description

      When the query sort algorithm is given an array value to sort on, an array element is chosen as the sort key by using the query predicate to generate sort bounds (see this comment in sort.cpp for an explanation about why this is the case). Instead, we should consider changing the semantics of sorting on an array value, such that the entire array is used as the sort key.

      Pros:

      • Sort key generation becomes much less expensive.
      • Query sort order will become more consistent with aggregation sort order.
      • Multiple outstanding bugs in the server related to sorting on an array value could be closed as "Gone away" (e.g. SERVER-8152, SERVER-11878, SERVER-19394, SERVER-19397).
      • Implementation of query sort algorithm could be vastly simplified.
      • It's unclear whether the current array sorting semantics are well-defined (in particular, it's unclear whether a complete solution exists to SERVER-19397).

      Cons:

      • Backwards-breaking.
      • Multi-key indexes would have to be considered ineligible for providing a sort.

            Assignee:
            david.storch@mongodb.com David Storch
            Reporter:
            backlog-server-query Backlog - Query Team (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

              Created:
              Updated:
              Resolved: