Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-12865

MultiKey Cardinality issues in index intersection

    • Type: Icon: Bug Bug
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 2.6.0-rc0
    • Component/s: Querying
    • None
    • ALL

      Consider the below Query
      db.data.find({
      ... date: {$gt: ISODate("1970-01-01T00:00:00.000Z")},
      ... tags: {$all: ['youtube', 'abc']},
      ... }).limit(100).sort(

      {date: -1}

      )

      with an index on tags + date. where date will be used for the sort portion.

      Behavior in 2.4 :

      The query would scan the index for "youtube" (because it is the first argument) and then scan that subsequent result set for "abc". Now Imagine "youtube" is a tag in 10M documents, whereas "abc" is a tag in 5. Clearly, we would get much better performance by scanning for "abc" instead of "youtube" initially on the index.

      We can manipulate this in 2.4 if we know something about the cardinality of our data by just making "abc" the first argument in the $all. Many people have used this to successfully turn a 30sec query into a 10ms query.

      Behavior in 2.6.rc0:
      The Query would intersect the index with itself, scanning the 5 "abc" docs, and scanning through as much of the "youtube" docs as necessary until it is sure it has all of the relevant documents for indexes (DiskLoc > last DiskLoc of "abc" docs).

      While this in effect should be much better than the 30sec query on average, it can still be as bad as that and on average will scan about half of the 10M documents that match youtube, which is much worse than our optimization of scanning for "abc" instead. Worse yet, it will yield wildly inconsistent performance, from as high as 30sec to as low as a few ms, even if we know the cardinality differences in our data.

      conclusion

      Naturally, turning intersection off if you know your data's cardinality makes sense for these kinds of queries so that you can cleverly structure them. But when you try to run the queries it seems the behavior has changed in 2.5+ releases. The first argument no longer guarantees that that value is what is used for the index. So swapping "abc" to be the first arg may yield a 10ms query or a 30s query. It is no longer deterministic.

      Essentially an existing workaround that people were using has been removed and replaced with an easier, but less effective solution to this problem (index intersection).

      Clearly this problem would be best addressed by having some sort of cardinality histogram about our data in the DB. But given that that may be a few releases away, having some way to deterministically decide what value to scan the index on after we opt out of index intersection would be nice. Argument order as it was previously is probably the best option.

      I also imagine this same problem exists for queries like

      db.data.find( { a :

      { $lt : 10}

      , b :

      { $gt : 20 }

      })

      If there are millions of records with a<10 but only a handful with b>20 then it would be much faster to simply scan b and then it's subset rather than perform an intersection on the result sets of a & b.

            Assignee:
            Unassigned Unassigned
            Reporter:
            osmar.olivo Osmar Olivo
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: