-
Type: Improvement
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Aggregation Framework, Querying
-
Query 2020-01-13
-
(copied to CRM)
Synopsis:
Because the Query Parser sorts the conditions in a $or clause and then evaluates them in that order, changing the name of the field can affect performance.
Detailed explanation:
I created a collection with 200 documents, all in this format:
{ _id: ObjectId, firstField: true, giantArray: [array of 20,000 members like {ae: intValue}] lastField: true }
The values for giantArray.ae are random values, but the first 19,999 are all > 17.
The last one has value 11.
Single Condition Tests
================
I ran find(qry).count() against the collection, where qry was each of these filters:
1. {firstField: true}
2. {"giantArray.ae": 11}
3. {lastField: true}
I ran each query 21 times. The first time was to warm the cache, and then I recorded how long the next 20 executions took. All queries used a COLLSCAN as the collection has no indexes beyond the default one on _id, and all scanned all 100 documents.
Single Test Results:
==============
- The equality matches on each of the boolean fields (firstField and lastField) alone both executed 20 times in 8 to 18 milliseconds.
- The equality match against the nested array field "giantArray.ae" alone was much slower, with 20 executions taking 12,332 milliseconds.
This is not unexpected, as the singleton boolean fields were very fast to test, whereas all 20,000 array members had to be examined to find the one with value 11.
$or Tests
=======
Next I combined the equality tests on the boolean and array fields into $or joined pairs, in both orders (i.e. {$or: [X, Y]} and {$or: [Y, X]}:
- {$or: [{firstField: true}, {lastField: true}]} - {$or: [{firstField: true}, {"giantArray.ae": 11}]} - {$or: [{"giantArray.ae": 11}, {firstField: true}]} - {$or: [{"giantArray.ae": 11}, {lastField: true}]} - {$or: [{lastField: true}, {firstField: true}]} - {$or: [{lastField: true}, {"giantArray.ae": 11}]}
$or Joined Tests Results:
==================
- Both {$or: [X, Y]} and {$or: [Y, X]} gave roughly the same performance. This is because the Query Planner sorts the members of the $or set when it parses the query. You can see this by comparing the input query with the explain().queryPlanner.parsedQuery values: both {$or: [X, Y]} and {$or: [Y, X]} are parsed to {$or: [X, Y]}.
Here the significance of the field names becomes apparent.
- If {firstField: true} was one of the conditions in the $or, the parser sorted it to be first, and the query executed 20 times in 10 to 24 milliseconds. This is similar to the time to execute queries with
{firstField: true}
by itself.
- In contrast {lastField: true} paired with {"giantArray.ae": 11} (in either order) took over 1200 milliseconds, because the parser sorts the condition on the array field to be first. This is similar to the time to to execute queries with {"giantArray.ae": 11} by itself.
Conclusion:
===========
If the Query Planner did not sort the members of the $or set when it parsed the query, the order of the conditions would affect the order of execution. However, I believe the sort is required so that functionally identical queries can be matched to the same cached plan.
I do not think there is a backend code change to help with this behaviour, as due to MongoDB's flexible schema the Query Parser cannot not know which fields are arrays before examining them. This SERVER ticket might only serve to socument this behaviour, and to pointy out the workaround.
Workaround:
=========
- Give array fields names that start late in the alphabet, so that conditions on them get sorted after conditions on simpler fields.
- duplicates
-
SERVER-45364 Query Planner should estimate cost of each predicate
- Backlog
- related to
-
SERVER-37530 Provide a way to cause a well-defined order of evaluation for predicates
- Backlog