-
Type: Improvement
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Querying
-
Fully Compatible
-
Query Optimization 2021-03-22, Query Optimization 2021-04-05, Query Optimization 2021-04-19, Query Optimization 2021-05-03, Query Optimization 2021-05-17
-
(copied to CRM)
The following script generates increasingly large $or queries, where each clause can use the _id index:
(function() { "use strict"; let n = 0; let step = 1000; let iters = 20; function setup() { db.c.drop(); db.c.insert({_id: 0}) } function remove() { let terms = []; for (var i = 0; i < n; i++) { terms.push({_id: i}) } db.c.find({$or: terms}).itcount(); } for (let i = 0; i < iters; i++) { n += step; setup(); let startTime = new Date(); remove(); let endTime = new Date(); let elapsedTime = endTime - startTime; print(n + "\t" + elapsedTime); } }());
As observed in SERVER-33527, the run time of this query grows super-linearly with the number of $or clauses. Profiling indicates that much of the time is being spent in the planner computing the union of two sets of index bounds. This is implemented in IndexBoundsBuilder::unionize():
This function is O(n log n) in the number of intervals being unionized. Since we must maintain the intervals in indexed order, we std::sort() the intervals prior to scanning the list for intervals that can be merged.
The bigger problem, however, is that we process one clause of the $or at a time, unionizing each into the bounds one-by-one. Therefore, we observe that the runtime is quadratic in n. Since n is related to the size of the query, not the size of the data, n is often quite small. However, when an $or query results in many non-overlapping intervals over the same index, IndexBoundsBuilder::unionize() is slow.
I verified that this is indeed the problem by applying the following patch. (Disclaimer: This patch is not a fix. It comments out logic necessary for other queries in order to prove where the time is going.) After applying the patch, the query runtimes are significantly reduced. The 20,0000 clause query, for example, goes from about 40 seconds to 400 milliseconds on my machine.
diff --git a/src/mongo/db/query/index_bounds_builder.cpp b/src/mongo/db/query/index_bounds_builder.cpp index aa91b2c35a..c36c66ae19 100644 --- a/src/mongo/db/query/index_bounds_builder.cpp +++ b/src/mongo/db/query/index_bounds_builder.cpp @@ -266,7 +266,7 @@ void IndexBoundsBuilder::translateAndUnion(const MatchExpression* expr, oilOut->intervals.insert(oilOut->intervals.end(), arg.intervals.begin(), arg.intervals.end()); // Union the appended intervals with the existing ones. - unionize(oilOut); + // unionize(oilOut); } bool typeMatch(const BSONObj& obj) { @@ -954,6 +954,10 @@ void IndexBoundsBuilder::alignBounds(IndexBounds* bounds, const BSONObj& kp, int ++oilIdx; } + for (auto&& oil : bounds->fields) { + std::sort(oil.intervals.begin(), oil.intervals.end(), IntervalComparison); + } + if (!bounds->isValidFor(kp, scanDir)) { log() << "INVALID BOUNDS: " << redact(bounds->toString()) << endl << "kp = " << redact(kp) << endl
- is related to
-
SERVER-35693 Parsing of $in takes quadratic time due to O(n^2) boost::flat_set constructor
- Closed
-
SERVER-33527 Plan with complex filter can hold lock for an extended time without yielding
- Closed
-
SERVER-34029 mongod get stuck after upgrade
- Closed
-
SERVER-62353 Ensure $or -> $in optimization parity
- Closed
- related to
-
SERVER-55639 $in with regex has special meaning
- Backlog
-
SERVER-55509 {example: /regex/} and {example: {$eq: /regex/}} do not mean the same thing
- Backlog
-
SERVER-66379 $or to $in conversion flawed
- Closed
-
SERVER-72450 $or with regex to $in rewrite depends on terms order
- Closed
-
SERVER-37176 IndexBoundsBuilder::unionize() can be skipped for $in
- Closed