- Type: Bug
- Resolution: Done
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Aggregation Framework
- ALL
While implementing a feature to handle CSV-like input of the form:
A,B,C // header
1,2,3
4,5,6
etc...
We naively implemented it with the following $match condition:
$or: [ { A: 1, B: 2, C: 3}, { A: 4, B: 5, C: 6}, etc... ]
After seeing the poor performance and scalability of this approach, we tried two alternatives (both inside an aggregation pipeline):
- One with $in:
$project: { computed_obj: { "1": "$A", "2": "$B", "3": "$C" } }, $match: { computed_obj: { $in: [ { "1": 1, "2": 2, "3": 3 }, { "1": 4, "2": 5, "3": 6 }, etc... ] } }
- One with $setIsSubset:
$project: { condition_value: { $setIsSubset: [ { $map: { input: [null], as: "var__", in: { "1": "$A", "2": "$B", "3": "$C" } } }, [ { "1": 1, "2": 2, "3": 3 }, { "1": 4, "2": 5, "3": 6 }, etc... ] ] } }, $match: { condition_value: true }
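For anyone trying to reproduce this, here is a minimal sketch of how the three variants above could be generated from already-parsed CSV rows. The helper names (`buildOrMatch`, `buildInStages`, `buildSetIsSubsetStages`, `positionalDoc`) are hypothetical and not part of our actual code. The sketch only builds the stage documents and does not talk to a server.

```javascript
// Assumption: the CSV has already been parsed into a header and numeric rows.
const header = ["A", "B", "C"];
const rows = [[1, 2, 3], [4, 5, 6]];

// Maps positions to the keys "1", "2", "3", ... used in the ticket's examples.
// Integer-like keys are enumerated in ascending order by JavaScript, which
// matches the positional order here.
function positionalDoc(values) {
  return Object.fromEntries(values.map((v, i) => [String(i + 1), v]));
}

// Variant 1: one $or branch per CSV row.
function buildOrMatch(header, rows) {
  const branches = rows.map((row) =>
    Object.fromEntries(header.map((field, i) => [field, row[i]]))
  );
  return { $match: { $or: branches } };
}

// Variant 2: project the fields into one sub-document, then match it with $in.
function buildInStages(header, rows) {
  const computed = positionalDoc(header.map((field) => "$" + field));
  return [
    { $project: { computed_obj: computed } },
    { $match: { computed_obj: { $in: rows.map(positionalDoc) } } },
  ];
}

// Variant 3: wrap the sub-document in a one-element array via $map and
// test it against the candidate set with $setIsSubset.
function buildSetIsSubsetStages(header, rows) {
  const computed = positionalDoc(header.map((field) => "$" + field));
  const singleton = { $map: { input: [null], as: "var__", in: computed } };
  return [
    {
      $project: {
        condition_value: { $setIsSubset: [singleton, rows.map(positionalDoc)] },
      },
    },
    { $match: { condition_value: true } },
  ];
}

console.log(JSON.stringify(buildOrMatch(header, rows)));
console.log(JSON.stringify(buildInStages(header, rows)));
console.log(JSON.stringify(buildSetIsSubsetStages(header, rows)));
```

The two multi-stage variants return arrays, so they can be spread into a pipeline, e.g. `[...buildInStages(header, rows)]`.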
We found that once the sets became large enough, the $in approach was slower than the $setIsSubset one, and did not even appear to have the same complexity.
We then noticed that $setIsSubset uses a std::unordered_set, whereas $in uses a plain std::set.
Is there a reason why $in uses a std::set rather than a std::unordered_set?
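To illustrate why the container choice could matter, here is a rough model of the two lookup strategies. It is not the server's code. A sorted array with binary search stands in for an ordered set such as std::set (logarithmic lookup, with each step comparing whole documents). A JavaScript `Set` keyed by a string stands in for a hashed set such as std::unordered_set (constant-time lookup on average). All function names are hypothetical, and whether this is the actual cause of the slowdown we measured is an assumption.

```javascript
// Compares two positional documents field by field and counts the call.
function compareDocs(a, b, counter) {
  counter.count++;
  for (const key of ["1", "2", "3"]) {
    if (a[key] !== b[key]) return a[key] < b[key] ? -1 : 1;
  }
  return 0;
}

// Builds n candidate documents that are already in sorted order.
function makeCandidates(n) {
  const docs = [];
  for (let i = 0; i < n; i++) docs.push({ "1": i, "2": i + 1, "3": i + 2 });
  return docs;
}

// Ordered lookup: binary search, one document comparison per step.
function orderedContains(sorted, doc, counter) {
  let lo = 0;
  let hi = sorted.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const c = compareDocs(sorted[mid], doc, counter);
    if (c === 0) return true;
    if (c < 0) lo = mid + 1;
    else hi = mid - 1;
  }
  return false;
}

// Hashed lookup: one key computation, then a single Set probe.
function docKey(doc) {
  return `${doc["1"]},${doc["2"]},${doc["3"]}`;
}

function buildHashed(docs) {
  return new Set(docs.map(docKey));
}

function hashedContains(hashed, doc) {
  return hashed.has(docKey(doc));
}

// Demo: document comparisons per ordered lookup grow with log2(n).
for (const n of [1024, 65536]) {
  const candidates = makeCandidates(n);
  const counter = { count: 0 };
  const probe = candidates[n - 1];
  orderedContains(candidates, probe, counter);
  const hashed = buildHashed(candidates);
  console.log(n, "ordered comparisons:", counter.count,
              "hashed hit:", hashedContains(hashed, probe));
}
```

In this model the ordered lookup performs a number of full document comparisons that grows with the logarithm of the set size, while the hashed lookup does a fixed amount of work per probe on average.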
- related to
  - SERVER-18733 Streamline set cache optimization for set operations
  - Backlog