Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-58095

Implement a workaround for cases where SBE has meaningfully different evaluation & typechecking order/semantics

    • Type: Icon: Task Task
    • Resolution: Won't Do
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Execution

      Description of the issue

      There are a number of cases where SBE produces meaningfully different results (vs. the classic engine) due to differences in evaluation & typechecking order/semantics. For the purposes of this JIRA ticket, "meaningfully different results" means that one engine throws an error but the other does not (or neither engine throws an error but their outputs differ in a way that would be considered a breaking change).

      Category #1: As a general rule SBE evaluates all of an operator's arguments first (excluding a few special cases), vs. the classic engine which does not always follow this rule.

      In the classic engine, there are several operators where if one argument evaluates to null then the evaluation of the remaining arguments "short-circuits" (i.e. the remaining arguments don't get evaluated) and the operator returns null. SBE, on the other hand, evaluates all of an operator's arguments first before doing anything else as a genearl rule (excluding special cases like $ifNull, $cond, and $switch). Here is a list of operators that are known to have this short-circuiting behavior in the classic engine: $add, $concat, $concatArrays, $dateAdd, $dateDiff, $dateSubtract, $dateTrunc, $indexOfArray, $indexOfBytes, $indexOfCP, $multiply, $round, $setIntersection, $setUnion, $trim, $trunc, $zip. Consider the following example:

      db.c.aggregate([{total: {$add: ["$not_here", {$divide: ["$a", 0]}]}}; // "$not_here" is a non-existent field

      In the classic engine, "$not_here" evaluates to null, and this causes the evaluation of $add's remaining arguments to "short-circuit". '{$divide: ["$a", 0]}' never gets evaluated and the $add operator returns null. In SBE, all of $add's arguments get evaluated first and there is no short-circuiting behavior. "$not_here" gets evaluated and produces null, and then '{$divide: ["$a", 0]}' gets evaluated and it throws a "can't $divide by zero" error.

      Category #2: As a general rule, after evaluating all of an operator's arguments SBE will then check if any of the arguments are null before doing anything else, vs. the classic engine which may perform null checks and other typechecks in a different order.

      $multiply is an example of an operator that can produce meaningfully different results on SBE vs. classic because typechecks are performed in different orders. Consider the following an example:

      db.c.aggregate([{total: {$multiply: ["$a", null]}}]); // "$a" is a string field

      In the classic engine, "$a" evaluates to a string, then $multiply checks if "$a" is null or if it is numeric, and then $multiply throws an error because "$a" is not null and not numeric. In SBE, all of $multiply's arguments are evaluated first, then $multiply checks if any of its arguments are null, and then $multiply returns null because at least one argument was null.

      There are other operators such as $concat that also perform typechecks in different orders. We do not have a complete list at this time.

      Goals for this ticket

      The classic engine's approach to evaluation & typechecking order/semantics is fairly ad-hoc. There seems to be a near-50/50 split in the classic engine between operators that evaluate all their arguments first vs. operators that interleave argument evaluation and argument typechecks. SBE, on the other hand, endeavored to take a more principled approach, where all arguments are evaluated first, and then all arguments are checked for null, and then other typechecks are performed after that.

      Where there are differences between the two engines due to SBE taking a more principled approach, we would like to keep SBE's behavior. However, as shown in the examples above, there are some cases that worked under the classic engine but fail under SBE (and vice versa), and this presents a rare (but non-zero) chance that existing user workloads could be disrupted when migrating from the classic engine to SBE.

      After discussing this issue with Product, the consensus was that we would like to implement a flag that users could enabled to get the old classic-engine behavior for the relevant operators. The idea is that if a user's workload is disrupted when they migrate from the classic engine to SBE, they could enable this flag as a "workaround" in the near team to get the "classic behavior" for these operators keep their workload running for the time being. There are several possible ways this flag could be implemented. One idea is to change SBE to generate slightly different SBE plans when this flag is enabled to mimic the classic engine's behavior. Another idea is that this flag would selectively disable SBE for queries that make use of the affected operators and fallback to the classic engine for such queries.

      Here are the goals of this ticket:

      (1) Make sure we have a complete list of all the meaningful differences due to SBE having different evaluation & typechecking order/semantics (vs. the classic engine). I believe the examples above cover everything that is known at this time, but it would be good to do one more pass to see if we can uncover any other meaningful differences.

      (2) See if there are any differences between SBE vs. classic that are not due to SBE taking a more "principled approach" to evaluation & typechecking order/semantics. If any such cases are identified, change SBE's implementation to match the classic engine.

      (3) Implement a new user-facing flag (which is "off" by default) that can be enabled to get the "classic behavior" for the relevant operators. This includes ensuring we have adequate integration test coverage both for when the flag is enabled and for when it is disabled. We should also determine what documentation is needed, if any.

      (4) Update our fuzzer tests as appropriate. We need to figure out whether we want to have this flag enabled or disabled for the fuzzer tests (or whether we want to run the some of the fuzzers both with the flag enabled and disabled). Given that this new flag will be "off" by default, it would be good to at least have some fuzzer coverage with this flag disabled.

            Assignee:
            backlog-query-execution [DO NOT USE] Backlog - Query Execution
            Reporter:
            andrew.paroski@mongodb.com Drew Paroski
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

              Created:
              Updated:
              Resolved: