Details

Improvement

Status: Open

Minor

Resolution: Unresolved

Impala 3.0

None

None

ghxlabel5
Description
This is a rollup of a number of minor cleanup tasks for the expression rewriter. None of this stuff is urgent; we know this bit of code has many opportunities for improvement, but we might as well capture what we know.
IMPALA7655 asks to revisit the rewrite rules for several conditional functions. philip suggested that the rewrite rules should apply to all of them. To keep IMPALA7655 focused, the larger review is presented here, along with suggested opportunities to modernize the frontend rewrite rules.
This is the toplevel task for the review tasks, each change is identified by a subtask or linked task in order to keep each code review task small.
Overview
The full set of conditional functions include:
if(boolean condition, type ifTrue, type ifFalseOrNull) ifnull(type a, type ifNull) isfalse(boolean) isnotfalse(boolean) isnottrue(boolean) isnull(type a, type ifNull) istrue(boolean) nonnullvalue(expression) nullif(expr1,expr2) nullifzero(numeric_expr) nullvalue(expression) nvl(type a, type ifNull) nvl2(type a, type ifNull, type ifNotNull) zeroifnull(numeric_expr)
Turns out conditionals are complex as substantial prior work has gone into optimizations. The FE has a number of transforms that affect specific conditional statements. The BE has additional transforms. To proceed, each operation must be tracked through the system one by one.
The discussion below summarizes the state of each of the Impala conditional functions to identify the path needed to implement the requested changes, and to ensure that the changes don't impact other functionality. We also point out a few outofscope nicetohaves as we go along.
In general, all the action here is in just a few places:
 sqlparser.cup in which syntax is reduced to parse nodes such as functions or operators. The parser unifies certain constructs such as <=> and IS NOT DISTINCT FROM.
 FunctionCallExpr.createExpr() is given a functionlike definition and converts some of them to other forms (decode(), nvl2(, nullif(). A nicetohave would be to move this logic to SimplifyConditionalsRule.apply() so we have a uniform way of doing transforms.
 SimplifyConditionalsRule does a great many transforms of various conditional rules. (We will add more for this task.)
 impala_functions.py in the BE provides a mapping from remaining functions (those not optimized away above) to implementations. All functions listed here are crosscompiled into LLVM along with a generated wrapper function that binds the function to its set of arguments.
 conditionalfunctions.[hcc] handles special case functions that require shortcircuit argument evaluation (isull(), if(), coalesce()). These three functions are never code generated. The goal of this task is to convert these into a code generated for using CASE.
For all expressions, the planner does a check for allconstant expressions (such as NULL IS NOT NULL or (10 = 9) IS TRUE) and replaces them with the result of the expression by using the BE to interpret the partial constantonly expression tree. As a result, the rewrite steps focus on the nontrivial cases that require knowledge of the semantics of a given function.
In the suggestions that follow, we rewrite certain functions into CASE. But, in so doing, we end up evaluating certain terms twice. IMPALA7737 asks to resolve that issue.
Below is a summary of each conditional function that identifies current state and any changes that might be possible.
CASE ...
BE: Interpreted when in the SELECT clause (IMPALA4356). Code generated when in the WHERE clause or in a join.
x IS [NOT] (TRUE  FALSE)
FE, sqlparser.cup: captured as a FunctionCallExpr for the equivalent ISTRUE(x), etc. function.
x IS [NOT] NULL
FE, sqlparser.cup: captured as a IsNullPredicate. (Note that this is the opposite of IS TRUE, etc.)
BE: Cross compiled as a UDF: IsNullPredicate::Is[Not]Null, with wrapper.
IS[NOT](TRUEFALSE)(x)
BE: Implemented in ConditionalFunctions::IsTrue, etc.
NULLIF(expr1, expr2)
FE, FunctionCallExpr: nullif(expr1, expr2) → if(expr1 IS DISTINCT FROM expr2, expr1, NULL)
NULLIF() and NVL2() vanish from the plan after this step. There is no entry for nullify() in impala_functions.py.
Note that the implementation here is different from the docs which suggests that the rewrite uses equality. Both for normal data and nulls. However, the implementation actually will handle the NaN case for floats once IMPALA6661 is fixed:
10 * NULLIF(x, sqrt(1))
The above will produce a NULL if x is NaN, 10 * x otherwise. This is a hidden bonus of the current implementation.
Suggestion:
 Move the rewrite rules from FunctionCallExpr.createExpr() into SimplifyConditionalRules.
 Add a signature for this function to impala_functions.py so that it appears in _impala_builtins.
 Add two simplification rules:
 nullif(NULL, x) → NULL
 nullif(x, NULL) → NULL
 Directly rewrite to a CASE expression:
CASE WHEN expr1 IS DISTINCT FROM expr2 THEN expr1 END
NVL2(expr, ifNotNull, ifNull)
FE, FunctionCallExpr: Rewritten to if(expr IS NOT NULL, ifNotNull, ifNull). nvl2() vanishes from the plan at this point and does not appear in impala_functions.py.
Suggestion:
 Move rewrite from FunctionCallExpr.createExpr() into SimplifyConditionalRules.
 Add a signature for this function to impala_functions.py so that it appears in _impala_builtins.
 Add two simplifications:
 nvl2(null, a, b) → b
 nvl2(nonnulllisteral, a, b) → a
 Directly rewrite to a CASE expression:
CASE WHEN expr IS NOT NULL THEN ifNotNull ELSE ifNull END
[NON]NULLVALUE(x)
An entry in impala_functions.py maps this method to the compiled IS [NOT] NULL operator implementations.
Suggestion: To make impala_functions.py less messy, add a transform to the FE to replace these functions with the operators, and remove the functions' entries from impala_functions.py. This also ensures that all optimization applied to the operators is also done for the functions.
x <=> (TRUE  FALSE  NULL)
x IS [NOT] DISTINCT FROM (TRUE  FALSE  NULL)
FE sqlparser.cup: Parsed (in generic form) into a BinaryPredicate(BinaryPredicate.Operator.(NOT_DISTINCTDISTINCT_FROM)...)
BE: Implemented code generated Operators::NotDistinct_BooleanVal_BooleanVal.
Suggestion: To leverage special Boolean optimizations, rewrite the above to IS(TRUEFALSE)(x) or x IS [NOT] NULL in the planner. (The planner appears to already rewrite expressions such as TRUE <=> x into a canonical form so that the rewrite rules need not handle both versions.)
Note: there is no function equivalent of these functions, they are "invisible" to the user, but are listed as distinctfrom and notdistinct in impala_functions.py.
NULLIFZERO(x)
ZEROIFNULL(x)
BE: Implemented as native functions; code generated with wrapper functions.
ISNULL(a, b)
NVL(a, b)
IFNULL(a, b)
IF(cond, trueExpr, falseExpr)
COALESCE(e1, e2, … en)
See IMPALA7655.
DECODE(expr, search1, result1 [, search2, result2 ...] [, default] )
FE: FunctionCallExpr, CaseExpr: Rewrites decode() to CASE. decode() vanishes from the plan after this step.
See the header of CaseExpr.java for details. Looks like the implementation was done before IS DISTINCT was available:
Example of equivalent CASE for DECODE(foo, 'bar', 1, col, 2, NULL, 3, 4):
CASE WHEN foo = 'bar' THEN 1  no need for IS NULL check WHEN foo IS NULL AND col IS NULL OR foo = col THEN 2 WHEN foo IS NULL THEN 3  no need for equality check ELSE 4 END
Nicetohave: In FE, modify to use <=> (AKA IS NOT DISTINCT):
CASE [WHEN expr <=> searchi THEN resulti]+ [ELSE default]? END
Example:
CASE WHEN foo <=> 'bar' THEN 1 WHEN foo <=> col THEN 2 WHEN foo <=> NULL THEN 3 ELSE 4 END
This expansion (and the original one) evaluates the decode expression multiple times and would benefit from the optimization mentioned earlier.
Note also that decode() can be used to pick out floatingpoint NaN values:
decode(float_col, sqrt(1), 0, float_col)
Here, sqrt(1) is used to create a NaN value because Impala has no NaN constant or function.
As it turns out decode() is a rather special beast because it needs to declare n^2 versions for the full set of types. For this reason, we can't add it to impala_functions.py and thus can't move the rewrite rules. We'll leave it as the lone remaining rewrite in FunctionCallExpr.createExpr().
Suggestion: The current implementation is rather adhoc, probably because of the unusual nature of the types of the arguments to decode(). Would be cleaner to do the rewrite as rewrite rule rather than as an adhoc step when creating an expression. That is, rather than doing the rewrite in FunctionCallExpr.createExpr(), do it in SimplifyConditionalsRule.apply().
Doing this would allow us to add an entry for {[decode()}} in the builtinfunctions table. To handle the odd arguments, create a oneoff ScalarFunction subclass do to the specialized argument matching.
Attachments
Issue Links
 incorporates

IMPALA7754 Expressions sometimes not reanalyzed after rewrite
 Open

IMPALA7793 Do not rewrite CAST(nonnullliteral AS type) exprs
 Open

IMPALA7752 Stmt.reset() doesn't (and can't)
 Open

IMPALA7753 Rewrite engine ignores toplevel expressions in ORDER BY clause
 Open

IMPALA7755 IS [NOT] DISTINCT FROM rewrite rules don't handle aggregates
 Open

IMPALA7781 Clean up adhoc instance of, casts to use predicates in expr rewriter
 Open

IMPALA7785 GROUP BY clause not analyzed prior to rewrite step
 Reopened

IMPALA7740 Incorrect doc description for nvl2()
 Closed

IMPALA7737 Evaluate repeated subexpressions only once
 Open

IMPALA7741 Functions nvl2(), decode(), nullif() not listed in _impala_builtins
 Open

IMPALA7769 Handle CAST(NULL AS type) in rewrites
 Open

IMPALA7750 Prune trivial ELSE clause in CASE simplification
 Open

IMPALA7655 Codegen output for conditional functions (if,isnull, coalesce) is very suboptimal
 Resolved

IMPALA7715 "Impala Conditional Functions" documentation errata
 Closed

IMPALA7739 Errata in documentation of decode() method
 Closed