Details
-
Improvement
-
Status: In Progress
-
Major
-
Resolution: Unresolved
-
3.3.2
-
None
Description
Design Sketch
How to support more subexpressions elimination cases
- Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression.
- For each common expression:
- Add a new boolean variable subExprInit to indicate whether it has already been evaluated.
- Add a new code block in CodeGenSupport trait, and reset those subExprInit variables to false before the physical operators begin to evaluate the input row.
- Add a new wrapper subExpr function for each common subexpression.
private void subExpr_n(${argList}) { if (!subExprInit) { ${eval.code} subExprInit_n = true; subExprIsNull_n = ${eval.isNull}; subExprValue_n = ${eval.value}; } } |
- When generating the input expression code, if the input expression is a common expression, the expression code will be replaced with the corresponding subExpr function. When the subExpr function is called for the first time, subExprInit will be set to true, and the subsequent function calls will do nothing.
Why should we support whole-stage subexpression elimination
Right now each spark physical operator shares nothing but the input row, so the same expressions may be evaluated multiple times across different operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter operators. We can reuse the expression results across different operators such as Project and Filter.
How to support whole-stage subexpression elimination
- Add two properties in CodegenSupport trait, the reusable expressions and the the output attributes, we can reuse the expression results only if the output attributes are the same.
- Visit all operators from top to bottom, bound the candidate expressions with the output attributes and add to the current candidate reusable expressions.
- Visit all operators from bottom to top, collect all the common expressions to the current operator, and add the initialize code to the current operator if the common expressions have not been initialized.
- Replace the common expressions code when generating codes for the physical operators.
New support subexpression elimination patterns
SELECT case when v + 2 > 1 then 1
when v + 1 > 2 then 2
when v + 1 > 3 then 3 END vv
FROM values(1) as t2(v)
We can reuse the result of expression v + 1
SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc FROM values(1, 1, 1) as t(a, b, c) GROUP BY a
We can reuse the result of expression b + c
SELECT * FROM ( SELECT v * v + 1 v1 from values(1) as t2(v) ) t where v1 > 5 and v1 < 10
We can reuse the result of expression v * v + 1
SELECT * FROM values(1, 1) as t1(a, b) join values(1, 2) as t2(x, y) ON b * y between 2 and 3
We can reuse the result of expression b * y
SELECT a, count(b), count(distinct case when b > 1 then b + c else null end) as count_bc_1, count(distinct case when b < 0 then b + c else null end) as count_bc_2 FROM values(1, 1, 1) as t(a, b, c) GROUP BY a
We can reuse the result of expression b + c
Attachments
Issue Links
- is related to
-
SPARK-44111 Prepare Apache Spark 4.0.0
- Open
- links to