Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-42551

Support more subexpression elimination cases

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • 3.3.2
    • None
    • SQL

    Description

      Design Sketch

      How to support more subexpressions elimination cases

      • Get all common expressions from input expressions of the current physical operator to current CodeGenContext. Recursively visits all subexpressions regardless of whether the current expression is a conditional expression.
      • For each common expression:
        • Add a new boolean variable subExprInit to indicate whether it has  already been evaluated. 
        • Add a new code block in CodeGenSupport trait, and reset those subExprInit variables to false before the physical operators begin to evaluate the input row.
        • Add a new wrapper subExpr function for each common subexpression.
      private void subExpr_n(${argList}) {
       if (!subExprInit) {
         ${eval.code}
         subExprInit_n = true;
         subExprIsNull_n = ${eval.isNull};
         subExprValue_n = ${eval.value};
       }
      }

       

      • When generating the input expression code,  if the input expression is a common expression, the expression code will be replaced with the corresponding subExpr function. When the subExpr function is called for the first time, subExprInit will be set to true, and the subsequent function calls will do nothing.

      Why should we support whole-stage subexpression elimination

      Right now each spark physical operator shares nothing but the input row, so the same expressions may be evaluated multiple times across different operators. For example, the expression udf(c1, c2) in plan Project [udf(c1, c2)] - Filter [udf(c1, c2) > 0] - Relation will be evaluated both in Project and Filter operators.  We can reuse the expression results across different operators such as Project and Filter.

      How to support whole-stage subexpression elimination

      • Add two properties in CodegenSupport trait, the reusable expressions and the the output attributes, we can reuse the expression results only if the output attributes are the same.
      • Visit all operators from top to bottom, bound the candidate expressions with the output attributes and add to the current candidate reusable expressions.
      • Visit all operators from bottom to top, collect all the common expressions to the current operator, and add the initialize code to the current operator if the common expressions have not been initialized.
      • Replace the common expressions code when generating codes for  the physical operators.

      New support subexpression elimination patterns

      • Support subexpression elimination with conditional expressions

      SELECT case when v + 2 > 1 then 1
                  when v + 1 > 2 then 2
                  when v + 1 > 3 then 3 END vv
      FROM values(1) as t2(v)
      

      We can reuse the result of expression  v + 1

      SELECT a, max(if(a > 0, b + c, null)) max_bc, min(if(a > 1, b + c, null)) min_bc
      FROM values(1, 1, 1) as t(a, b, c)
      GROUP BY a
      

      We can reuse the result of expression  b + c

      • Support subexpression elimination in FilterExec

       

      SELECT * FROM (
        SELECT v * v + 1 v1 from values(1) as t2(v)
      ) t
      where v1 > 5 and v1 < 10
      

      We can reuse the result of expression  v * v + 1

      • Support subexpression elimination in JoinExec

       

      SELECT * 
      FROM values(1, 1) as t1(a, b) 
      join values(1, 2) as t2(x, y)
      
      ON b * y between 2 and 3

       
      We can reuse the result of expression  b * y

      • Support subexpression elimination in ExpandExec

      SELECT a, count(b),
         	count(distinct case when b > 1 then b + c else null end) as count_bc_1,
         	count(distinct case when b < 0 then b + c else null end) as count_bc_2
      FROM values(1, 1, 1) as t(a, b, c)
      GROUP BY a
      

      We can reuse the result of expression  b + c

      Attachments

        Activity

          People

            Unassigned Unassigned
            wankun Wan Kun
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: