Uploaded image for project: 'IMPALA'
  2. IMPALA-7831

Revisit expression rewriting integration with planner



    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • Impala 3.0
    • None
    • Frontend
    • None
    • ghx-label-1


      The planner performs expression rewriting. It appears that the rewrite engine was added late in planner development, as an add-on step in AnalysisContext after we create the plan. Since that time, it appears that a number of fixes and patches have been applied to work around the inevitable bugs that resulted from this placement of the logic.

      At present, the planner flow, with rewrites, is:

      • Analyze the entire query
      • Assign WHERE clause "conjuncts" to scan nodes, etc.
      • Cerate theĀ full plan
      • Rewrite the SELECT, WHERE, HAVING and GROUP BY clauses
      • Throw away the plan create above and create a new one

      This ticket proposes to adjust the flow to incorporate rewrites earlier in the process, allowing the planner to make a single pass over the query. (Which will solve a number of bugs described in associated tickets.)


      The above logic evolved because of a timing issue: once we assign conjuncts, we have plan nodes that point to the original WHERE clause expressions. We later rewrite these, but we do so by throwing away the original nodes, replacing them with new ones. Since the scan and other nodes still have a pointer to the old version, the rewrites can have no effect.

      To work around this, the code throws away that original plan and replans using the new, rewritten nodes.

      This then creates an interesting issue. We do the full analysis (and plan) because we need the column bindings in order to do the rewrite. Since plan/analysis is implemented as a single black box, rewrites can't be done before planning (no column binding yet) so must be done after (column bindings available, but so is the entire plan.)

      Some expression nodes have incomplete implementations. For example, X BETWEEN Y AND Z does not compute a cost (because it is a "virtual" node: it does not exist at run time, having been rewritten to Y <= X AND X <= Z.) This means that, not only do we throw away the first plan, that first plan was actually wrong: it used incomplete information.

      Thus, in order to get the semantic info needed for rewrites (column bindings), we end up creating an entire plan which we must then discard and rebuild after doing the rewrites (so the planner has the full information.)


      The alternative approach is to integrate expression rewrites into the planner process, rather than doing them from the outside so that we make only a single pass through the planner. In particular:

      • Analyze expressions to create column bindings.
      • Match up SELECT and GROUP BY and other expressions (if required.) GROUP BY points to a SELECT clause node (so it will see rewrites) rather than each SELECT expression (which will be discarded.)
      • Rewrite SELECT and WHERE clause expressions. (Bound GROUP BY expressions will see the rewrites.)
      • Complete the plan as today.

      With this approach, we plan only once, and that plan has a full set of cost information based on the rewritten expressions which the BE will execute.

      The purpose of this ticket is to track this analysis and to later propose a detailed fix.


        Issue Links



              Unassigned Unassigned
              Paul.Rogers Paul Rogers
              0 Vote for this issue
              1 Start watching this issue