Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-10789

Early materialize expressions in ScanNode

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 4.1.0
    • Fix Version/s: Impala 4.1.0
    • Component/s: Backend, Frontend
    • Labels:
      None
    • Epic Color:
      ghx-label-8

      Description

      Impala uses the Late Materialize, to calculate expressions, for example,

      SELECT SUM(col), COUNT(col), MIN(col), MAX(col)
      FROM (
      	SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col
      	FROM functional_parquet.alltypesagg
      ) t

      The Plan likes

      PLAN-ROOT SINK
      |
      03:AGGREGATE [FINALIZE]
      |  output: sum:merge(col), count:merge(col), min:merge(col), max:merge(col)
      |  row-size=32B cardinality=1
      |
      02:EXCHANGE [UNPARTITIONED]
      |
      01:AGGREGATE
      |  output: sum(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT)), count(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT)), min(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT)), max(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT))
      |  row-size=32B cardinality=1
      |
      00:SCAN HDFS [functional_parquet.alltypesagg]
         partitions=11/11 files=11 size=464.70KB
         row-size=15B cardinality=11.00K
      

      In the aggregation phase, the expression in the parameters of the aggregation function is evaluated. In this way, the same expression that appears in multiple aggregation functions needs to be evaluated multiple times, which leads to long time consuming, especially for complex expressions, such as regular expressions.

      For analytic functions and contains union all, 

       

      SELECT SUM(int_col) OVER (PARTITION BY id )
      FROM (
          SELECT id
              , CASE
                  WHEN id = 10 THEN tinyint_col
                  WHEN string_col LIKE '%6%' THEN smallint_col
              END AS int_col
          FROM functional_parquet.alltypesagg
          UNION ALL
          SELECT id
              , CASE
                  WHEN id = 10 THEN tinyint_col
                  WHEN string_col LIKE '%6%' THEN smallint_col
              END AS int_col
          FROM functional_parquet.alltypes
      ) t
      

      The plan likes

      PLAN-ROOT SINK
      |
      06:EXCHANGE [UNPARTITIONED]
      |
      04:ANALYTIC
      |  functions: sum(int_col)
      |  partition by: id
      |  row-size=14B cardinality=18.30K
      |
      03:SORT
      |  order by: id ASC NULLS FIRST
      |  row-size=6B cardinality=18.30K
      |
      05:EXCHANGE [HASH(id)]
      |
      00:UNION
      |  row-size=6B cardinality=18.30K
      |
      |--02:SCAN HDFS [functional_parquet.alltypes]
      |     partitions=24/24 files=24 size=189.91KB
      |     row-size=22B cardinality=7.30K
      |
      01:SCAN HDFS [functional_parquet.alltypesagg]
         partitions=11/11 files=11 size=464.70KB
         row-size=24B cardinality=11.00K

      In UnionNode, it will materialize expressions and prune columns.

      Currently UnionNode is single-threaded and ScanNode supports multi-threading, it will  improve query performance if materialize expressions in ScanNode.

      We can specify which expressions require early materialize by hints, and Impala internally determines if the expression can be evaluated in ScanNode.

      SELECT SUM(col), COUNT(col), MIN(col), MAX(col)
      FROM (
      	SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col/*+materialize_expr*/
      	FROM functional_parquet.alltypesagg
      ) t
      

      This can materialize in ScanNode, but like the follow can't use early matrialize

      SELECT SUM(col)
      FROM (
      	SELECT CASE 
      			WHEN t1.id = 10 THEN t2.tinyint_col
      			ELSE t2.smallint_col
      		END AS col/*+materialize_expr*/
      	FROM functional_parquet.alltypesagg t1
      		JOIN functional_parquet.alltypes t2 ON t1.id = t2.id
      ) t
      

       

        Attachments

          Activity

            People

            • Assignee:
              xqhe Xianqing He
              Reporter:
              xqhe Xianqing He
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated: