Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3586

Passthrough mode for Union ALL operator

    Details

      Description

      The Union operator can be a bottleneck when combining large data sets. The operator can be sped by acting as a passthrough operator when the input and output tuple layout matches exactly.

      For the query below UNION ALL consumed 25% of overall time.

       select count(*)
                  from
                    (select c_last_name,c_first_name,d_date, sum(q18.c3) c3,count(*) c4
                     from
                       (
                        select
                          c_last_name,c_first_name,d_date,1 as c3
                        from
                          (select c_last_name,c_first_name,d_date	
                           from
                             (select c_last_name, c_first_name, d_date, q14.c3 c3, q14.c4 c4
                              from
                                (select
                                   c_last_name,c_first_name,d_date,sum(q13.c3) c3,count(*) c4
                                 from
                                   (
                                    select c_last_name,c_first_name,d_date,1 as c3
                                    from
                                      customer, date_dim, store_sales
                                    where
                                      (d_month_seq between 1215 and 1226) and
                                      (ss_customer_sk = c_customer_sk) and
                                      (ss_sold_date_sk = d_date_sk) 
                                    union all
                                    select c_last_name,c_first_name,d_date,-1 as c3
                                    from
                                      customer, date_dim, catalog_sales 
                                    where
                                      (d_month_seq between 1215 and 1226) and
                                      (cs_bill_customer_sk = c_customer_sk) and
                                      (cs_sold_date_sk = d_date_sk) 
                                   ) as q13
                                 group by
                                   c_last_name,
                                   c_first_name,
                                   d_date
                                ) as q14
                              where
                                ((q14.c4 - case when (q14.c3 >= 0) then q14.c3 else -(q14.c3) end) >= 2)
                             ) as q15
                          ) as q16
                          union all
      					select c_last_name,c_first_name,d_date,-1 as c3
                        	from
                          	customer,date_dim,web_sales 
                        	where
      	                    (d_month_seq between 1215 and 1226) and
      	                    (ws_bill_customer_sk = c_customer_sk) and
      	                    (ws_sold_date_sk = d_date_sk) 
                          ) as q18
                     group by
                       c_last_name,
                       c_first_name,
                       d_date
                    ) as q19
                  where
                    ((q19.c4 - case when (q19.c3 >= 0) then q19.c3 else -(q19.c3) end) >= 2)
      

      Summary

      Operator             #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                         
      ---------------------------------------------------------------------------------------------------------------------------------
      31:AGGREGATE              1  414.293ms  414.293ms        1           1   92.00 KB        -1.00 B  FINALIZE                       
      30:EXCHANGE               1  151.874us  151.874us       20           1          0        -1.00 B  UNPARTITIONED                  
      19:AGGREGATE             20  226.910ms  294.268ms       20           1   12.00 KB       10.00 MB                                 
      29:AGGREGATE             20    7s391ms    8s845ms   44.51M      75.60B    1.13 GB      933.80 GB  FINALIZE                       
      28:EXCHANGE              20  517.932ms  631.483ms  233.99M      75.60B          0              0  HASH(c_last_name,c_first_na... 
      18:AGGREGATE             20   17s312ms   25s953ms  233.99M      75.60B    1.76 GB     6626.13 GB  STREAMING                      
      00:UNION                 20   19s022ms   28s694ms    2.27B      75.60B  948.10 KB              0                                 
      |--17:HASH JOIN          20    3s257ms    4s993ms    2.16B      10.80B    2.06 MB      298.19 KB  INNER JOIN, BROADCAST          
      |  |--27:EXCHANGE        20   20.031us   33.432us      365       7.30K          0              0  BROADCAST                      
      |  |  14:SCAN HDFS        1   20.955ms   20.955ms      365       7.30K    5.86 MB       48.00 MB  tpcds_15000_decimal_parquet... 
      |  16:HASH JOIN          20    7s244ms   10s364ms    2.16B      10.80B  394.04 MB      208.20 MB  INNER JOIN, BROADCAST          
      |  |--26:EXCHANGE        20  177.055ms  198.769ms    3.85M       3.85M          0              0  BROADCAST                      
      |  |  13:SCAN HDFS       18   12.368ms   17.726ms    3.85M       3.85M   35.96 MB       72.00 MB  tpcds_15000_decimal_parquet... 
      |  15:SCAN HDFS          20  557.460ms  825.397ms    2.16B      10.80B  120.83 MB       88.00 MB  tpcds_15000_decimal_parquet... 
      25:AGGREGATE             20   17s407ms   24s977ms  109.12M      64.80B    2.01 GB     5679.60 GB  FINALIZE                       
      24:EXCHANGE              20    1s422ms    2s113ms  593.83M      64.80B          0              0  HASH(c_last_name,c_first_na... 
      12:AGGREGATE             20      1m25s      1m48s  593.83M      64.80B    4.01 GB     5679.60 GB  STREAMING                      
      01:UNION                 20      1m39s       2m3s   12.32B      64.80B  883.75 KB              0                                 
      |--11:HASH JOIN          20    6s108ms    7s925ms    4.26B      21.60B    2.06 MB      298.19 KB  INNER JOIN, BROADCAST          
      |  |--23:EXCHANGE        20   18.296us   32.651us      365       7.30K          0              0  BROADCAST                      
      |  |  08:SCAN HDFS        1   25.291ms   25.291ms      365       7.30K    5.86 MB       48.00 MB  tpcds_15000_decimal_parquet... 
      |  10:HASH JOIN          20   13s701ms   17s430ms    4.26B      21.60B  394.04 MB      208.20 MB  INNER JOIN, BROADCAST          
      |  |--22:EXCHANGE        20  159.064ms  181.949ms    3.85M       3.85M          0              0  BROADCAST                      
      |  |  07:SCAN HDFS       18    8.930ms   11.820ms    3.85M       3.85M   35.96 MB       72.00 MB  tpcds_15000_decimal_parquet... 
      |  09:SCAN HDFS          20  999.371ms    1s237ms    4.27B      21.60B  212.29 MB       88.00 MB  tpcds_15000_decimal_parquet... 
      06:HASH JOIN             20   11s606ms   14s429ms    8.06B      43.20B    2.06 MB      298.19 KB  INNER JOIN, BROADCAST          
      |--21:EXCHANGE           20   16.435us   27.733us      365       7.30K          0              0  BROADCAST                      
      |  03:SCAN HDFS           1   19.860ms   19.860ms      365       7.30K    5.86 MB       48.00 MB  tpcds_15000_decimal_parquet... 
      05:HASH JOIN             20   21s797ms   27s529ms    8.06B      43.20B  394.04 MB      208.20 MB  INNER JOIN, BROADCAST          
      |--20:EXCHANGE           20  152.883ms  181.123ms    3.85M       3.85M          0              0  BROADCAST                      
      |  02:SCAN HDFS          18    8.797ms   12.102ms    3.85M       3.85M   36.02 MB       72.00 MB  tpcds_15000_decimal_parquet... 
      04:SCAN HDFS             20    1s807ms    2s247ms    8.25B      43.20B  212.40 MB       88.00 MB  tpcds_15000_decimal_parquet... 
      

        Issue Links

          Activity

          Hide
          tarasbob Taras Bobrovytsky added a comment -
          commit a50c344077f6c9bbea3d3cbaa2e9146ba20ac9a9
          Author: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
          Date:   Wed Jan 4 14:33:08 2017 -0800
          
              IMPALA-3586: Implement union passthrough
              
              The union node acts as pass through operator and forwards row batches
              from it's children without materializing. This is done in the case
              when the child's tuple layout is identical to union node tuple layout
              and no functions need to be applied to the child row batches.
              
              Removed operand reordering in the FE because it's simpler and safer to
              handle all passthrough children before non-passthrough children in the
              BE. The recent improvements to memory management allowed us to remove
              this requirement.
              
              Testing:
              - Added new planner and end to end tests that cover the new
                functionality.
              - Updated existing tests to reflect the new behavior.
          
          Show
          tarasbob Taras Bobrovytsky added a comment - commit a50c344077f6c9bbea3d3cbaa2e9146ba20ac9a9 Author: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Date: Wed Jan 4 14:33:08 2017 -0800 IMPALA-3586: Implement union passthrough The union node acts as pass through operator and forwards row batches from it's children without materializing. This is done in the case when the child's tuple layout is identical to union node tuple layout and no functions need to be applied to the child row batches. Removed operand reordering in the FE because it's simpler and safer to handle all passthrough children before non-passthrough children in the BE. The recent improvements to memory management allowed us to remove this requirement. Testing: - Added new planner and end to end tests that cover the new functionality. - Updated existing tests to reflect the new behavior.
          Hide
          alex.behm Alexander Behm added a comment -

          commit e9a4077b3589ee0a6debadb4b5efc13e05ace773
          Author: Alex Behm <alex.behm@cloudera.com>
          Date: Thu Oct 20 21:16:33 2016 -0700

          IMPALA-3586: Clean up union-node.h/cc to enable improvements.

          This patch does not address IMPALA-3586, but it cleans up the
          code in union-node.h/cc to make it easier to implement those
          perf improvements.

          The major simplification is to remove conjunct evaluation since
          the planner does not assigns conjuncts to a union-node anymore.
          Conjuncts are always pushed to the union operands.

          Change-Id: Ia5fc23985e8d51acb8a6920717ce4e2f0254fe70
          Reviewed-on: http://gerrit.cloudera.org:8080/4817
          Reviewed-by: Alex Behm <alex.behm@cloudera.com>
          Tested-by: Internal Jenkins

          Show
          alex.behm Alexander Behm added a comment - commit e9a4077b3589ee0a6debadb4b5efc13e05ace773 Author: Alex Behm <alex.behm@cloudera.com> Date: Thu Oct 20 21:16:33 2016 -0700 IMPALA-3586 : Clean up union-node.h/cc to enable improvements. This patch does not address IMPALA-3586 , but it cleans up the code in union-node.h/cc to make it easier to implement those perf improvements. The major simplification is to remove conjunct evaluation since the planner does not assigns conjuncts to a union-node anymore. Conjuncts are always pushed to the union operands. Change-Id: Ia5fc23985e8d51acb8a6920717ce4e2f0254fe70 Reviewed-on: http://gerrit.cloudera.org:8080/4817 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins

            People

            • Assignee:
              tarasbob Taras Bobrovytsky
              Reporter:
              mmokhtar Mostafa Mokhtar
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development