Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-9713

CBO : inefficient join order created for left join outer condition

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 0.14.0
    • None
    • CBO
    • None

    Description

      For the query below which is a subset of TPC-DS Query 80, CBO joins catalog_sales with catalog_returns first although the CE of the join is relatively high.
      catalog_sales should be joined with the selective dimension tables first.

      select  cp_catalog_page_id as catalog_page_id,
                sum(cs_ext_sales_price) as sales,
                sum(coalesce(cr_return_amount, 0)) as returns,
                sum(cs_net_profit - coalesce(cr_net_loss, 0)) as profit
        from catalog_sales left outer join catalog_returns on
               (cs_item_sk = cr_item_sk and cs_order_number = cr_order_number),
           date_dim,
           catalog_page,
           item,
           promotion
       where cs_sold_date_sk = d_date_sk
             and d_date between cast('1998-08-04' as date)
                        and (cast('1998-09-04' as date))
              and cs_catalog_page_sk = cp_catalog_page_sk
             and cs_item_sk = i_item_sk
             and i_current_price > 50
             and cs_promo_sk = p_promo_sk
             and p_channel_tv = 'N'
      group by cp_catalog_page_id
      

      Logical plan from CBO debug logs

      2015-02-17 22:34:04,577 DEBUG [main]: parse.CalcitePlanner (CalcitePlanner.java:apply(743)) - Plan After Join Reordering:
      HiveProject(catalog_page_id=[$0], sales=[$1], returns=[$2], profit=[$3]): rowcount = 10590.0, cumulative cost = {8.25242586823495E15 rows, 0.0 cpu, 0.0 io}, id = 1395
        HiveAggregate(group=[{0}], agg#0=[sum($1)], agg#1=[sum($2)], agg#2=[sum($3)]): rowcount = 10590.0, cumulative cost = {8.25242586823495E15 rows, 0.0 cpu, 0.0 io}, id = 1393
          HiveProject($f0=[$14], $f1=[$5], $f2=[coalesce($9, 0)], $f3=[-($6, coalesce($10, 0))]): rowcount = 1.368586152225262E8, cumulative cost = {8.25242586823495E15 rows, 0.0 cpu, 0.0 io}, id = 1391
            HiveJoin(condition=[=($3, $17)], joinType=[inner]): rowcount = 1.368586152225262E8, cumulative cost = {8.25242586823495E15 rows, 0.0 cpu, 0.0 io}, id = 1508
              HiveJoin(condition=[=($2, $15)], joinType=[inner]): rowcount = 2.737172304450524E8, cumulative cost = {8.252425594517495E15 rows, 0.0 cpu, 0.0 io}, id = 1506
                HiveJoin(condition=[=($1, $13)], joinType=[inner]): rowcount = 8.211516913351573E8, cumulative cost = {8.252424773349804E15 rows, 0.0 cpu, 0.0 io}, id = 1504
                  HiveJoin(condition=[=($0, $11)], joinType=[inner]): rowcount = 1.1296953399027347E11, cumulative cost = {8.252311803804096E15 rows, 0.0 cpu, 0.0 io}, id = 1418
                    HiveJoin(condition=[AND(=($2, $7), =($4, $8))], joinType=[left]): rowcount = 8.252311488455487E15, cumulative cost = {3.15348608E8 rows, 0.0 cpu, 0.0 io}, id = 1413
                      HiveProject(cs_sold_date_sk=[$0], cs_catalog_page_sk=[$12], cs_item_sk=[$15], cs_promo_sk=[$16], cs_order_number=[$17], cs_ext_sales_price=[$23], cs_net_profit=[$33]): rowcount = 2.86549727E8, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1324
                        HiveTableScan(table=[[tpcds_bin_orc_200.catalog_sales]]): rowcount = 2.86549727E8, cumulative cost = {0}, id = 1136
                      HiveProject(cr_item_sk=[$2], cr_order_number=[$16], cr_return_amount=[$18], cr_net_loss=[$26]): rowcount = 2.8798881E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1327
                        HiveTableScan(table=[[tpcds_bin_orc_200.catalog_returns]]): rowcount = 2.8798881E7, cumulative cost = {0}, id = 1137
                    HiveProject(d_date_sk=[$0], d_date=[$2]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1371
                      HiveFilter(condition=[between(false, $2, CAST('1998-08-04'):DATE, CAST('1998-09-04'):DATE)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1369
                        HiveTableScan(table=[[tpcds_bin_orc_200.date_dim]]): rowcount = 73049.0, cumulative cost = {0}, id = 1138
                  HiveProject(cp_catalog_page_sk=[$0], cp_catalog_page_id=[$1]): rowcount = 11718.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1375
                    HiveTableScan(table=[[tpcds_bin_orc_200.catalog_page]]): rowcount = 11718.0, cumulative cost = {0}, id = 1139
                HiveProject(i_item_sk=[$0], i_current_price=[$5]): rowcount = 16000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1381
                  HiveFilter(condition=[>($5, 5E1)]): rowcount = 16000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1379
                    HiveTableScan(table=[[tpcds_bin_orc_200.item]]): rowcount = 48000.0, cumulative cost = {0}, id = 1140
              HiveProject(p_promo_sk=[$0], p_channel_tv=[$11]): rowcount = 225.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1387
                HiveFilter(condition=[=($11, 'N')]): rowcount = 225.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1385
                  HiveTableScan(table=[[tpcds_bin_orc_200.promotion]]): rowcount = 450.0, cumulative cost = {0}, id = 1141
      

      Explain plan

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
            Edges:
              Map 1 <- Map 2 (BROADCAST_EDGE)
              Map 3 <- Map 1 (BROADCAST_EDGE)
              Map 4 <- Map 3 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE), Map 7 (BROADCAST_EDGE)
              Reducer 5 <- Map 4 (SIMPLE_EDGE)
            DagName: mmokhtar_20150306141010_d8c1b2d5-f05f-4039-8261-a69b6f18a2ac:1
            Vertices:
              Map 1
                  Map Operator Tree:
                      TableScan
                        alias: catalog_sales
                        filterExpr: (((cs_sold_date_sk is not null and cs_catalog_page_sk is not null) and cs_item_sk is not null) and cs_promo_sk is not null) (type: boolean)
                        Statistics: Num rows: 286549727 Data size: 65825832570 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: (((cs_sold_date_sk is not null and cs_catalog_page_sk is not null) and cs_item_sk is not null) and cs_promo_sk is not null) (type: boolean)
                          Statistics: Num rows: 285112475 Data size: 7974560516 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: cs_sold_date_sk (type: int), cs_catalog_page_sk (type: int), cs_item_sk (type: int), cs_promo_sk (type: int), cs_order_number (type: int), cs_ext_sales_price (type: float), cs_net_profit (type: float)
                            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
                            Statistics: Num rows: 285112475 Data size: 7974560516 Basic stats: COMPLETE Column stats: COMPLETE
                            Map Join Operator
                              condition map:
                                   Left Outer Join0 to 1
                              keys:
                                0 _col2 (type: int), _col4 (type: int)
                                1 _col0 (type: int), _col1 (type: int)
                              outputColumnNames: _col0, _col1, _col2, _col3, _col5, _col6, _col9, _col10
                              input vertices:
                                1 Map 2
                              Statistics: Num rows: 2911 Data size: 93152 Basic stats: COMPLETE Column stats: COMPLETE
                              Reduce Output Operator
                                key expressions: _col0 (type: int)
                                sort order: +
                                Map-reduce partition columns: _col0 (type: int)
                                Statistics: Num rows: 2911 Data size: 93152 Basic stats: COMPLETE Column stats: COMPLETE
                                value expressions: _col1 (type: int), _col2 (type: int), _col3 (type: int), _col5 (type: float), _col6 (type: float), _col9 (type: float), _col10 (type: float)
                  Execution mode: vectorized
              Map 2
                  Map Operator Tree:
                      TableScan
                        alias: catalog_returns
                        filterExpr: cr_item_sk is not null (type: boolean)
                        Statistics: Num rows: 28798881 Data size: 5764329494 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: cr_item_sk is not null (type: boolean)
                          Statistics: Num rows: 28798881 Data size: 456171072 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: cr_item_sk (type: int), cr_order_number (type: int), cr_return_amount (type: float), cr_net_loss (type: float)
                            outputColumnNames: _col0, _col1, _col2, _col3
                            Statistics: Num rows: 28798881 Data size: 456171072 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int), _col1 (type: int)
                              sort order: ++
                              Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
                              Statistics: Num rows: 28798881 Data size: 456171072 Basic stats: COMPLETE Column stats: COMPLETE
                              value expressions: _col2 (type: float), _col3 (type: float)
                  Execution mode: vectorized
              Map 3
                  Map Operator Tree:
                      TableScan
                        alias: date_dim
                        filterExpr: (d_date BETWEEN 1998-08-04 AND 1998-09-04 and d_date_sk is not null) (type: boolean)
                        Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: (d_date BETWEEN 1998-08-04 AND 1998-09-04 and d_date_sk is not null) (type: boolean)
                          Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: d_date_sk (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
                            Map Join Operator
                              condition map:
                                   Inner Join 0 to 1
                              keys:
                                0 _col0 (type: int)
                                1 _col0 (type: int)
                              outputColumnNames: _col1, _col2, _col3, _col5, _col6, _col9, _col10
                              input vertices:
                                0 Map 1
                              Statistics: Num rows: 1456 Data size: 40768 Basic stats: COMPLETE Column stats: COMPLETE
                              Reduce Output Operator
                                key expressions: _col1 (type: int)
                                sort order: +
                                Map-reduce partition columns: _col1 (type: int)
                                Statistics: Num rows: 1456 Data size: 40768 Basic stats: COMPLETE Column stats: COMPLETE
                                value expressions: _col2 (type: int), _col3 (type: int), _col5 (type: float), _col6 (type: float), _col9 (type: float), _col10 (type: float)
                  Execution mode: vectorized
              Map 4
                  Map Operator Tree:
                      TableScan
                        alias: catalog_page
                        filterExpr: cp_catalog_page_sk is not null (type: boolean)
                        Statistics: Num rows: 11718 Data size: 5400282 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: cp_catalog_page_sk is not null (type: boolean)
                          Statistics: Num rows: 11718 Data size: 1218672 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: cp_catalog_page_sk (type: int), cp_catalog_page_id (type: string)
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 11718 Data size: 1218672 Basic stats: COMPLETE Column stats: COMPLETE
                            Map Join Operator
                              condition map:
                                   Inner Join 0 to 1
                              keys:
                                0 _col1 (type: int)
                                1 _col0 (type: int)
                              outputColumnNames: _col2, _col3, _col5, _col6, _col9, _col10, _col14
                              input vertices:
                                0 Map 3
                              Statistics: Num rows: 1456 Data size: 180544 Basic stats: COMPLETE Column stats: COMPLETE
                              Map Join Operator
                                condition map:
                                     Inner Join 0 to 1
                                keys:
                                  0 _col2 (type: int)
                                  1 _col0 (type: int)
                                outputColumnNames: _col3, _col5, _col6, _col9, _col10, _col14
                                input vertices:
                                  1 Map 6
                                Statistics: Num rows: 486 Data size: 58320 Basic stats: COMPLETE Column stats: COMPLETE
                                Map Join Operator
                                  condition map:
                                       Inner Join 0 to 1
                                  keys:
                                    0 _col3 (type: int)
                                    1 _col0 (type: int)
                                  outputColumnNames: _col5, _col6, _col9, _col10, _col14
                                  input vertices:
                                    1 Map 7
                                  Statistics: Num rows: 243 Data size: 28188 Basic stats: COMPLETE Column stats: COMPLETE
                                  Select Operator
                                    expressions: _col14 (type: string), _col5 (type: float), COALESCE(_col9,0) (type: float), (_col6 - COALESCE(_col10,0)) (type: float)
                                    outputColumnNames: _col0, _col1, _col2, _col3
                                    Statistics: Num rows: 243 Data size: 28188 Basic stats: COMPLETE Column stats: COMPLETE
                                    Group By Operator
                                      aggregations: sum(_col1), sum(_col2), sum(_col3)
                                      keys: _col0 (type: string)
                                      mode: hash
                                      outputColumnNames: _col0, _col1, _col2, _col3
                                      Statistics: Num rows: 121 Data size: 15004 Basic stats: COMPLETE Column stats: COMPLETE
                                      Reduce Output Operator
                                        key expressions: _col0 (type: string)
                                        sort order: +
                                        Map-reduce partition columns: _col0 (type: string)
                                        Statistics: Num rows: 121 Data size: 15004 Basic stats: COMPLETE Column stats: COMPLETE
                                        value expressions: _col1 (type: double), _col2 (type: double), _col3 (type: double)
                  Execution mode: vectorized
              Map 6
                  Map Operator Tree:
                      TableScan
                        alias: item
                        filterExpr: ((i_current_price > 50.0) and i_item_sk is not null) (type: boolean)
                        Statistics: Num rows: 48000 Data size: 68732712 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ((i_current_price > 50.0) and i_item_sk is not null) (type: boolean)
                          Statistics: Num rows: 16000 Data size: 127832 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: i_item_sk (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 16000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 16000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized
              Map 7
                  Map Operator Tree:
                      TableScan
                        alias: promotion
                        filterExpr: ((p_channel_tv = 'N') and p_promo_sk is not null) (type: boolean)
                        Statistics: Num rows: 450 Data size: 530848 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ((p_channel_tv = 'N') and p_promo_sk is not null) (type: boolean)
                          Statistics: Num rows: 225 Data size: 20025 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: p_promo_sk (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 225 Data size: 900 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 225 Data size: 900 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized
              Reducer 5
                  Reduce Operator Tree:
                    Group By Operator
                      aggregations: sum(VALUE._col0), sum(VALUE._col1), sum(VALUE._col2)
                      keys: KEY._col0 (type: string)
                      mode: mergepartial
                      outputColumnNames: _col0, _col1, _col2, _col3
                      Statistics: Num rows: 121 Data size: 15004 Basic stats: COMPLETE Column stats: COMPLETE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 121 Data size: 15004 Basic stats: COMPLETE Column stats: COMPLETE
                        table:
                            input format: org.apache.hadoop.mapred.TextInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
      

      Attachments

        Activity

          People

            jpullokkaran Laljo John Pullokkaran
            mmokhtar Mostafa Mokhtar
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: