Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-8769

Physical optimizer : Incorrect CE results in a shuffle join instead of a Map join (PK/FK pattern not detected)

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.0, 1.0.0, 1.2.0, 1.1.0
    • Fix Version/s: 1.2.1
    • Component/s: Physical Optimizer
    • Labels:
      None

      Description

      TPC-DS Q82 is running slower than hive 13 because the join type is not correct.

      The estimate for item x inventory x date_dim is 227 Million rows while the actual is 3K rows.

      Hive 13 finishes in 753 seconds.
      Hive 14 finishes in 1,267 seconds.
      Hive 14 + force map join finished in 431 seconds.

      Query

      select  i_item_id
             ,i_item_desc
             ,i_current_price
       from item, inventory, date_dim, store_sales
       where i_current_price between 30 and 30+30
       and inv_item_sk = i_item_sk
       and d_date_sk=inv_date_sk
       and d_date between '2002-05-30' and '2002-07-30'
       and i_manufact_id in (437,129,727,663)
       and inv_quantity_on_hand between 100 and 500
       and ss_item_sk = i_item_sk
       group by i_item_id,i_item_desc,i_current_price
       order by i_item_id
       limit 100
      

      Plan

      STAGE PLANS:
        Stage: Stage-1
          Tez
            Edges:
              Map 7 <- Map 1 (BROADCAST_EDGE), Map 2 (BROADCAST_EDGE)
              Reducer 4 <- Map 3 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE)
              Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
              Reducer 6 <- Reducer 5 (SIMPLE_EDGE)
            DagName: mmokhtar_20141106005353_7a2eb8df-12ff-4fe9-89b4-30f1e4e3fb90:1
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: item
                        filterExpr: ((i_current_price BETWEEN 30 AND 60 and (i_manufact_id) IN (437, 129, 727, 663)) and i_item_sk is not null) (type: boolean)
                        Statistics: Num rows: 462000 Data size: 663862160 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ((i_current_price BETWEEN 30 AND 60 and (i_manufact_id) IN (437, 129, 727, 663)) and i_item_sk is not null) (type: boolean)
                          Statistics: Num rows: 115500 Data size: 34185680 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: i_item_sk (type: int), i_item_id (type: string), i_item_desc (type: string), i_current_price (type: float)
                            outputColumnNames: _col0, _col1, _col2, _col3
                            Statistics: Num rows: 115500 Data size: 33724832 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 115500 Data size: 33724832 Basic stats: COMPLETE Column stats: COMPLETE
                              value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: float)
                  Execution mode: vectorized
              Map 2 
                  Map Operator Tree:
                      TableScan
                        alias: date_dim
                        filterExpr: (d_date BETWEEN '2002-05-30' AND '2002-07-30' and d_date_sk is not null) (type: boolean)
                        Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: (d_date BETWEEN '2002-05-30' AND '2002-07-30' and d_date_sk is not null) (type: boolean)
                          Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: d_date_sk (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
                            Select Operator
                              expressions: _col0 (type: int)
                              outputColumnNames: _col0
                              Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE
                              Group By Operator
                                keys: _col0 (type: int)
                                mode: hash
                                outputColumnNames: _col0
                                Statistics: Num rows: 18262 Data size: 73048 Basic stats: COMPLETE Column stats: COMPLETE
                                Dynamic Partitioning Event Operator
                                  Target Input: inventory
                                  Partition key expr: inv_date_sk
                                  Statistics: Num rows: 18262 Data size: 73048 Basic stats: COMPLETE Column stats: COMPLETE
                                  Target column: inv_date_sk
                                  Target Vertex: Map 7
                  Execution mode: vectorized
              Map 3 
                  Map Operator Tree:
                      TableScan
                        alias: store_sales
                        filterExpr: ss_item_sk is not null (type: boolean)
                        Statistics: Num rows: 82510879939 Data size: 6873789738208 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ss_item_sk is not null (type: boolean)
                          Statistics: Num rows: 82510879939 Data size: 330043519756 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: ss_item_sk (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 82510879939 Data size: 330043519756 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 82510879939 Data size: 330043519756 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized
              Map 7 
                  Map Operator Tree:
                      TableScan
                        alias: inventory
                        filterExpr: (inv_quantity_on_hand BETWEEN 100 AND 500 and inv_item_sk is not null) (type: boolean)
                        Statistics: Num rows: 1,627,857,000 Data size: 19208695084 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: (inv_quantity_on_hand BETWEEN 100 AND 500 and inv_item_sk is not null) (type: boolean)
                          Statistics: Num rows: 813,928,500 Data size: 9604347540 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: inv_item_sk (type: int), inv_date_sk (type: int)
                            outputColumnNames: _col0, _col2
                            Statistics: Num rows: 813,928,500 Data size: 6511428000 Basic stats: COMPLETE Column stats: COMPLETE
                            Map Join Operator
                              condition map:
                                   Inner Join 0 to 1
                              condition expressions:
                                0 {_col2}
                                1 {_col0} {_col1} {_col2} {_col3}
                              keys:
                                0 _col0 (type: int)
                                1 _col0 (type: int)
                              outputColumnNames: _col2, _col3, _col4, _col5, _col6
                              input vertices:
                                1 Map 1
                              Statistics: Num rows: 203,482,128 Data size: 59416781376 Basic stats: COMPLETE Column stats: COMPLETE
                              Map Join Operator
                                condition map:
                                     Inner Join 0 to 1
                                condition expressions:
                                  0 {_col3} {_col4} {_col5} {_col6}
                                  1 
                                keys:
                                  0 _col2 (type: int)
                                  1 _col0 (type: int)
                                outputColumnNames: _col3, _col4, _col5, _col6
                                input vertices:
                                  1 Map 2
                                Statistics: Num rows: 227,514,273 Data size: 66434167716 Basic stats: COMPLETE Column stats: COMPLETE
                                Select Operator
                                  expressions: _col3 (type: int), _col4 (type: string), _col5 (type: string), _col6 (type: float)
                                  outputColumnNames: _col3, _col4, _col5, _col6
                                  Statistics: Num rows: 227514273 Data size: 66434167716 Basic stats: COMPLETE Column stats: COMPLETE
                                  Reduce Output Operator
                                    key expressions: _col3 (type: int)
                                    sort order: +
                                    Map-reduce partition columns: _col3 (type: int)
                                    Statistics: Num rows: 227514273 Data size: 66434167716 Basic stats: COMPLETE Column stats: COMPLETE
                                    value expressions: _col4 (type: string), _col5 (type: string), _col6 (type: float)
                  Execution mode: vectorized
              Reducer 4 
                  Reduce Operator Tree:
                    Merge Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 
                        1 {VALUE._col3} {VALUE._col4} {VALUE._col5}
                      outputColumnNames: _col5, _col6, _col7
                      Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE
                      Select Operator
                        expressions: _col5 (type: string), _col6 (type: string), _col7 (type: float)
                        outputColumnNames: _col0, _col1, _col2
                        Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          keys: _col0 (type: string), _col1 (type: string), _col2 (type: float)
                          mode: hash
                          outputColumnNames: _col0, _col1, _col2
                          Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: float)
                            sort order: +++
                            Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: float)
                            Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE
              Reducer 5 
                  Reduce Operator Tree:
                    Group By Operator
                      keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: float)
                      mode: mergepartial
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE
                      Select Operator
                        expressions: _col0 (type: string), _col1 (type: string), _col2 (type: float)
                        outputColumnNames: _col0, _col1, _col2
                        Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE
                        Reduce Output Operator
                          key expressions: _col0 (type: string)
                          sort order: +
                          Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE
                          TopN Hash Memory Usage: 0.04
                          value expressions: _col1 (type: string), _col2 (type: float)
                  Execution mode: vectorized
              Reducer 6 
                  Reduce Operator Tree:
                    Select Operator
                      expressions: KEY.reducesinkkey0 (type: string), VALUE._col0 (type: string), VALUE._col1 (type: float)
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic stats: COMPLETE Column stats: COMPLETE
                      Limit
                        Number of rows: 100
                        Statistics: Num rows: 100 Data size: 28800 Basic stats: COMPLETE Column stats: COMPLETE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 100 Data size: 28800 Basic stats: COMPLETE Column stats: COMPLETE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  Execution mode: vectorized
      
        Stage: Stage-0
          Fetch Operator
            limit: 100
            Processor Tree:
              ListSink
      

      Actual rows counts

      VERTICES         TOTAL_TASKS  FAILED_ATTEMPTS KILLED_TASKS DURATION_SECONDS    CPU_TIME_MILLIS     GC_TIME_MILLIS  INPUT_RECORDS   OUTPUT_RECORDS 
      Map 1                      1                0            0             1.01              1,280                  0        462,000               65
      Map 2                      1                0            0             0.41                400                 23         10,000               62
      Map 3                   2574                0            0           947.79        442,220,640          1,887,714 82,510,879,939   82,510,879,939
      Map 7                      9                0            0           869.22             42,490              1,215     56,133,127            3,081
      Reducer 4               1009                0            0           389.75         69,471,510          1,149,529 82,510,883,020               33
      Reducer 5                253                0            0            93.73            938,930             26,150             33               33
      Reducer 6                  1                0            0             2.08                730                 10             33               33
      

        Attachments

        1. HIVE-8769.01.patch
          4.03 MB
          Pengcheng Xiong
        2. HIVE-8769.02.patch
          3.93 MB
          Pengcheng Xiong
        3. HIVE-8769.03.patch
          3.93 MB
          Pengcheng Xiong
        4. HIVE-8769.04.patch
          3.93 MB
          Pengcheng Xiong
        5. HIVE-8769.05.patch
          3.93 MB
          Pengcheng Xiong

          Issue Links

            Activity

              People

              • Assignee:
                pxiong Pengcheng Xiong
                Reporter:
                mmokhtar Mostafa Mokhtar
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: