HIVE-5357

ReduceSinkDeDuplication optimizer picks the wrong keys in the pRS-cGBYm-cRS-cGBYr scenario when there are distinct keys in the child GBY

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.12.0
    • Component/s: Query Processor
    • Labels:
      None
    • Release Note:
      ReduceSinkDeDuplication optimizer picks the wrong keys in the pRS-cGBYm-cRS-cGBYr scenario when there are distinct keys in the child GBY

      Description

      Example:

      select key, count(distinct value) from (select key, value from src group by key, value) t group by key;
      
      //result
      0 0 NULL
      10  10  NULL
      100 100 NULL
      103 103 NULL
      104 104 NULL
      

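      The result is clearly wrong: the query selects two columns (key and count(distinct value)), yet every row above contains three values, the last of which is NULL.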

      Consider a simple group by query with a distinct column:

      explain select count(distinct value) from src group by key;
      

      The plan is:

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 is a root stage
      
      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Alias -> Map Operator Tree:
              src 
                TableScan
                  alias: src
                  Select Operator
                    expressions:
                          expr: key
                          type: string
                          expr: value
                          type: string
                    outputColumnNames: key, value
                    Group By Operator
                      aggregations:
                            expr: count(DISTINCT value)
                      bucketGroup: false
                      keys:
                            expr: key
                            type: string
                            expr: value
                            type: string
                      mode: hash
                      outputColumnNames: _col0, _col1, _col2
                      Reduce Output Operator
                        key expressions:
                              expr: _col0
                              type: string
                              expr: _col1
                              type: string
                        sort order: ++
                        Map-reduce partition columns:
                              expr: _col0
                              type: string
                        tag: -1
                        value expressions:
                              expr: _col2
                              type: bigint
            Reduce Operator Tree:
              Group By Operator
                aggregations:
                      expr: count(DISTINCT KEY._col1:0._col0)
                bucketGroup: false
                keys:
                      expr: KEY._col0
                      type: string
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Select Operator
                  expressions:
                        expr: _col1
                        type: bigint
                  outputColumnNames: _col0
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
      

      Note that the map-side GBY adds the distinct columns (value in this case) to its key columns, so its keys are key and value rather than key alone.
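
      One way to read the plan above: the Reduce Output Operator partitions on _col0 (key) only but sorts on both _col0 and _col1 (key, value), so each reducer sees the values of one key next to each other. The Python sketch below is a simplified model of that idea, not Hive's actual operator code; the function name and the sample pairs are hypothetical.

```python
from itertools import groupby


def reduce_count_distinct(sorted_pairs):
    """Count distinct values per key from (key, value) pairs sorted by (key, value).

    Equal values of one key are adjacent after sorting, so comparing each
    value with the previous one is enough; no per-key set is needed.
    """
    result = {}
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        distinct = 0
        previous = object()  # sentinel that compares unequal to any real value
        for _, value in group:
            if value != previous:
                distinct += 1
                previous = value
        result[key] = distinct
    return result


# Hypothetical map output, not rows from the real src table.
pairs = [("a", "y"), ("b", "z"), ("a", "x"), ("a", "x")]

# sorted() stands in for the shuffle with sort order "++" on (_col0, _col1).
shuffled = sorted(pairs)
counts = reduce_count_distinct(shuffled)
print(counts)
```

      The reduce-side GBY in the plan groups on KEY._col0 only; the second sort column exists purely to make the distinct values adjacent.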

      When RSDedup optimizes a query involving a GBY with distinct keys and map-side aggregation is enabled, it currently assigns the map-side GBY's key columns to the reduce-side GBY. For the example shown at the beginning, once the plan has been collapsed into a single MR job, the second GBY on the reduce side therefore uses both key and value as its key columns. The correct key column is key alone.
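
      The difference between the two key choices can be illustrated with a small Python model. This is a sketch under stated assumptions: the function names, the buggy flag, and the toy rows are hypothetical and are not taken from the Hive source. It shows only how the grouping changes; it does not reproduce the NULL output of the example.

```python
from collections import defaultdict


def map_side_gby_keys(group_keys, distinct_cols):
    """Map-side GBY keys: the group-by keys followed by the distinct columns."""
    return list(group_keys) + [c for c in distinct_cols if c not in group_keys]


def reduce_side_gby_keys(group_keys, distinct_cols, buggy=False):
    """Keys handed to the reduce-side GBY.

    buggy=True mimics the behaviour described in this issue (the map-side
    keys are reused); buggy=False keeps only the original group-by keys.
    """
    if buggy:
        return map_side_gby_keys(group_keys, distinct_cols)
    return list(group_keys)


def count_distinct(rows, keys, distinct_col):
    """Group dict rows by `keys` and count distinct values of `distinct_col`."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[k] for k in keys)].add(row[distinct_col])
    return {k: len(v) for k, v in groups.items()}


# Hypothetical toy rows, not the contents of the real src table.
rows = [
    {"key": "a", "value": "x"},
    {"key": "a", "value": "y"},
    {"key": "b", "value": "z"},
]

correct = count_distinct(rows, reduce_side_gby_keys(["key"], ["value"]), "value")
wrong = count_distinct(
    rows, reduce_side_gby_keys(["key"], ["value"], buggy=True), "value"
)
print(correct)  # one group per key
print(wrong)    # one group per (key, value) pair
```

      With key alone there is one group per key; with key and value the groups split per pair, so every distinct count collapses to 1.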

              People

              • Assignee:
                chenchun Chun Chen
                Reporter:
                chenchun Chun Chen
              • Votes:
                0
                Watchers:
                6