Details
- Bug
- Status: Closed
- Blocker
- Resolution: Fixed
- 0.11.0
- None
ReduceSinkDeDuplication optimizer picks the wrong keys in the pRS-cGBYm-cRS-cGBYr scenario when there are distinct keys in the child GBY
Description
Example:
select key, count(distinct value) from (select key, value from src group by key, value) t group by key;
//result
0 0 NULL
10 10 NULL
100 100 NULL
103 103 NULL
104 104 NULL
The result is clearly wrong: the query projects two columns (key and a distinct count), yet each output row has three fields.
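To make the intended semantics concrete, here is a minimal Python sketch of what the query should compute. The rows are hypothetical stand-ins for src, not its real contents, and the function name is made up for illustration.

```python
from collections import defaultdict

# Hypothetical (key, value) rows standing in for src; not the real table data.
rows = [
    ("0", "val_0"),
    ("0", "val_0"),
    ("10", "val_10"),
    ("100", "val_100"),
    ("100", "val_100"),
]


def distinct_count_per_key(rows):
    # Inner query: select key, value from src group by key, value
    deduped = set(rows)
    # Outer query: select key, count(distinct value) ... group by key
    values_by_key = defaultdict(set)
    for key, value in deduped:
        values_by_key[key].add(value)
    return sorted((key, len(values)) for key, values in values_by_key.items())


result = distinct_count_per_key(rows)
for key, count in result:
    print(key, count)
```

Each output row has exactly two fields, a key and a count, which is the shape the Hive query is expected to return.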
Consider a simple group by query with a distinct column:
explain select count(distinct value) from src group by key;
The plan is:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src
          TableScan
            alias: src
            Select Operator
              expressions:
                    expr: key
                    type: string
                    expr: value
                    type: string
              outputColumnNames: key, value
              Group By Operator
                aggregations:
                      expr: count(DISTINCT value)
                bucketGroup: false
                keys:
                      expr: key
                      type: string
                      expr: value
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: string
                  sort order: ++
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col2
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
The map-side GBY also adds the distinct columns (value in this case) to its key columns.
When RSDedup optimizes a query involving a GBY with distinct keys and map-side aggregation is enabled, it currently assigns the map-side GBY's key columns to the reduce-side GBY. So, for the example shown at the beginning, after we generate a plan with a single MR job, the second GBY on the reduce side uses both key and value as its key columns. The correct key column is key alone.
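The key mix-up can be sketched with a simplified Python model. This is not Hive's actual optimizer code; the function names below are hypothetical and only illustrate which column list each GBY should receive.

```python
def map_side_gby_keys(group_keys, distinct_cols):
    # The map-side GBY keys are the grouping keys followed by any
    # distinct columns that are not already grouping keys.
    keys = list(group_keys)
    for col in distinct_cols:
        if col not in keys:
            keys.append(col)
    return keys


def buggy_reduce_side_gby_keys(group_keys, distinct_cols):
    # Models the reported behavior: the reduce-side GBY simply
    # inherits the map-side GBY's key columns.
    return map_side_gby_keys(group_keys, distinct_cols)


def correct_reduce_side_gby_keys(group_keys, distinct_cols):
    # Models the expected behavior: the reduce-side GBY groups
    # only by the original grouping keys.
    return list(group_keys)


# Outer query of the example: group by key, count(distinct value)
group_keys = ["key"]
distinct_cols = ["value"]

map_keys = map_side_gby_keys(group_keys, distinct_cols)
buggy_keys = buggy_reduce_side_gby_keys(group_keys, distinct_cols)
correct_keys = correct_reduce_side_gby_keys(group_keys, distinct_cols)
print(map_keys, buggy_keys, correct_keys)
```

In this model the buggy variant yields key and value for the reduce-side GBY, matching the behavior described above, while the expected variant yields key only.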
Attachments
Issue Links
- is related to HIVE-2340 optimize orderby followed by a groupby (Closed)