Details
-
Sub-task
-
Status: Closed
-
Blocker
-
Resolution: Fixed
-
0.14.0
-
None
-
None
Description
The stats annotation for a group-by only annotates the reduce-side row-count with the distinct values.
The map-side gets the row-count as the rows output instead of distinct * parallelism, while the reducer side gets the correct parallelism.
hive> explain select distinct L_SHIPDATE from lineitem; Vertices: Map 1 Map Operator Tree: TableScan alias: lineitem Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: l_shipdate (type: string) outputColumnNames: l_shipdate Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator keys: l_shipdate (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE Execution mode: vectorized Reducer 2 Reduce Operator Tree: Group By Operator keys: KEY._col0 (type: string) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: string) outputColumnNames: _col0 Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE