Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
Description
Consider the following SQL query:
select id, name from person group by id, name;
and assume that the person table contains the following tuples:
insert into person values (0, 'A') ; insert into person values (1, 'A') ; insert into person values (2, 'B') ; insert into person values (3, 'B') ; insert into person values (4, 'B') ; insert into person values (5, 'C') ;
If we know the number of distinct values (NDV) for all columns in the group by clause then we can infer a lower bound for the total number of rows by taking the maximun NDV of the involved columns.
Currently the query in the scenario above has the following plan:
Vertex dependency in root stage Reducer 2 <- Map 1 (SIMPLE_EDGE) Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 2 vectorized File Output Operator [FS_11] Group By Operator [GBY_10] (rows=3 width=92) Output:["_col0","_col1"],keys:KEY._col0, KEY._col1 <-Map 1 [SIMPLE_EDGE] vectorized SHUFFLE [RS_9] PartitionCols:_col0, _col1 Group By Operator [GBY_8] (rows=3 width=92) Output:["_col0","_col1"],keys:id, name Select Operator [SEL_7] (rows=6 width=92) Output:["id","name"] TableScan [TS_0] (rows=6 width=92) default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]
Observe that the stats for group by report 3 rows but given that the ID attribute is part of the aggregation the rows cannot be less than 6.
Attachments
Attachments
Issue Links
- links to