Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Invalid
-
3.0.0
Description
Repro
CREATE EXTERNAL TABLE `alltypessmall`( `id` int, `bool_col` boolean, `tinyint_col` tinyint, `smallint_col` smallint, `int_col` int, `bigint_col` bigint, `float_col` float, `double_col` double, `date_string_col` string, `string_col` string, `timestamp_col` timestamp) PARTITIONED BY ( `year` int, `month` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'escape.delim'='\\', 'field.delim'=',', 'serialization.format'=',') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ( 'DO_NOT_UPDATE_STATS'='true', 'OBJCAPABILITIES'='EXTREAD,EXTWRITE', 'STATS_GENERATED'='TASK', 'impala.lastComputeStatsTime'='1608312793', 'transient_lastDdlTime'='1608310442') ; insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001'); insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001'); insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,40,3434,5.4,44.3,'str1','str2', '01-01-2001');
Following query should fail but it succeeds
SELECT id FROM alltypessmall WHERE int_col = (SELECT int_col FROM alltypessmall) ORDER BY id;
Explain plan
STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17 Edges: Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE) DagName: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17 Vertices: Map 1 Map Operator Tree: TableScan alias: alltypessmall filterExpr: int_col is not null (type: boolean) Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: int_col is not null (type: boolean) Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: id (type: int), int_col (type: int) outputColumnNames: _col0, _col1 Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 1 outputColumnNames: _col0, _col1 input vertices: 1 Reducer 4 Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 _col1 (type: int) 1 _col0 (type: int) outputColumnNames: _col0 input vertices: 1 Map 3 Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) null sort order: z sort order: + Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE Execution mode: vectorized Map 3 Map Operator Tree: TableScan alias: alltypessmall Statistics: Num rows: 3 Data size: 142 Basic stats: COMPLETE Column stats: COMPLETE Select Operator Statistics: Num rows: 3 Data size: 142 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: count() minReductionHashAggr: 0.6666666 mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: bigint) Filter Operator predicate: int_col is not null (type: boolean) Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: int_col (type: int) outputColumnNames: _col0 Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) null sort order: z sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE Execution mode: vectorized Reducer 2 Execution mode: vectorized Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey0 (type: int) outputColumnNames: _col0 Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Reducer 4 Execution mode: vectorized Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: (sq_count_check(_col0) <= 1) (type: boolean) Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE Select Operator Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
Issue is that Map 3/Reducer 4 group by is producing incorrect count. Expected output in this case is 3 but actual seems to be 1.
CBO plan
HiveSortLimit(sort0=[$0], dir0=[ASC]) HiveProject(id=[$0]) HiveJoin(condition=[=($1, $3)], joinType=[inner], algorithm=[none], cost=[{6.0 rows, 0.0 cpu, 0.0 io}]) HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[{4.0 rows, 0.0 cpu, 0.0 io}]) HiveProject(id=[$0], int_col=[$4]) HiveFilter(condition=[IS NOT NULL($4)]) HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall]) HiveProject(cnt=[$0]) HiveFilter(condition=[<=(sq_count_check($0), 1)]) HiveProject(cnt=[$0]) HiveAggregate(group=[{}], cnt=[COUNT()]) HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall]) HiveProject(int_col=[$4]) HiveFilter(condition=[IS NOT NULL($4)]) HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
Attachments
Issue Links
- links to