Description
Consider the following query:
select * from (select * from tab where tab.key = 0)a full outer join (select * from tab_part where tab_part.key = 98)b join tab_part c on a.key = b.key and b.key = c.key;
Hive should execute the full outer join operation (without ON clause) and then the join operation (ON a.key = b.key and b.key = c.key). Instead, it merges both joins, generating the following plan:
STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: tab filterExpr: (key = 0) (type: boolean) Statistics: Num rows: 242 Data size: 22748 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (key = 0) (type: boolean) Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: 0 (type: int), value (type: string), ds (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: string), _col2 (type: string) TableScan alias: tab_part filterExpr: (key = 98) (type: boolean) Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (key = 98) (type: boolean) Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: 98 (type: int), value (type: string), ds (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: string), _col2 (type: string) TableScan alias: c Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: key (type: int) sort order: + Map-reduce partition columns: key (type: int) Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE value expressions: value (type: string), ds (type: string) Reduce Operator Tree: Join Operator condition map: Outer Join 0 to 1 Inner Join 1 to 2 keys: 0 _col0 (type: int) 1 _col0 (type: int) 2 key (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8 Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
That plan is equivalent to the following query, which is different than the original one:
select * from (select * from tab where tab.key = 0)a full outer join (select * from tab_part where tab_part.key = 98)b on a.key = b.key join tab_part c on b.key = c.key;
It seems to be a problem in the recognition of join operations that can be merged into a single multijoin operator.
Attachments
Attachments
Issue Links
- is related to
-
HIVE-12017 Do not disable CBO by default when number of joins in a query is equal or less than 1
- Closed