[HIVE-12465] Hive might produce wrong results when (outer) joins are merged - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.3.0, 2.0.0
Fix Version/s: 1.3.0, 2.0.0
Component/s: None
Labels:
None

Target Version/s:

2.0.0

Description

Consider the following query:

select * from
  (select * from tab where tab.key = 0)a
full outer join
  (select * from tab_part where tab_part.key = 98)b
join
  tab_part c
on a.key = b.key and b.key = c.key;

Hive should execute the full outer join operation (without ON clause) and then the join operation (ON a.key = b.key and b.key = c.key). Instead, it merges both joins, generating the following plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: tab
            filterExpr: (key = 0) (type: boolean)
            Statistics: Num rows: 242 Data size: 22748 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (key = 0) (type: boolean)
              Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 0 (type: int), value (type: string), ds (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: string), _col2 (type: string)
          TableScan
            alias: tab_part
            filterExpr: (key = 98) (type: boolean)
            Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (key = 98) (type: boolean)
              Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 98 (type: int), value (type: string), ds (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: string), _col2 (type: string)
          TableScan
            alias: c
            Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: key (type: int)
              sort order: +
              Map-reduce partition columns: key (type: int)
              Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE
              value expressions: value (type: string), ds (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
               Inner Join 1 to 2
          keys:
            0 _col0 (type: int)
            1 _col0 (type: int)
            2 key (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
          Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

That plan is equivalent to the following query, which is different than the original one:

select * from
  (select * from tab where tab.key = 0)a
full outer join
  (select * from tab_part where tab_part.key = 98)b
on a.key = b.key
join
  tab_part c
on b.key = c.key;

It seems to be a problem in the recognition of join operations that can be merged into a single multijoin operator.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

HIVE-12465.01.patch
23/Nov/15 13:41
30 kB
jcamachorodriguez
HIVE-12465.02.patch
24/Nov/15 15:46
59 kB
jcamachorodriguez
HIVE-12465.patch
20/Nov/15 21:08
2 kB
jcamachorodriguez
HIVE-12465-branch1.patch
27/Nov/15 09:32
3 kB
jcamachorodriguez

Issue Links

is related to

HIVE-12017 Do not disable CBO by default when number of joins in a query is equal or less than 1

Closed

Hive might produce wrong results when (outer) joins are merged

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates