Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
Impala 2.3.0, Impala 2.5.0
Description
Impala may return incorrect results for queries that have the following properties:
- There is an INNER JOIN following a series of OUTER JOINs
- The INNER JOIN has an On-clause with a predicate that references at least two tables that are on the nullable side of the preceding OUTER JOINs
Query to repro and its plan:
select 1 from functional.alltypes a left outer join functional.alltypes b on a.id = b.id left outer join functional.alltypes c on b.id = c.id right outer join functional.alltypes d on c.id = d.id inner join functional.alltypes e on b.int_col = c.int_col +-----------------------------------------------------------+ | Explain String | +-----------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=480.04MB VCores=4 | | | | 14:EXCHANGE [UNPARTITIONED] | | | | | 08:NESTED LOOP JOIN [CROSS JOIN, BROADCAST] | | | | | |--13:EXCHANGE [BROADCAST] | | | | | | | 04:SCAN HDFS [functional.alltypes e] | | | partitions=24/24 files=24 size=478.45KB | | | | | 07:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] | | | hash predicates: c.id = d.id | | | runtime filters: RF000 <- d.id | | | | | |--12:EXCHANGE [HASH(d.id)] | | | | | | | 03:SCAN HDFS [functional.alltypes d] | | | partitions=24/24 files=24 size=478.45KB | | | | | 06:HASH JOIN [LEFT OUTER JOIN, PARTITIONED] | | | hash predicates: b.id = c.id | | | other predicates: b.int_col = c.int_col <--- incorrect placement; should be at node 07 or 08 | | | runtime filters: RF001 <- c.int_col | | | | | |--11:EXCHANGE [HASH(c.id)] | | | | | | | 02:SCAN HDFS [functional.alltypes c] | | | partitions=24/24 files=24 size=478.45KB | | | runtime filters: RF000 -> c.id | | | | | 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] | | | hash predicates: b.id = a.id | | | runtime filters: RF002 <- a.id | | | | | |--10:EXCHANGE [HASH(a.id)] | | | | | | | 00:SCAN HDFS [functional.alltypes a] | | | partitions=24/24 files=24 size=478.45KB | | | | | 09:EXCHANGE [HASH(b.id)] | | | | | 01:SCAN HDFS [functional.alltypes b] | | partitions=24/24 files=24 size=478.45KB | | runtime filters: RF001 -> b.int_col, RF002 -> b.id | +-----------------------------------------------------------+
Workaround
For some queries, this problem can be worked around by placing the problematic On-clause predicate in the WHERE clause instead, or changing the preceding OUTER JOINS to INNER JOINS (if the On-clause predicate would discard NULLs).
To fix the example query above:
select 1 from functional.alltypes a left outer join functional.alltypes b on a.id = b.id left outer join functional.alltypes c on b.id = c.id right outer join functional.alltypes d on c.id = d.id inner join functional.alltypes e where b.int_col = c.int_col +-----------------------------------------------------------+ | Explain String | +-----------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=480.04MB VCores=4 | | | | 14:EXCHANGE [UNPARTITIONED] | | | | | 08:NESTED LOOP JOIN [CROSS JOIN, BROADCAST] | | | | | |--13:EXCHANGE [BROADCAST] | | | | | | | 04:SCAN HDFS [functional.alltypes e] | | | partitions=24/24 files=24 size=478.45KB | | | | | 07:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] | | | hash predicates: c.id = d.id | | | other predicates: b.int_col = c.int_col <-- correct assignment | | | runtime filters: RF000 <- d.id | | | | | |--12:EXCHANGE [HASH(d.id)] | | | | | | | 03:SCAN HDFS [functional.alltypes d] | | | partitions=24/24 files=24 size=478.45KB | | | | | 06:HASH JOIN [LEFT OUTER JOIN, PARTITIONED] | | | hash predicates: b.id = c.id | | | | | |--11:EXCHANGE [HASH(c.id)] | | | | | | | 02:SCAN HDFS [functional.alltypes c] | | | partitions=24/24 files=24 size=478.45KB | | | runtime filters: RF000 -> c.id | | | | | 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] | | | hash predicates: b.id = a.id | | | runtime filters: RF001 <- a.id | | | | | |--10:EXCHANGE [HASH(a.id)] | | | | | | | 00:SCAN HDFS [functional.alltypes a] | | | partitions=24/24 files=24 size=478.45KB | | | | | 09:EXCHANGE [HASH(b.id)] | | | | | 01:SCAN HDFS [functional.alltypes b] | | partitions=24/24 files=24 size=478.45KB | | runtime filters: RF001 -> b.id | +-----------------------------------------------------------+