[IMPALA-2826] Outer join w/ old HJ returns wrong results if 0 outer tbl cols referenced - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.3.0
Fix Version/s: Impala 2.10.0
Component/s: Backend
Labels:
- correctness
- downgraded

Target Version:

Product Backlog

Description

We are not handling null tuples correctly in the old HJ for outer joins when there are no columns referenced from the outer table.

For example:

select t1.int_col, t3.c1
from alltypestiny t1
left outer join
(select 0 as c1 from alltypestiny) t3
on t1.int_col = t3.c1 limit 10
+---------+------+
| int_col | c1   |
+---------+------+
| 0       | NULL |
| 0       | NULL |
| 0       | NULL |
| 0       | NULL |
| 0       | NULL |
| 0       | NULL |
| 0       | NULL |
| 0       | NULL |
| 1       | NULL |
| 0       | NULL |
+---------+------+
Fetched 10 row(s) in 0.12s

When the right side of the left outer join has no columns referenced, it produces a null tuple which doesn't work with the plan generated for left outer joins as the planner sets the build expr to if(TupleIsNull(), NULL, 0).

The reason that this works with the PHJ is that the BufferedTupleStream returns a non-null (but still 0 slots) Tuple for the build rows since the tuple isn't supposed to be nullable (see node 3 w/ non-nullable tupleId 1 in the plan below).

explain select t1.int_col, t3.c1
from alltypestiny t1
left outer join
(select 0 as c1 from alltypestiny) t3
on t1.int_col = t3.c1
+---------------------------------------------------------------+
| Explain String                                                |
+---------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=80.00MB VCores=2      |
|                                                               |
| F02:PLAN FRAGMENT [UNPARTITIONED]                             |
|   04:EXCHANGE [UNPARTITIONED]                                 |
|      hosts=1 per-host-mem=unavailable                         |
|      tuple-ids=0,1N row-size=4B cardinality=8                 |
|                                                               |
| F00:PLAN FRAGMENT [RANDOM]                                    |
|   DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=04, UNPARTITIONED]  |
|   02:HASH JOIN [LEFT OUTER JOIN, BROADCAST]                   |
|   |  hash predicates: t1.int_col = if(TupleIsNull(), NULL, 0) |
|   |  hosts=1 per-host-mem=0B                                  |
|   |  tuple-ids=0,1N row-size=4B cardinality=8                 |
|   |                                                           |
|   |--03:EXCHANGE [BROADCAST]                                  |
|   |     hosts=1 per-host-mem=0B                               |
|   |     tuple-ids=1 row-size=0B cardinality=8                 |
|   |                                                           |
|   00:SCAN HDFS [functional.alltypestiny t1, RANDOM]           |
|      partitions=4/4 files=4 size=460B                         |
|      table stats: 8 rows total                                |
|      column stats: all                                        |
|      hosts=1 per-host-mem=80.00MB                             |
|      tuple-ids=0 row-size=4B cardinality=8                    |
|                                                               |
| F01:PLAN FRAGMENT [RANDOM]                                    |
|   DATASTREAM SINK [FRAGMENT=F00, EXCHANGE=03, BROADCAST]      |
|   01:SCAN HDFS [functional.alltypestiny, RANDOM]              |
|      partitions=4/4 files=4 size=460B                         |
|      table stats: 8 rows total                                |
|      column stats: all                                        |
|      hosts=1 per-host-mem=80.00MB                             |
|      tuple-ids=1 row-size=0B cardinality=8                    |
+---------------------------------------------------------------+
Fetched 33 row(s) in 0.02s

We're confusing what it means to be NULL. It's strange that a tuple that is not supposed to be nullable can be NULL. There are a few things we can consider doing here:
1) Not emitting NULL tuples when the tuple isn't supposed to be nullable. This may have potential perf implications, but it seems like it would really make things easier to reason about.
2) Special NULL handling in cases like this. We found this issue here, but there may be others.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Matthew Jacobs

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 08/Jan/16 19:27

Updated:: 02/Aug/17 19:35

Resolved:: 02/Aug/17 19:35