[IMPALA-5504] wrong results with LEFT JOIN, inline view, and COALESCE() - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: Impala 2.8.0, Impala 2.9.0
Fix Version/s: Impala 2.10.0
Component/s: Frontend
Labels:
- correctness
- query_generator

Target Version:

Impala 2.10.0
Epic Color:
ghx-label-4

Description

Summary

This query

USE tpch;
SELECT
COUNT(t1.ps_suppkey)
FROM partsupp t1
LEFT JOIN  (
SELECT
COALESCE(t2.o_custkey, -1) AS coalesce_col
FROM orders t2
LEFT JOIN partsupp t3 ON (t2.o_shippriority) = (t3.ps_partkey)
) sq ON ((t1.ps_suppkey) = (sq.coalesce_col))

produces a different result compared to PostgreSQL. This occurs in both 2.8- and 2.9-based builds, so it doesn't seem to be a recent regression.

The problem is the COALESCE() function in the inline view and its relationship to the LEFT JOIN in the outer query.

Analysis

Because t2.o_custkey is a primary key column and does not contain NULL values:

[localhost:21000] > select count(o_custkey) from orders where o_custkey is null;
+------------------+
| count(o_custkey) |
+------------------+
| 0                |
+------------------+
Fetched 1 row(s) in 0.64s
[localhost:21000] >

This means I can remove the use of COALESCE(t2.o_custkey, -1) and just replace it with t2.o_custkey:

SELECT
COUNT(t1.ps_suppkey) c
FROM partsupp t1
LEFT JOIN  (
SELECT
t2.o_custkey AS coalesce_col
FROM orders t2
LEFT JOIN partsupp t3 ON (t2.o_shippriority) = (t3.ps_partkey)
) sq ON ((t1.ps_suppkey) = (sq.coalesce_col))
ORDER BY c

When I do that, the results end up matching Postgres. This is my grounds for justifying this as an Impala bug, not a Postgres bug: either with or without the COALESCE(), the results should have been the same.

Consider the query profiles for the query without the COALESCE, that produces the correct results:

Operator          #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                        
-----------------------------------------------------------------------------------------------------------------------------
10:AGGREGATE           1  196.528us  196.528us        1           1   24.00 KB       10.00 MB  FINALIZE                      
09:EXCHANGE            1   38.678us   38.678us        2           1          0              0  UNPARTITIONED                 
05:AGGREGATE           2   68.671ms   75.556ms        2           1   50.89 KB       10.00 MB                                
04:HASH JOIN           2  460.980ms  472.610ms    8.23M     800.00K  114.07 MB        6.71 MB  RIGHT OUTER JOIN, PARTITIONED 
|--08:EXCHANGE         2   53.569ms   60.459ms  800.00K     800.00K          0              0  HASH((t1.ps_suppkey))         
|  00:SCAN HDFS        1   19.808ms   19.808ms  800.00K     800.00K   33.02 MB      176.00 MB  tpch.partsupp t1              
07:EXCHANGE            2   91.832ms  101.949ms    1.50M       1.50M          0              0  HASH(t2.o_custkey)            
03:HASH JOIN           2  206.506ms  228.767ms    1.50M       1.50M  178.05 MB        6.71 MB  LEFT OUTER JOIN, BROADCAST    
|--06:EXCHANGE         2   95.342ms   97.771ms  800.00K     800.00K          0              0  BROADCAST                     
|  02:SCAN HDFS        1   34.058ms   34.058ms  800.00K     800.00K   33.02 MB      176.00 MB  tpch.partsupp t3              
01:SCAN HDFS           2   93.905ms  101.139ms    1.50M       1.50M   33.25 MB      176.00 MB  tpch.orders t2

Now compare this to the profile with the incorrect results:

Operator          #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail                         
------------------------------------------------------------------------------------------------------------------------------
10:AGGREGATE           1  198.204us  198.204us        1           1   24.00 KB       10.00 MB  FINALIZE                       
09:EXCHANGE            1   93.555us   93.555us        2           1          0              0  UNPARTITIONED                  
05:AGGREGATE           2    3.066ms    3.717ms        2           1    8.04 MB       10.00 MB                                 
04:HASH JOIN           2  140.679ms  236.316ms  800.00K     800.00K  114.06 MB        6.71 MB  RIGHT OUTER JOIN, PARTITIONED  
|--08:EXCHANGE         2   51.184ms   53.408ms  800.00K     800.00K          0              0  HASH((t1.ps_suppkey))          
|  00:SCAN HDFS        1    9.601ms    9.601ms  800.00K     800.00K   33.02 MB      176.00 MB  tpch.partsupp t1               
07:EXCHANGE            2   88.551ms  177.099ms    1.50M       1.50M          0              0  HASH(if(TupleIsNull(), NULL... 
03:HASH JOIN           2  216.025ms  241.128ms    1.50M       1.50M  178.05 MB        6.71 MB  LEFT OUTER JOIN, BROADCAST     
|--06:EXCHANGE         2   93.506ms  105.640ms  800.00K     800.00K          0              0  BROADCAST                      
|  02:SCAN HDFS        1  112.021ms  112.021ms  800.00K     800.00K   32.63 MB      176.00 MB  tpch.partsupp t3               
01:SCAN HDFS           2   50.246ms   86.962ms    1.50M       1.50M   33.25 MB      176.00 MB  tpch.orders t2

In the second, incorrect results with COALESCE() case, we see that the 04 HASH JOIN operator processed 800K rows, whereas in the first, without COALESCE, it processed 8.23M. It turns out the partsupp table has exactly 800k rows. So in the buggy case, somehow the rows coming in from the orders side are not being processed by the JOIN.

Note this was originally found by the random query generator; this was the original query:

USE tpch;
SELECT
COALESCE(AVG(a4.o_custkey), 30.0083477187, (COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey)) * (-65.4342458142)) AS float_col,
COUNT(a1.ps_suppkey) AS int_col,
COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey) AS int_col_1
FROM partsupp a1
LEFT JOIN (
SELECT
a2.o_custkey,
COALESCE(COALESCE(a3.ps_availqty, a3.ps_partkey, a3.ps_suppkey), a3.ps_suppkey, COALESCE(a2.o_custkey, a3.ps_availqty)) AS int_col,
a2.o_orderstatus,
COALESCE(a2.o_custkey, LEAST(COALESCE(a2.o_custkey, 825), COALESCE(COALESCE(a2.o_orderkey, a3.ps_availqty, a2.o_orderkey), 538)), a3.ps_availqty) AS int_col_1,
COALESCE(a2.o_orderdate, a2.o_clerk) AS char_col
FROM orders a2
LEFT JOIN partsupp a3 ON (a2.o_shippriority) = (a3.ps_partkey)
WHERE
((a2.o_custkey) IS DISTINCT FROM (a3.ps_partkey)) OR ((a2.o_totalprice) IS NULL)
) a4 ON (False) OR ((a1.ps_suppkey) = (a4.int_col_1))
WHERE
((a4.int_col_1) IS NOT DISTINCT FROM (a1.ps_suppkey)) AND ((a4.int_col) NOT IN (a1.ps_partkey, a1.ps_availqty))
GROUP BY
COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey)

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

profile-wrong-results.txt
13/Jun/17 21:36
61 kB
Michael Brown
profile-coalesce-removed.txt
13/Jun/17 21:36
61 kB
Michael Brown

Activity

People

Assignee:: Alexander Behm

Reporter:: Michael Brown

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 13/Jun/17 21:29

Updated:: 21/Aug/17 22:08

Resolved:: 21/Aug/17 22:08