Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
Impala 2.8.0, Impala 2.9.0
-
ghx-label-4
Description
Summary
This query
USE tpch; SELECT COUNT(t1.ps_suppkey) FROM partsupp t1 LEFT JOIN ( SELECT COALESCE(t2.o_custkey, -1) AS coalesce_col FROM orders t2 LEFT JOIN partsupp t3 ON (t2.o_shippriority) = (t3.ps_partkey) ) sq ON ((t1.ps_suppkey) = (sq.coalesce_col))
produces a different result compared to PostgreSQL. This occurs in both 2.8- and 2.9-based builds, so it doesn't seem to be a recent regression.
The problem is the COALESCE() function in the inline view and its relationship to the LEFT JOIN in the outer query.
Analysis
Because t2.o_custkey is a primary key column and does not contain NULL values:
[localhost:21000] > select count(o_custkey) from orders where o_custkey is null; +------------------+ | count(o_custkey) | +------------------+ | 0 | +------------------+ Fetched 1 row(s) in 0.64s [localhost:21000] >
This means I can remove the use of COALESCE(t2.o_custkey, -1) and just replace it with t2.o_custkey:
SELECT COUNT(t1.ps_suppkey) c FROM partsupp t1 LEFT JOIN ( SELECT t2.o_custkey AS coalesce_col FROM orders t2 LEFT JOIN partsupp t3 ON (t2.o_shippriority) = (t3.ps_partkey) ) sq ON ((t1.ps_suppkey) = (sq.coalesce_col)) ORDER BY c
When I do that, the results end up matching Postgres. This is my grounds for justifying this as an Impala bug, not a Postgres bug: either with or without the COALESCE(), the results should have been the same.
Consider the query profiles for the query without the COALESCE, that produces the correct results:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail ----------------------------------------------------------------------------------------------------------------------------- 10:AGGREGATE 1 196.528us 196.528us 1 1 24.00 KB 10.00 MB FINALIZE 09:EXCHANGE 1 38.678us 38.678us 2 1 0 0 UNPARTITIONED 05:AGGREGATE 2 68.671ms 75.556ms 2 1 50.89 KB 10.00 MB 04:HASH JOIN 2 460.980ms 472.610ms 8.23M 800.00K 114.07 MB 6.71 MB RIGHT OUTER JOIN, PARTITIONED |--08:EXCHANGE 2 53.569ms 60.459ms 800.00K 800.00K 0 0 HASH((t1.ps_suppkey)) | 00:SCAN HDFS 1 19.808ms 19.808ms 800.00K 800.00K 33.02 MB 176.00 MB tpch.partsupp t1 07:EXCHANGE 2 91.832ms 101.949ms 1.50M 1.50M 0 0 HASH(t2.o_custkey) 03:HASH JOIN 2 206.506ms 228.767ms 1.50M 1.50M 178.05 MB 6.71 MB LEFT OUTER JOIN, BROADCAST |--06:EXCHANGE 2 95.342ms 97.771ms 800.00K 800.00K 0 0 BROADCAST | 02:SCAN HDFS 1 34.058ms 34.058ms 800.00K 800.00K 33.02 MB 176.00 MB tpch.partsupp t3 01:SCAN HDFS 2 93.905ms 101.139ms 1.50M 1.50M 33.25 MB 176.00 MB tpch.orders t2
Now compare this to the profile with the incorrect results:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail ------------------------------------------------------------------------------------------------------------------------------ 10:AGGREGATE 1 198.204us 198.204us 1 1 24.00 KB 10.00 MB FINALIZE 09:EXCHANGE 1 93.555us 93.555us 2 1 0 0 UNPARTITIONED 05:AGGREGATE 2 3.066ms 3.717ms 2 1 8.04 MB 10.00 MB 04:HASH JOIN 2 140.679ms 236.316ms 800.00K 800.00K 114.06 MB 6.71 MB RIGHT OUTER JOIN, PARTITIONED |--08:EXCHANGE 2 51.184ms 53.408ms 800.00K 800.00K 0 0 HASH((t1.ps_suppkey)) | 00:SCAN HDFS 1 9.601ms 9.601ms 800.00K 800.00K 33.02 MB 176.00 MB tpch.partsupp t1 07:EXCHANGE 2 88.551ms 177.099ms 1.50M 1.50M 0 0 HASH(if(TupleIsNull(), NULL... 03:HASH JOIN 2 216.025ms 241.128ms 1.50M 1.50M 178.05 MB 6.71 MB LEFT OUTER JOIN, BROADCAST |--06:EXCHANGE 2 93.506ms 105.640ms 800.00K 800.00K 0 0 BROADCAST | 02:SCAN HDFS 1 112.021ms 112.021ms 800.00K 800.00K 32.63 MB 176.00 MB tpch.partsupp t3 01:SCAN HDFS 2 50.246ms 86.962ms 1.50M 1.50M 33.25 MB 176.00 MB tpch.orders t2
In the second, incorrect results with COALESCE() case, we see that the 04 HASH JOIN operator processed 800K rows, whereas in the first, without COALESCE, it processed 8.23M. It turns out the partsupp table has exactly 800k rows. So in the buggy case, somehow the rows coming in from the orders side are not being processed by the JOIN.
Note this was originally found by the random query generator; this was the original query:
USE tpch; SELECT COALESCE(AVG(a4.o_custkey), 30.0083477187, (COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey)) * (-65.4342458142)) AS float_col, COUNT(a1.ps_suppkey) AS int_col, COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey) AS int_col_1 FROM partsupp a1 LEFT JOIN ( SELECT a2.o_custkey, COALESCE(COALESCE(a3.ps_availqty, a3.ps_partkey, a3.ps_suppkey), a3.ps_suppkey, COALESCE(a2.o_custkey, a3.ps_availqty)) AS int_col, a2.o_orderstatus, COALESCE(a2.o_custkey, LEAST(COALESCE(a2.o_custkey, 825), COALESCE(COALESCE(a2.o_orderkey, a3.ps_availqty, a2.o_orderkey), 538)), a3.ps_availqty) AS int_col_1, COALESCE(a2.o_orderdate, a2.o_clerk) AS char_col FROM orders a2 LEFT JOIN partsupp a3 ON (a2.o_shippriority) = (a3.ps_partkey) WHERE ((a2.o_custkey) IS DISTINCT FROM (a3.ps_partkey)) OR ((a2.o_totalprice) IS NULL) ) a4 ON (False) OR ((a1.ps_suppkey) = (a4.int_col_1)) WHERE ((a4.int_col_1) IS NOT DISTINCT FROM (a1.ps_suppkey)) AND ((a4.int_col) NOT IN (a1.ps_partkey, a1.ps_availqty)) GROUP BY COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey)