Description
Random query generation uncovered the following query, which returns incorrect results when run on Spark SQL. This is not the original query produced by the generator; I performed some minimization to make it easier to understand.
With the following tables:
// run in a Spark 1.x spark-shell; a standalone app also needs
// import sqlContext.implicits._ (for toDF)
val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
val t2 = sc.parallelize(
  Seq(
    (-769, -244),
    (-800, -409),
    (940, 86),
    (-507, 304),
    (-367, 158))
).toDF("int_col_2", "int_col_5")
t1.registerTempTable("t1")
t2.registerTempTable("t2")
run the following query:
SELECT (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
       ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
FROM t1 RIGHT JOIN t2 ON (t2.int_col_2) = (t1.int_col_5)
GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
         COALESCE(t1.int_col_5, t2.int_col_2)
HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
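For convenience, here is a sketch of running the query end to end. It assumes the temp tables registered above and a Spark 1.x spark-shell session where sqlContext is predefined:

// Sketch: run the query above via sqlContext (assumes a Spark 1.x
// spark-shell session and the temp tables t1/t2 registered earlier).
val query = """
  |SELECT (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
  |       ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
  |FROM t1 RIGHT JOIN t2 ON (t2.int_col_2) = (t1.int_col_5)
  |GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
  |         COALESCE(t1.int_col_5, t2.int_col_2)
  |HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
  |""".stripMargin
sqlContext.sql(query).show()  // prints an empty result on affected versions

Dropping the final HAVING line from the query instead produces the five rows shown below.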
In Spark SQL, this returns an empty result set, whereas Postgres returns four rows. If I omit the HAVING clause, the per-group aggregates show that the HAVING clause is incorrectly filtering out every group:
+--------------------------------------+---------------------------------------+--+
| sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2)  |
+--------------------------------------+---------------------------------------+--+
| -507                                 | -1014                                 |
| 940                                  | 1880                                  |
| -769                                 | -1538                                 |
| -367                                 | -734                                  |
| -800                                 | -1600                                 |
+--------------------------------------+---------------------------------------+--+
Based on this output, adding the HAVING clause should yield four rows, not zero: every group except the one with sum 940 satisfies the predicate (940 > 1880 is false).
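A quick sanity check in plain Scala (no Spark needed), applying the HAVING predicate by hand to the pairs from the output above, confirms the expected count:

// Apply the HAVING predicate sum > coalesce * 2 to each group above.
val groups = Seq((-507, -1014), (940, 1880), (-769, -1538), (-367, -734), (-800, -1600))
val kept = groups.filter { case (s, d) => s > d }
println(kept.size)  // 4: only the (940, 1880) group fails the predicate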
I'm not sure how to shrink this further in a straightforward way, so I'm opening this bug to get help with further triage.
Issue Links
- duplicates SPARK-16991: Full outer join followed by inner join produces wrong results (Resolved)