Consider the following query:
sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
println(sql("""
  SELECT *
  FROM (
    SELECT COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
    FROM table_3 t1
    LEFT JOIN table_4 t2 ON false
  ) t
  WHERE (t.int_col) IS NOT NULL
""").collect().toSeq)
In the innermost query, the LEFT JOIN's condition is always false, but a left outer join must still emit one row per row of table_3 (which is non-empty), with NULLs on the right side. COALESCE then falls back to t1.int_col_6, so no value in int_col is null, and the outer WHERE should retain all rows. The overall result of this query should therefore be a single row with the value '97'.
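The expected semantics can be sketched with plain Scala collections (no Spark involved; the variable names are illustrative, not from the report): a LEFT JOIN whose condition never matches pads the right side with null, and COALESCE falls back to the left column.

```scala
object LeftJoinOnFalseSketch {
  def main(args: Array[String]): Unit = {
    val table3 = Seq(97) // t1.int_col_6
    val table4 = Seq(0)  // t2.int_col_1 (never matched: join condition is false)

    // LEFT JOIN ... ON false: every left row survives, paired with a null
    // (here None) right side, because no right row can ever match.
    val joined: Seq[(Int, Option[Int])] = table3.map(left => (left, None))

    // COALESCE(t2.int_col_1, t1.int_col_6): take the right value if present,
    // otherwise fall back to the left value.
    val coalesced = joined.map { case (left, right) => right.getOrElse(left) }

    // WHERE int_col IS NOT NULL: nothing here is null, so all rows survive.
    val result = coalesced.filter(_ != null)

    assert(result == Seq(97))
    println(result)
  }
}
```

Under these semantics the query must return one row, 97, which is what Spark 2.0 does; master returns an empty result instead.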
Instead, the current Spark master (as of 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking at the explain output, it appears that the logical plan is being optimized to LocalRelation &lt;empty&gt;, so Spark doesn't even run the query. My suspicion is that there's a bug in constraint propagation or filter pushdown.
This issue doesn't seem to affect Spark 2.0, so I think it's a regression in master.
Links to: SPARK-16991 (Full outer join followed by inner join produces wrong results)