Consider the following query:
sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
println(sql("""
  SELECT *
  FROM (
    SELECT COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
    FROM table_3 t1
    LEFT JOIN table_4 t2 ON false
  ) t
  WHERE (t.int_col) IS NOT NULL
""").collect().toSeq)
In the innermost query, the LEFT JOIN's condition is always false, but a left outer join must still emit one row per row of table_3 (which is non-empty), with NULLs on the right side. COALESCE then falls back to t1.int_col_6, so no value in int_col is null, and the outer WHERE should retain all rows. The overall result of this query should therefore be a single row with the value '97'.
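The expected semantics can be sketched with plain Scala collections (no Spark involved; the variable names are illustrative, not from the report): a LEFT JOIN whose condition never matches pads the right side with null, and COALESCE falls back to the left column.

```scala
object LeftJoinOnFalseSketch {
  def main(args: Array[String]): Unit = {
    val table3 = Seq(97) // t1.int_col_6
    val table4 = Seq(0)  // t2.int_col_1 (never matched: join condition is false)

    // LEFT JOIN ... ON false: every left row survives, paired with a null
    // (here None) right side, because no right row can ever match.
    val joined: Seq[(Int, Option[Int])] = table3.map(left => (left, None))

    // COALESCE(t2.int_col_1, t1.int_col_6): take the right value if present,
    // otherwise fall back to the left value.
    val coalesced = joined.map { case (left, right) => right.getOrElse(left) }

    // WHERE int_col IS NOT NULL: nothing here is null, so all rows survive.
    val result = coalesced.filter(_ != null)

    assert(result == Seq(97))
    println(result)
  }
}
```

Under these semantics the query must return one row, 97, which is what Spark 2.0 does; master returns an empty result instead.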
Instead, the current Spark master (as of 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking at the explain output, it appears that the logical plan is being optimized to LocalRelation &lt;empty&gt;, so Spark doesn't even run the query. My suspicion is that there's a bug in constraint propagation or filter pushdown.
This issue doesn't seem to affect Spark 2.0, so I think it's a regression in master.
Links to: SPARK-16991 (Full outer join followed by inner join produces wrong results)