SPARK-17120

Analyzer incorrectly optimizes plan to empty LocalRelation


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: SQL

    Description

      Consider the following query:

      sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
      sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")

      println(sql("""
        SELECT *
        FROM (
          SELECT COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
          FROM table_3 t1
          LEFT JOIN table_4 t2 ON false
        ) t
        WHERE (t.int_col) IS NOT NULL
      """).collect().toSeq)
      

      In the innermost query, the LEFT JOIN's condition is false, but a left join still produces one output row per row of table_3 (which is non-empty), padding t2's columns with NULL. COALESCE(t2.int_col_1, t1.int_col_6) therefore falls back to t1.int_col_6, so no values in int_col are null and the outer WHERE should retain all rows. The overall result of this query should be a single row with the value 97.
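
      A minimal sketch of the expected semantics in plain Scala collections (the names here are illustrative, not part of the report):

      // table_3 has one row (int_col_6 = 97); a LEFT JOIN ON false keeps
      // every left row and pads the right side with NULL (None here).
      val table3 = Seq(97)
      val joined = table3.map(left => (left, Option.empty[Int]))
      // COALESCE(t2.int_col_1, t1.int_col_6): right value if present, else left.
      val coalesced = joined.map { case (left, right) => right.getOrElse(left) }
      // WHERE int_col IS NOT NULL: nothing is null, so every row survives.
      println(coalesced) // List(97)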

      Instead, the current Spark master (as of commit 12a89e55cbd630fa2986da984e066cd07d3bf1f7, at least) returns no rows. Looking at the output of explain, the logical plan is being optimized to LocalRelation <empty>, so Spark doesn't even run the query. My suspicion is that there's a bug in constraint propagation or filter pushdown.
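
      To see this, Dataset.explain(true) in the same spark-shell session prints the parsed, analyzed, optimized, and physical plans; a sketch (the exact plan text below is assumed, not copied from an affected build):

      sql("""
        SELECT *
        FROM (
          SELECT COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
          FROM table_3 t1
          LEFT JOIN table_4 t2 ON false
        ) t
        WHERE (t.int_col) IS NOT NULL
      """).explain(true)
      // On an affected build, the optimized plan collapses to an empty relation:
      //   == Optimized Logical Plan ==
      //   LocalRelation <empty>, [int_col#...]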

      This issue doesn't seem to affect Spark 2.0, so I think it's a regression in master.


            People

              Assignee: Xiao Li (smilegator)
              Reporter: Josh Rosen (joshrosen)
              Votes: 0
              Watchers: 5
