Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25368

Incorrect constraint inference returns wrong result

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.3.1
    • Fix Version/s: 2.3.2, 2.4.0
    • Component/s: Optimizer, SQL
    • Labels:

      Description

      there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5)

      the following code recreates the problem
      (it's a bit convoluted examples, I tried to simplify it as much as possible from my code)

      import org.apache.spark.sql.{DataFrame, SQLContext}
      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions._
      
      import spark.implicits._
      
      case class Data(a: Option[Int],b: String,c: Option[String],d: String)
      
      val df1 = spark.createDataFrame(Seq(
         Data(Some(1), "1", None, "1"),
         Data(None, "2", Some("2"), "2")
      ))
      
      val df2 = df1
      .where( $"a".isNotNull)
      .withColumn("e", lit(null).cast("string"))
      
      val columns = df2.columns.map(c => col(c))
      
      val df3 = df1
      .select(
        $"c",
        $"b" as "e"
        )
        .withColumn("a", lit(null).cast("int"))
        .withColumn("b", lit(null).cast("string"))
        .withColumn("d", lit(null).cast("string"))
        .select(columns :_*)
      
      val df4 =
        df2.union(df3)
        .withColumn("e", last(col("e"), ignoreNulls = true).over(Window.partitionBy($"c").orderBy($"d")))
        .filter($"a".isNotNull)
      
      df4.show
      
      

       

      notice that the last statement in for df4 is to filter rows where a is null

      in spark 2.2.1, the above code prints:

      +---+---+----+---+---+ 
      | a| b| c| d| e|
       +---+---+----+---+---+ 
      | 1| 1|null| 1| 1| 
      +---+---+----+---+---+
      

      in spark 2.3.x, it prints: 

      +----+----+----+----+---+ 
      | a| b| c| d| e| 
      +----+----+----+----+---+ 
      |null|null|null|null| 1| 
      | 1| 1|null| 1| 1| 
      |null|null| 2|null| 2|
       +----+----+----+----+---+
      

       the column a still contains null values

       

      attached are the plans.

      in the parsed logical plan, the filter for isnotnull('a), is on top,
      but in the optimized logical plan, it is pushed down

        Attachments

        1. plan.txt
          3 kB
          Lev Katzav

          Activity

            People

            • Assignee:
              yumwang Yuming Wang
              Reporter:
              lev Lev Katzav
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: