Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25368

Incorrect constraint inference returns wrong result

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 2.3.0, 2.3.1
    • 2.3.2, 2.4.0
    • Optimizer, SQL

    Description

      there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5)

      the following code recreates the problem
      (it's a bit convoluted examples, I tried to simplify it as much as possible from my code)

      import org.apache.spark.sql.{DataFrame, SQLContext}
      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions._
      
      import spark.implicits._
      
      case class Data(a: Option[Int],b: String,c: Option[String],d: String)
      
      val df1 = spark.createDataFrame(Seq(
         Data(Some(1), "1", None, "1"),
         Data(None, "2", Some("2"), "2")
      ))
      
      val df2 = df1
      .where( $"a".isNotNull)
      .withColumn("e", lit(null).cast("string"))
      
      val columns = df2.columns.map(c => col(c))
      
      val df3 = df1
      .select(
        $"c",
        $"b" as "e"
        )
        .withColumn("a", lit(null).cast("int"))
        .withColumn("b", lit(null).cast("string"))
        .withColumn("d", lit(null).cast("string"))
        .select(columns :_*)
      
      val df4 =
        df2.union(df3)
        .withColumn("e", last(col("e"), ignoreNulls = true).over(Window.partitionBy($"c").orderBy($"d")))
        .filter($"a".isNotNull)
      
      df4.show
      
      

       

      notice that the last statement in for df4 is to filter rows where a is null

      in spark 2.2.1, the above code prints:

      +---+---+----+---+---+ 
      | a| b| c| d| e|
       +---+---+----+---+---+ 
      | 1| 1|null| 1| 1| 
      +---+---+----+---+---+
      

      in spark 2.3.x, it prints: 

      +----+----+----+----+---+ 
      | a| b| c| d| e| 
      +----+----+----+----+---+ 
      |null|null|null|null| 1| 
      | 1| 1|null| 1| 1| 
      |null|null| 2|null| 2|
       +----+----+----+----+---+
      

       the column a still contains null values

       

      attached are the plans.

      in the parsed logical plan, the filter for isnotnull('a), is on top,
      but in the optimized logical plan, it is pushed down

      Attachments

        1. plan.txt
          3 kB
          Lev Katzav

        Activity

          People

            yumwang Yuming Wang
            lev Lev Katzav
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: