Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
2.3.0, 2.3.1
Description
there is a breaking change in spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5)
the following code recreates the problem
(it's a bit convoluted examples, I tried to simplify it as much as possible from my code)
import org.apache.spark.sql.{DataFrame, SQLContext} import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions._ import spark.implicits._ case class Data(a: Option[Int],b: String,c: Option[String],d: String) val df1 = spark.createDataFrame(Seq( Data(Some(1), "1", None, "1"), Data(None, "2", Some("2"), "2") )) val df2 = df1 .where( $"a".isNotNull) .withColumn("e", lit(null).cast("string")) val columns = df2.columns.map(c => col(c)) val df3 = df1 .select( $"c", $"b" as "e" ) .withColumn("a", lit(null).cast("int")) .withColumn("b", lit(null).cast("string")) .withColumn("d", lit(null).cast("string")) .select(columns :_*) val df4 = df2.union(df3) .withColumn("e", last(col("e"), ignoreNulls = true).over(Window.partitionBy($"c").orderBy($"d"))) .filter($"a".isNotNull) df4.show
notice that the last statement in for df4 is to filter rows where a is null
in spark 2.2.1, the above code prints:
+---+---+----+---+---+
| a| b| c| d| e|
+---+---+----+---+---+
| 1| 1|null| 1| 1|
+---+---+----+---+---+
in spark 2.3.x, it prints:
+----+----+----+----+---+ | a| b| c| d| e| +----+----+----+----+---+ |null|null|null|null| 1| | 1| 1|null| 1| 1| |null|null| 2|null| 2| +----+----+----+----+---+
the column a still contains null values
attached are the plans.
in the parsed logical plan, the filter for isnotnull('a), is on top,
but in the optimized logical plan, it is pushed down