[SPARK-25368] Incorrect constraint inference returns wrong result - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.3.0, 2.3.1
Fix Version/s: 2.3.2, 2.4.0
Component/s: Optimizer, SQL
Labels:
- correctness

Description

There is a breaking change in Spark 2.3 (I checked on 2.3.1 and 2.3.2-rc5).

The following code reproduces the problem
(it is a somewhat convoluted example; I tried to simplify it as much as possible from my code):

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

import spark.implicits._

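case class Data(a: Option[Int], b: String, c: Option[String], d: String)

val df1 = spark.createDataFrame(Seq(
  Data(Some(1), "1", None, "1"),
  Data(None, "2", Some("2"), "2")
))

// Keep only rows with a non-null "a", then add an all-null string column "e".
val df2 = df1
  .where($"a".isNotNull)
  .withColumn("e", lit(null).cast("string"))

// Column order of df2, reused so that df3 lines up for the positional union.
val columns = df2.columns.map(c => col(c))

// Same columns as df2, but "a", "b" and "d" are all null and "e" carries df1's "b".
val df3 = df1
  .select(
    $"c",
    $"b" as "e"
  )
  .withColumn("a", lit(null).cast("int"))
  .withColumn("b", lit(null).cast("string"))
  .withColumn("d", lit(null).cast("string"))
  .select(columns: _*)

// Fill "e" with the last non-null value per "c" partition, then drop rows with null "a".
val df4 =
  df2.union(df3)
    .withColumn("e", last(col("e"), ignoreNulls = true).over(Window.partitionBy($"c").orderBy($"d")))
    .filter($"a".isNotNull)

df4.show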

Note that the last step in building df4 filters out rows where a is null.

In Spark 2.2.1, the code above prints:

+---+---+----+---+---+
|  a|  b|   c|  d|  e|
+---+---+----+---+---+
|  1|  1|null|  1|  1|
+---+---+----+---+---+

In Spark 2.3.x, it prints:

+----+----+----+----+---+
|   a|   b|   c|   d|  e|
+----+----+----+----+---+
|null|null|null|null|  1|
|   1|   1|null|   1|  1|
|null|null|   2|null|  2|
+----+----+----+----+---+

Column a still contains null values, even though the final filter should have removed them.
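The symptom can also be checked programmatically rather than by reading the `show` output. The following is a minimal sketch, assuming it is run in the same spark-shell session right after the reproduction above so that `df4` is in scope. `countNullsOnDriver` is a hypothetical helper name, not part of the report or of Spark. It collects the rows and inspects them on the driver, so the check itself does not add another filter for the optimizer to rewrite.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not from the report): counts rows whose `column` is null.
// The rows are collected first, so the null check runs on the driver
// and is not part of the optimized query plan.
def countNullsOnDriver(df: DataFrame, column: String): Int = {
  val idx = df.schema.fieldIndex(column)
  df.collect().count(_.isNullAt(idx))
}

// Assumes `df4` from the reproduction above is in scope.
val nullsInA = countNullsOnDriver(df4, "a")
println(s"rows with null a: $nullsInA")
```

Going by the outputs quoted in this report, the count should be 0 on 2.2.1 and 2 on the affected 2.3.x builds.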

The plans are attached.

In the parsed logical plan, the filter isnotnull('a) is on top,
but in the optimized logical plan it is pushed down.
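For anyone who wants to compare the plans locally instead of reading the attachment, here is a minimal sketch, again assuming `df4` from the reproduction is in scope in the same session. `printLogicalPlans` is a hypothetical helper name. It only uses `Dataset.queryExecution`, whose `logical` and `optimizedPlan` fields hold the two plans discussed above.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not from the report): prints the parsed and optimized
// logical plans so the position of the isnotnull(a) filter can be compared.
def printLogicalPlans(df: DataFrame): Unit = {
  val qe = df.queryExecution
  println("== Parsed Logical Plan ==")
  println(qe.logical.treeString)
  println("== Optimized Logical Plan ==")
  println(qe.optimizedPlan.treeString)
}

// Assumes `df4` from the reproduction above is in scope.
printLogicalPlans(df4)
```

`df4.explain(true)` prints the same information together with the analyzed and physical plans, which may be more convenient for a quick look.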

Attachments

plan.txt
07/Sep/18 12:09
3 kB
Lev Katzav

Issue Links

links to

[Github] Pull Request #22368 (wangyum)

People

Assignee: Yuming Wang

Reporter: Lev Katzav

Votes: 0

Watchers: 4

Dates

Created: 07/Sep/18 12:06

Updated: 09/Sep/18 22:49

Resolved: 09/Sep/18 16:10