Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.2.0
Description
Simple code that allows to reproduce this issue:
val frame = Seq((false, 1)).toDF("bool", "number") frame .checkpoint() .withColumn("conditions", when(col("bool"), "I am not null")) .filter(col("conditions").isNull) .show(false)
Although "conditions" column is null
+-----+------+----------+ |bool |number|conditions| +-----+------+----------+ |false|1 |null | +-----+------+----------+
empty result is shown.
Execution plans:
== Parsed Logical Plan == 'Filter isnull('conditions) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252] +- LogicalRDD [bool#124, number#125], false == Analyzed Logical Plan == bool: boolean, number: int, conditions: string Filter isnull(conditions#252) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252] +- LogicalRDD [bool#124, number#125], false == Optimized Logical Plan == LocalRelation <empty>, [bool#124, number#125, conditions#252] == Physical Plan == LocalTableScan <empty>, [bool#124, number#125, conditions#252]
After removing checkpoint proper result is returned and execution plans are as follow:
== Parsed Logical Plan == 'Filter isnull('conditions) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256] +- Project [_1#119 AS bool#124, _2#120 AS number#125] +- LocalRelation [_1#119, _2#120] == Analyzed Logical Plan == bool: boolean, number: int, conditions: string Filter isnull(conditions#256) +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256] +- Project [_1#119 AS bool#124, _2#120 AS number#125] +- LocalRelation [_1#119, _2#120] == Optimized Logical Plan == LocalRelation [bool#124, number#125, conditions#256] == Physical Plan == LocalTableScan [bool#124, number#125, conditions#256]
It seems that the most important difference is LogicalRDD -> LocalRelation
There are following ways (workarounds) to retrieve correct result:
1) remove checkpoint
2) add explicit .otherwise(null) to when
3) add checkpoint() or cache() just before filter
4) downgrade to Spark 3.1.2