Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0, 2.3.1, 2.3.2
Description
There appears to be a regression in Dataset.except between Spark 2.2 and 2.3. Below is the code to reproduce it:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val inputDF = spark.sqlContext.createDataFrame(
  spark.sparkContext.parallelize(Seq(
    Row("0", "john", "smith", "john@smith.com"),
    Row("1", "jane", "doe", "jane@doe.com"),
    Row("2", "apache", "spark", "spark@apache.org"),
    Row("3", "foo", "bar", null)
  )),
  StructType(List(
    StructField("id", StringType, nullable=true),
    StructField("first_name", StringType, nullable=true),
    StructField("last_name", StringType, nullable=true),
    StructField("email", StringType, nullable=true)
  ))
)

val exceptDF = inputDF.transform( toProcessDF =>
  toProcessDF.filter(
    (
      col("first_name").isin(Seq("john", "jane"): _*)
        and col("last_name").isin(Seq("smith", "doe"): _*)
    )
    or col("email").isin(List(): _*)
  )
)

inputDF.except(exceptDF).show()
Output with Spark 2.2:
+---+----------+---------+----------------+
| id|first_name|last_name|           email|
+---+----------+---------+----------------+
|  2|    apache|    spark|spark@apache.org|
|  3|       foo|      bar|            null|
+---+----------+---------+----------------+
Output with Spark 2.3:
+---+----------+---------+----------------+
| id|first_name|last_name|           email|
+---+----------+---------+----------------+
|  2|    apache|    spark|spark@apache.org|
+---+----------+---------+----------------+
Note: changing the last line to
inputDF.except(exceptDF.cache()).show()
produces identical output for both Spark 2.2 and 2.3.
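The row that goes missing is the one with a null email, which points at SQL three-valued logic. A plausible explanation (an assumption, not confirmed by this report alone) is that Spark 2.3's optimizer rewrites an except over a filtered child into a negated filter (along the lines of the ReplaceExceptWithFilter rule introduced in 2.3); since NOT(null) is null and filter only keeps rows whose predicate is true, a row whose condition evaluates to null is silently dropped. Caching exceptDF materializes it and would block such a rewrite, which would explain the workaround. Below is a minimal sketch of that pitfall, using a simplified, hypothetical predicate rather than Spark's actual rewrite:

// Minimal sketch of the suspected mechanism, not Spark's actual rewrite:
// under three-valued logic NOT(null) = null, and filter() keeps a row only
// when its predicate is true, so a negated filter drops null-predicate rows.
import org.apache.spark.sql.functions.{coalesce, lit, not}

// email === "john@smith.com" is null (not false) for row 3's null email,
// so its negation is also null and the row is dropped.
inputDF.filter(not(col("email") === "john@smith.com")).show()
// -> keeps only ids 1 and 2; id 3 (null email) is also dropped

// Null-safe variant: coalescing the predicate to false before negating it
// keeps id 3, matching the Spark 2.2 output of except.
inputDF.filter(not(coalesce(col("email") === "john@smith.com", lit(false)))).show()
// -> keeps ids 1, 2, and 3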