SPARK-26366

Except with transform regression

    Details

      Description

      There appears to be a regression between Spark 2.2 and 2.3: `except` drops a row containing a null value that Spark 2.2 keeps. The code below reproduces it:

       

      import org.apache.spark.sql.functions.col
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types._
      
      
      val inputDF = spark.sqlContext.createDataFrame(
        spark.sparkContext.parallelize(Seq(
          Row("0", "john", "smith", "john@smith.com"),
          Row("1", "jane", "doe", "jane@doe.com"),
          Row("2", "apache", "spark", "spark@apache.org"),
          Row("3", "foo", "bar", null)
        )),
        StructType(List(
          StructField("id", StringType, nullable=true),
          StructField("first_name", StringType, nullable=true),
          StructField("last_name", StringType, nullable=true),
          StructField("email", StringType, nullable=true)
        ))
      )
      
      val exceptDF = inputDF.transform( toProcessDF =>
        toProcessDF.filter(
            (
              col("first_name").isin(Seq("john", "jane"): _*)
                and col("last_name").isin(Seq("smith", "doe"): _*)
            )
            or col("email").isin(List(): _*)
        )
      )
      
      inputDF.except(exceptDF).show()
      

      Output with Spark 2.2:

      +---+----------+---------+----------------+
      | id|first_name|last_name|           email|
      +---+----------+---------+----------------+
      |  2|    apache|    spark|spark@apache.org|
      |  3|       foo|      bar|            null|
      +---+----------+---------+----------------+

      Output with Spark 2.3:

      +---+----------+---------+----------------+
      | id|first_name|last_name|           email|
      +---+----------+---------+----------------+
      |  2|    apache|    spark|spark@apache.org|
      +---+----------+---------+----------------+

      Note that changing the last line to

      inputDF.except(exceptDF.cache()).show()
      

      produces identical output on Spark 2.2 and Spark 2.3.
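
The report does not state a cause. One possible reading, offered as an assumption rather than a confirmed diagnosis: the two outputs differ exactly as they would if Spark 2.3 evaluated `except` as "keep the rows of `inputDF` for which the filter condition is not true" instead of as a set difference. To my understanding Spark 2.3 added an optimizer rule of this kind (`ReplaceExceptWithFilter`), and a cached `exceptDF` would no longer look like a plain filter over `inputDF`, which would fit the `cache()` observation. The sketch below does not use Spark at all. It models SQL three-valued logic in plain Scala, with `None` standing for NULL, and every name in it (`Tri`, `and3`, `or3`, `not3`, `in3`, `Person`) is a hypothetical helper introduced for illustration.

```scala
// Plain-Scala model of SQL three-valued logic; None stands for NULL.
// All helpers here are hypothetical and are not Spark APIs.
object ThreeValuedLogicSketch {
  type Tri = Option[Boolean]

  def and3(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  def or3(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // NOT NULL is still NULL.
  def not3(a: Tri): Tri = a.map(!_)

  // value IN (items): NULL when the value itself is NULL.
  def in3(value: Option[String], items: Set[String]): Tri =
    value.map(items.contains)

  final case class Person(
      id: String,
      firstName: String,
      lastName: String,
      email: Option[String])

  // Same four rows as the reproduction.
  val rows: Seq[Person] = Seq(
    Person("0", "john", "smith", Some("john@smith.com")),
    Person("1", "jane", "doe", Some("jane@doe.com")),
    Person("2", "apache", "spark", Some("spark@apache.org")),
    Person("3", "foo", "bar", None)
  )

  // Same shape as the filter condition in the reproduction.
  def predicate(p: Person): Tri =
    or3(
      and3(
        in3(Some(p.firstName), Set("john", "jane")),
        in3(Some(p.lastName), Set("smith", "doe"))),
      in3(p.email, Set.empty)
    )

  // A filter keeps a row only when its condition is TRUE;
  // FALSE and NULL both drop the row.
  def keep(t: Tri): Boolean = t.contains(true)

  // Set difference: remove every row that the filter kept.
  val filtered: Seq[Person] = rows.filter(p => keep(predicate(p)))
  val setDifferenceIds: Seq[String] =
    rows.filterNot(p => filtered.contains(p)).map(_.id)

  // Naive rewrite: keep the rows where NOT(condition) is TRUE.
  val negatedFilterIds: Seq[String] =
    rows.filter(p => keep(not3(predicate(p)))).map(_.id)

  def main(args: Array[String]): Unit = {
    println(s"set difference: $setDifferenceIds") // set difference: List(2, 3)
    println(s"negated filter: $negatedFilterIds") // negated filter: List(2)
  }
}
```

For row 3 the condition evaluates to NULL (FALSE OR NULL), so the filter drops it and the set difference keeps it, matching the Spark 2.2 output. Negating the condition still yields NULL, so the negated filter drops the row as well, matching the Spark 2.3 output.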

       

            People

            • Assignee: Marco Gaido (mgaido)
            • Reporter: Dan Osipov (danospv)