Spark / SPARK-22951

count() after dropDuplicates() on emptyDataFrame returns incorrect value


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 2.1.3, 2.2.0, 2.3.0
    • Fix Version/s: 2.2.3, 2.3.0
    • Component/s: SQL

    Description

      Here is a minimal Spark application that reproduces the issue:

      import org.apache.spark.sql.SQLContext
      import org.apache.spark.{SparkConf, SparkContext}
      
      
      object DropDupesApp {

        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setAppName("test")
            .setMaster("local")
          val sc = new SparkContext(conf)
          val sql = SQLContext.getOrCreate(sc)
          // An empty DataFrame has no rows, so count is 0 (expected).
          assert(sql.emptyDataFrame.count == 0)
          // Bug: after dropDuplicates the count is 1 instead of 0 (unexpected).
          assert(sql.emptyDataFrame.dropDuplicates.count == 1)
          sc.stop()
        }
        
      }
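
For Spark 2.x, where `SparkSession` is the usual entry point, the same check could be sketched roughly as follows. This is only a sketch, not code from the reporter: the object name `DropDupesCheck` and the helper `emptyDedupCount` are hypothetical, and the expected value of 0 assumes a build that carries the fix (2.2.3 or 2.3.0, per the fix versions listed on this issue). On the affected versions the assertion would be expected to fail with a count of 1.

```scala
import org.apache.spark.sql.SparkSession

object DropDupesCheck {

  // Hypothetical helper: row count of an empty DataFrame after dropDuplicates().
  def emptyDedupCount(spark: SparkSession): Long =
    spark.emptyDataFrame.dropDuplicates().count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-22951-check")
      .master("local[1]")
      .getOrCreate()
    try {
      val n = emptyDedupCount(spark)
      // Assumption: 0 on builds with the fix; affected versions report 1.
      assert(n == 0L, s"expected 0 rows after dropDuplicates, got $n")
    } finally {
      // Always release the local Spark context, even if the assertion fails.
      spark.stop()
    }
  }
}
```

Keeping the count in a small helper makes it easy to reuse the same check from a test suite instead of a standalone `main`.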
      

          People

            Assignee: Feng Liu (fengliu@databricks.com)
            Reporter: Michael Dreibelbis (belbis)
            Votes: 0
            Watchers: 7
