Spark / SPARK-46105

spark.emptyDataFrame shows 1 partition after repartition(1) in Spark 3.3.x and above


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: 3.3.3
    • Fix Version: None
    • Component: Spark Core
    • Labels: None
    • Environment: EKS, EMR

    Description

      Version: 3.3.3

       

      scala> val df = spark.emptyDataFrame
      df: org.apache.spark.sql.DataFrame = []

      scala> df.rdd.getNumPartitions
      res0: Int = 0

      scala> df.repartition(1).rdd.getNumPartitions
      res1: Int = 1

      scala> df.repartition(1).rdd.isEmpty()
      res2: Boolean = true

      Version: 3.2.4

      scala> val df = spark.emptyDataFrame
      df: org.apache.spark.sql.DataFrame = []

      scala> df.rdd.getNumPartitions
      res0: Int = 0

      scala> df.repartition(1).rdd.getNumPartitions
      res1: Int = 0

      scala> df.repartition(1).rdd.isEmpty()
      res2: Boolean = true

       

      Version: 3.5.0

      scala> val df = spark.emptyDataFrame
      df: org.apache.spark.sql.DataFrame = []

      scala> df.rdd.getNumPartitions
      res0: Int = 0

      scala> df.repartition(1).rdd.getNumPartitions
      res1: Int = 1

      scala> df.repartition(1).rdd.isEmpty()
      res2: Boolean = true

       

      When we repartition an empty DataFrame to 1 partition, the resulting partition count is 1 in versions 3.3.x and 3.5.x, whereas in version 3.2.x it is 0. May I know why this behaviour changed from 3.2.x to the later versions?
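
      For callers that only need to know whether the result is empty, checking the data rather than the partition layout is stable across all three versions shown above. A minimal sketch, assuming an active `SparkSession` named `spark` (as in the REPL transcripts):

      ```scala
      // Emptiness should be tested on the contents, not on the partition count:
      // rdd.getNumPartitions differs between 3.2.x (0) and 3.3.x+ (1) for an
      // empty, repartitioned DataFrame, but these checks agree everywhere.
      val df = spark.emptyDataFrame.repartition(1)

      df.isEmpty       // true on 3.2.4, 3.3.3, and 3.5.0 per the transcripts above
      df.count() == 0L // likewise true on all three
      ```

      Both `Dataset.isEmpty` and `Dataset.count` are part of the public Dataset API, so they do not depend on how the planner materializes empty partitions.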

       

      The reason for raising this as a bug is that I have a scenario where my final DataFrame returns 0 records in EKS (local Spark) with a single node (driver and executor on the same node), but it returns 1 in EMR; both use the same Spark version, 3.3.3. I'm not sure why this behaves differently in the two environments. As an interim solution, I had to repartition the empty DataFrame whenever my final DataFrame is empty, which returns 1 on 3.3.3. I would like to know whether this is really a bug, or whether this behaviour will persist in future versions and cannot be changed.
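
      The interim workaround described above can be captured in a small helper, so the normalization is applied in one place rather than scattered through the job. This is a sketch, not an official Spark API; `normalizeEmpty` is a hypothetical name:

      ```scala
      import org.apache.spark.sql.DataFrame

      // Hypothetical helper: repartition empty results so that both
      // environments (EKS and EMR) on Spark 3.3.3 report the same
      // rdd.getNumPartitions for an empty final DataFrame.
      // Caveat: on 3.2.x, repartition(1) on an empty plan still yields
      // 0 partitions, so this only normalizes within 3.3.x and later.
      def normalizeEmpty(df: DataFrame): DataFrame =
        if (df.isEmpty) df.repartition(1) else df
      ```

      Downstream code that must not depend on this behaviour at all is safer testing `df.isEmpty` directly instead of comparing partition counts.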

       

      Because, if we go for a Spark upgrade and this behaviour changes again, we will face the issue again.

      Please confirm.

       


People

    Assignee: Unassigned
    Reporter: dharani_sugumar
