Spark / SPARK-46105

df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.3
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment: EKS, EMR

    Description

      Version: 3.3.3

       

      scala> val df = spark.emptyDataFrame
      df: org.apache.spark.sql.DataFrame = []

      scala> df.rdd.getNumPartitions
      res0: Int = 0

      scala> df.repartition(1).rdd.getNumPartitions
      res1: Int = 1

      scala> df.repartition(1).rdd.isEmpty()
      res2: Boolean = true

      Version: 3.2.4

      scala> val df = spark.emptyDataFrame
      df: org.apache.spark.sql.DataFrame = []

      scala> df.rdd.getNumPartitions
      res0: Int = 0

      scala> df.repartition(1).rdd.getNumPartitions
      res1: Int = 0

      scala> df.repartition(1).rdd.isEmpty()
      res2: Boolean = true

       

      Version: 3.5.0

      scala> val df = spark.emptyDataFrame
      df: org.apache.spark.sql.DataFrame = []

      scala> df.rdd.getNumPartitions
      res0: Int = 0

      scala> df.repartition(1).rdd.getNumPartitions
      res1: Int = 1

      scala> df.repartition(1).rdd.isEmpty()
      res2: Boolean = true

       

      When we call repartition(1) on an empty DataFrame, the resulting partition count is 1 in versions 3.3.x and 3.5.x, whereas in version 3.2.x it is 0. Could you explain why this behaviour changed from 3.2.x to later versions?

       

      The reason for raising this as a bug is that I have a scenario where my final DataFrame reports 0 partitions on EKS (local Spark, single node, with driver and executor on the same node) but 1 on EMR, even though both environments use the same Spark version, 3.3.3. I'm not sure why the behaviour differs between the two. As an interim workaround, I repartition the final DataFrame when it is empty, which yields 1 partition on 3.3.3. I would like to know whether this is really a bug, or whether this behaviour will persist in future versions and cannot be changed.
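      A more robust interim workaround may be to test emptiness with Dataset.isEmpty (available since Spark 2.4) instead of comparing partition counts, since isEmpty checks the data itself and gives the same answer regardless of how many partitions the plan produces. A minimal sketch, assuming a live SparkSession named spark:

      ```scala
      // Sketch: detect an empty result without relying on partition counts,
      // which differ across Spark versions (0 on 3.2.x, 1 on 3.3.x/3.5.x
      // after repartition(1) on an empty DataFrame).
      val result = spark.emptyDataFrame.repartition(1)

      if (result.isEmpty) {
        // handle the empty case; getNumPartitions may be 0 or 1 here
        // depending on the Spark version, but isEmpty is always true
        println("no rows produced")
      } else {
        // process the non-empty result
        result.show()
      }
      ```

      This keeps the downstream logic insensitive to the version-dependent partition count shown in the transcripts above.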

       

      This matters because, if we upgrade Spark and this behaviour changes again, we will hit the same issue.

      Please confirm on this.

       

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: dharani_sugumar

            Dates

              Created:
              Updated:
