Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.3
Fix Version/s: None
Component/s: None
Environment: EKS, EMR
Description
Version: 3.3.3
scala> val df = spark.emptyDataFrame
df: org.apache.spark.sql.DataFrame = []
scala> df.rdd.getNumPartitions
res0: Int = 0
scala> df.repartition(1).rdd.getNumPartitions
res1: Int = 1
scala> df.repartition(1).rdd.isEmpty()
res2: Boolean = true
Version: 3.2.4
scala> val df = spark.emptyDataFrame
df: org.apache.spark.sql.DataFrame = []
scala> df.rdd.getNumPartitions
res0: Int = 0
scala> df.repartition(1).rdd.getNumPartitions
res1: Int = 0
scala> df.repartition(1).rdd.isEmpty()
res2: Boolean = true
Version: 3.5.0
scala> val df = spark.emptyDataFrame
df: org.apache.spark.sql.DataFrame = []
scala> df.rdd.getNumPartitions
res0: Int = 0
scala> df.repartition(1).rdd.getNumPartitions
res1: Int = 1
scala> df.repartition(1).rdd.isEmpty()
res2: Boolean = true
When we repartition an empty DataFrame to 1 partition, the resulting partition count is 1 in versions 3.3.x and 3.5.x, whereas in version 3.2.x it is 0. May I know why this behaviour changed from 3.2.x to the higher versions?
The reason for raising this as a bug is that I have a scenario where my final DataFrame ends up with 0 partitions in EKS (local Spark, single node with the driver and executor on the same node) but with 1 partition in EMR, and both environments use the same Spark version, 3.3.3. I'm not sure why this behaves differently in the two environments. As an interim solution, I had to repartition the empty DataFrame whenever my final DataFrame is empty, which yields 1 partition on 3.3.3 (a sketch of this workaround is at the end of this description). I would like to know whether this is really a bug, or whether this behaviour will remain in future versions and cannot be changed.
Because if we go for a Spark upgrade and this behaviour changes again, we will face the issue again.
Please confirm.
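
For reference, a minimal sketch of the interim workaround described above, assuming the 3.3.3 / 3.5.x behaviour where repartition(1) on an empty DataFrame yields 1 partition. The names EmptyDfWorkaround, normalizeEmpty and finalDf are illustrative placeholders, not part of the actual job.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Minimal sketch of the interim workaround: force a known partition count
// when the final DataFrame is empty, so that EKS and EMR behave the same.
object EmptyDfWorkaround {

  // `df` stands in for the job's real output DataFrame.
  def normalizeEmpty(df: DataFrame): DataFrame = {
    // On 3.3.3 / 3.5.x, repartition(1) on an empty DataFrame yields 1 partition,
    // which keeps downstream behaviour consistent across environments.
    if (df.isEmpty) df.repartition(1) else df
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("empty-df-workaround").getOrCreate()
    val finalDf = spark.emptyDataFrame          // placeholder for the real pipeline output
    val normalized = normalizeEmpty(finalDf)
    println(normalized.rdd.getNumPartitions)    // expected: 1 on 3.3.3 and 3.5.x
    spark.stop()
  }
}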