Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Versions: 2.3.0, 2.3.1, 2.3.2
Environment: Spark version 2.3.2
Description
Hi,
I recently played with the repartitionByRange method for DataFrame introduced in SPARK-22614. For DataFrames larger than the one tested in the code (which has only 10 elements), the code sends back random results.
As a test for showing the inconsistent behaviour, I start as the unit code used to test repartitionByRange (here) but I increase the size of the initial array to 1000, repartition using 3 partitions, and count the number of element per-partitions:
// Shuffle numbers from 0 to 1000, and make a DataFrame
val df = Random.shuffle(0.to(1000)).toDF("val")

// Repartition it using 3 partitions,
// count the number of elements in each partition, and collect it.
// And do it several times:
for (i <- 0 to 9) {
  val counts = df.repartitionByRange(3, col("val"))
    .mapPartitions { part => Iterator(part.size) }
    .collect()
  println(counts.toList)
}
// -> the number of elements in each partition varies between runs...
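If I understand correctly, repartitionByRange determines partition boundaries with a sampling-based range partitioner, so if the sample drawn from the data changes between runs, the boundaries (and hence the per-partition counts) change too. Here is a toy sketch of that effect in plain Scala (this is NOT Spark's implementation; sampleBoundaries is a made-up helper for illustration):

```scala
import scala.util.Random

// Hypothetical sketch: estimate numPartitions - 1 range boundaries
// from a random sample of the data, the way a range partitioner
// typically does. With a fresh random source on every run, the
// sampled boundaries -- and thus the partition sizes -- can differ.
def sampleBoundaries(data: Seq[Int], numPartitions: Int,
                     sampleSize: Int, rng: Random): Seq[Int] = {
  // Draw a random sample and sort it
  val sample = rng.shuffle(data.toList).take(sampleSize).sorted
  // Pick evenly spaced elements of the sample as boundaries
  (1 until numPartitions).map { i =>
    sample((i * sample.size) / numPartitions)
  }
}

val data = 0 to 1000
// Two runs with different random sources usually give different boundaries:
println(sampleBoundaries(data, 3, 20, new Random()))
println(sampleBoundaries(data, 3, 20, new Random()))
```

If the boundaries varied like this between jobs, it would explain the varying counts above, but I still need to confirm this against the actual RangePartitioner code.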
I do not know whether this is expected (I will dig further into the code), but it sounds like a bug.
Or am I just misinterpreting what repartitionByRange is for?
Any ideas?
Thanks!
Julien