[SPARK-26024] Dataset API: repartitionByRange(...) has inconsistent behaviour - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0, 2.3.1, 2.3.2
Fix Version/s: 3.0.0
Component/s: SQL
Environment: Spark version 2.3.2

Description

Hi,

I recently played with the repartitionByRange method for DataFrame introduced in SPARK-22614. For DataFrames larger than the one tested in the code (which has only 10 elements), the method returns results that vary from one run to the next.

To show the inconsistent behaviour, I start from the unit test code for repartitionByRange (here), but I increase the size of the initial array to 1000, repartition using 3 partitions, and count the number of elements per partition:

import scala.util.Random
import org.apache.spark.sql.functions.col
import spark.implicits._  // assumes an active SparkSession named `spark` (e.g. in spark-shell)

// Shuffle the numbers 0 to 1000 (inclusive) and make a DataFrame
val df = Random.shuffle(0.to(1000)).toDF("val")

// Repartition into 3 range partitions, count the elements in each
// partition, and collect the counts. Repeat 10 times.
for (i <- 0 to 9) {
  val counts = df.repartitionByRange(3, col("val"))
    .mapPartitions { part => Iterator(part.size) }
    .collect()
  println(counts.toList)
}
// -> the number of elements in each partition varies between runs

I do not know whether this is expected (I will dig further into the code), but it looks like a bug.
Or am I misinterpreting what repartitionByRange is for?
Any ideas?

Thanks!
Julien
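
One likely explanation, not stated in the report itself: the Spark API documentation for repartitionByRange notes that the method estimates range boundaries by sampling, so the output may not be consistent between runs. The sketch below is a simplified, Spark-free model of that idea, not Spark's actual RangePartitioner. The object name `RangeSamplingSketch`, its helper functions, and the sample size of 100 are all hypothetical choices made for illustration.

```scala
import scala.util.Random

// Hypothetical, simplified model of sampling-based range partitioning.
// This is NOT Spark's implementation; it only illustrates why boundaries
// estimated from a random sample lead to varying partition sizes.
object RangeSamplingSketch {

  // Estimate (numPartitions - 1) upper bounds from a random sample of the data.
  def estimateBounds(data: Seq[Int], numPartitions: Int,
                     sampleSize: Int, rng: Random): Seq[Int] = {
    val sample = rng.shuffle(data).take(sampleSize).sorted
    (1 until numPartitions).map(i => sample(i * sample.size / numPartitions))
  }

  // Count how many values fall into each range; a value goes to the
  // partition whose index equals the number of bounds strictly below it.
  def partitionSizes(data: Seq[Int], bounds: Seq[Int]): List[Int] = {
    val counts = Array.fill(bounds.size + 1)(0)
    data.foreach(v => counts(bounds.count(_ < v)) += 1)
    counts.toList
  }

  def main(args: Array[String]): Unit = {
    val data = (0 to 1000).toList

    // Boundaries from a 100-element sample: sizes depend on the sample drawn.
    for (seed <- 0 until 5) {
      val bounds = estimateBounds(data, 3, 100, new Random(seed))
      println(partitionSizes(data, bounds))
    }

    // Using every element as the "sample" makes the boundaries exact.
    val exact = estimateBounds(data, 3, data.size, new Random(0))
    println(partitionSizes(data, exact)) // List(334, 334, 333)
  }
}
```

In this model the partition sizes become stable only once the sample covers the whole input. In Spark, the sample size is controlled by the `spark.sql.execution.rangeExchange.sampleSizePerPartition` setting, so raising it should reduce the variation at some performance cost. Treat this as a direction to verify rather than a confirmed fix.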

Issue Links

links to

[Github] Pull Request #23025 (JulienPeloton)

[Github] Pull Request #23167 (srowen)

People

Assignee: Julien Peloton
Reporter: Julien Peloton
Shepherd: Adrian Ionescu
Votes: 0
Watchers: 5

Dates

Created: 12/Nov/18 21:48
Updated: 28/Nov/18 17:01
Resolved: 19/Nov/18 14:26