Spark / SPARK-26024

Dataset API: repartitionByRange(...) has inconsistent behaviour

Details

    Description

      Hi,

      I recently experimented with the repartitionByRange method for DataFrame, introduced in SPARK-22614. For DataFrames larger than the one used in its unit test (which has only 10 elements), the method returns different results from one run to the next.

      To demonstrate the inconsistent behaviour, I start from the unit test for repartitionByRange, but use the numbers 0 to 1000 as the initial sequence, repartition into 3 partitions, and count the number of elements per partition:

       

      import scala.util.Random
      import org.apache.spark.sql.functions.col
      import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

      // Shuffle the numbers from 0 to 1000 (inclusive) and make a DataFrame
      val df = Random.shuffle(0.to(1000)).toDF("val")
      
      // Repartition into 3 range partitions, count the elements
      // in each partition, and collect the counts.
      // Repeat 10 times to compare runs.
      for (i <- 0 to 9) {
        val counts = df.repartitionByRange(3, col("val"))
          .mapPartitions { part => Iterator(part.size) }
          .collect()
        println(counts.toList)
      }
      // -> the number of elements in each partition varies from run to run
      

      I do not know whether this is expected (I will dig further into the code), but it looks like a bug.
      Or am I misinterpreting what repartitionByRange is for?
      Any ideas?
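
To separate "partition sizes vary" from "range partitioning is broken", a follow-up check could record the minimum, maximum and count of each partition and verify that the ranges stay ordered and non-overlapping. The sketch below is only a diagnostic under assumptions: it creates its own local SparkSession (in spark-shell the existing `spark` can be used instead), the application name is arbitrary, and it assumes the default ascending order for repartitionByRange.

```scala
import scala.util.Random
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session is enough for this check.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("range-check") // arbitrary name, for illustration only
  .getOrCreate()
import spark.implicits._

val df = Random.shuffle(0.to(1000)).toDF("val")

// For each non-empty partition, record (partition index, min, max, count).
val stats = df.repartitionByRange(3, col("val"))
  .as[Int]
  .rdd
  .mapPartitionsWithIndex { (idx, part) =>
    val values = part.toVector
    if (values.isEmpty) Iterator.empty
    else Iterator((idx, values.min, values.max, values.size))
  }
  .collect()
  .sortBy(_._1)

stats.foreach(println)

// With range partitioning, every value in one partition should be smaller
// than every value in the next one, even if the sizes differ between runs.
val ordered = stats.sliding(2).forall {
  case Array(a, b) => a._3 < b._2
  case _ => true // fewer than two non-empty partitions
}
println(s"ranges ordered and disjoint: $ordered")
```

If the ranges stay ordered and disjoint while only the counts change, the variation would come from where the range boundaries are placed on each run, not from rows landing in the wrong partition.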

      Thanks!
      Julien

          People

            JulienPeloton Julien Peloton
            Adrian Ionescu Adrian Ionescu