Spark / SPARK-22614

Expose range partitioning shuffle

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: Shuffle, SQL
    • Labels:
      None

      Description

      The Dataset API currently offers only two ways to explicitly repartition a dataset:

      • round-robin partitioning, via def repartition(numPartitions: Int)
      • hash partitioning, via def repartition(numPartitions: Int, partitionExprs: Column*)
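
      For illustration, the difference between the two existing strategies can be sketched in plain Scala without a Spark dependency. This is a simplified model, not Spark's implementation: the object and method names (ExistingPartitioningSketch, roundRobin, hashPartition) are hypothetical, and hashCode stands in for whatever hash function Spark actually applies to the partition expressions.

```scala
object ExistingPartitioningSketch {
  // Round robin: row i goes to partition i % n, regardless of its contents.
  // (Assumption: we start at partition 0; a real engine may start elsewhere.)
  def roundRobin[A](rows: Seq[A], n: Int): Map[Int, Seq[A]] =
    rows.zipWithIndex
      .groupBy { case (_, i) => i % n }
      .map { case (p, rs) => p -> rs.map(_._1) }

  // Keep the modulo result in [0, n) even for negative hash codes.
  def nonNegativeMod(x: Int, n: Int): Int = {
    val r = x % n
    if (r < 0) r + n else r
  }

  // Hash partitioning: rows with equal keys always land in the same partition,
  // but neighbouring key values are scattered across partitions.
  def hashPartition[A, K](rows: Seq[A], n: Int)(key: A => K): Map[Int, Seq[A]] =
    rows.groupBy(r => nonNegativeMod(key(r).hashCode, n))

  def main(args: Array[String]): Unit = {
    val rows = Seq(1, 2, 3, 4, -1)
    println(roundRobin(rows, 2))
    println(hashPartition(rows, 2)(x => x))
  }
}
```

      Neither strategy keeps nearby key values together, which is the gap this issue is about.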

      It would be useful to also expose range partitioning. Because it places rows with nearby key values in the same partition, it can, for example, improve compression when writing data out to disk, and it may enable new use cases.
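
      A minimal sketch of the idea behind range partitioning, again in plain Scala. The names (RangePartitionSketch, bounds, partitionOf) are hypothetical, and the boundary selection here simply sorts the whole input; a distributed engine would typically estimate boundaries from a sample instead.

```scala
object RangePartitionSketch {
  // Pick numPartitions - 1 boundary values from the sorted keys so that each
  // range holds roughly the same number of keys.
  def bounds(sample: Seq[Int], numPartitions: Int): Vector[Int] = {
    require(sample.nonEmpty && numPartitions > 0, "need keys and a positive partition count")
    val sorted = sample.sorted.toVector
    (1 until numPartitions).map(i => sorted(i * sorted.length / numPartitions)).toVector
  }

  // A key goes to the first partition whose boundary is strictly greater than
  // the key; keys at or above the last boundary go to the final partition.
  def partitionOf(key: Int, bounds: Vector[Int]): Int = {
    val idx = bounds.indexWhere(key < _)
    if (idx == -1) bounds.length else idx
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(42, 7, 19, 88, 3, 61, 25, 70, 11, 95, 54, 36)
    val b = bounds(keys, 3)
    keys.groupBy(partitionOf(_, b)).toSeq.sortBy(_._1).foreach {
      case (p, ks) => println(s"partition $p: ${ks.sorted.mkString(", ")}")
    }
  }
}
```

      With the twelve keys above and three partitions, the boundaries come out as 25 and 61, so each partition holds one contiguous slice of the key space. Contiguous slices are what tends to help columnar encodings when the data is written out. In Spark itself, the Dataset API has offered repartitionByRange(numPartitions: Int, partitionExprs: Column*) since 2.3.0, the fix version of this issue.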

              People

              • Assignee: Adrian Ionescu (a.ionescu)
              • Reporter: Adrian Ionescu (a.ionescu)
              • Shepherd: Herman van Hovell
              • Votes: 0
              • Watchers: 5