Spark / SPARK-22665

Dataset API: .repartition() inconsistency / issue

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels:
      None

      Description

      We currently have two functions for explicitly repartitioning a Dataset:

      def repartition(numPartitions: Int)

      and

      def repartition(numPartitions: Int, partitionExprs: Column*)

      The second function's signature allows it to be called with an empty list of expressions as well.

      However:

      • df.repartition(numPartitions) does RoundRobin partitioning
      • df.repartition(numPartitions, Seq.empty: _*) does HashPartitioning on a constant, effectively moving all tuples to a single partition

      Not only is this inconsistent, but the latter behavior is very undesirable: it may hide problems in small-scale prototype code, but will inevitably fail (or have terrible performance) in production.
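The inconsistency can be demonstrated with a small, self-contained program. This is an illustrative sketch, not code from the issue: it assumes a local SparkSession and an example DataFrame, and requires the Spark SQL dependency on the classpath.

```scala
import org.apache.spark.sql.{Column, SparkSession}

object RepartitionInconsistency {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("repartition-inconsistency")
      .getOrCreate()

    val df = spark.range(0, 1000).toDF("id")

    // First overload: RoundRobin partitioning, rows spread evenly.
    val roundRobin = df.repartition(10)
    println(roundRobin.rdd.getNumPartitions) // 10

    // Same call site with an empty expression list resolves to the
    // (numPartitions, partitionExprs*) overload, which hash-partitions
    // on a constant: all rows end up in a single partition.
    val degenerate = df.repartition(10, Seq.empty[Column]: _*)

    // Count rows per partition to make the skew visible: one partition
    // holds all 1000 rows, the other nine are empty.
    val sizes = degenerate.rdd
      .mapPartitions(it => Iterator(it.size))
      .collect()
    println(sizes.mkString(", "))

    spark.stop()
  }
}
```

Running the program shows that both calls report 10 partitions, but only the first distributes the data across them.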

      I suggest we make this case:

      • either throw an IllegalArgumentException,
      • or fall back to RoundRobin partitioning, just like df.repartition(numPartitions).
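A minimal sketch of the first option. The helper below is hypothetical and standalone, not Spark's actual implementation; it only shows how a precondition on the expression list would surface as an IllegalArgumentException (Scala's require throws exactly that):

```scala
import scala.util.Try

// Hypothetical guard, extracted for illustration: reject an empty
// partition-expression list up front instead of silently
// hash-partitioning on a constant.
object RepartitionGuard {
  def checkPartitionExprs(partitionExprs: Seq[Any]): Unit =
    require(
      partitionExprs.nonEmpty,
      "At least one partition-by expression must be specified; " +
        "use repartition(numPartitions) for RoundRobin partitioning.")

  def main(args: Array[String]): Unit = {
    // Empty list: require throws IllegalArgumentException.
    Try(checkPartitionExprs(Seq.empty)).failed.foreach(e =>
      println(s"rejected: ${e.getMessage}"))

    // Non-empty list: the guard passes silently.
    checkPartitionExprs(Seq("id"))
    println("non-empty list accepted")
  }
}
```

With such a guard in place, the ambiguous call fails fast at the call site rather than producing a degenerate single-partition shuffle in production.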

            People

            • Assignee: Marco Gaido (mgaido)
            • Reporter: Adrian Ionescu (a.ionescu)
            • Votes: 0
            • Watchers: 5
