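We currently have two functions for explicitly repartitioning a Dataset: repartition(numPartitions) and repartition(numPartitions, partitionExprs: _*).
The second function's signature allows it to be called with an empty list of expressions as well. However, the two calls behave very differently: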
- df.repartition(numPartitions) does RoundRobin partitioning
- df.repartition(numPartitions, Seq.empty: _*) does HashPartitioning on a constant, effectively moving all tuples to a single partition
Not only is this inconsistent, but the latter behavior is very undesirable: it may hide problems in small-scale prototype code, but will inevitably fail (or have terrible performance) in production.
I suggest we change the second case to:
- either throw an IllegalArgumentException
- or do RoundRobin partitioning, just like df.repartition(numPartitions)
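
To make the two options concrete, here is a minimal, dependency-free sketch of the decision each one implies. It is only a model: the Partitioning types, the RepartitionSketch object, and the strict/lenient names are hypothetical and are not Spark's internal classes, and expressions are represented as plain strings rather than Columns.

```scala
// Hypothetical model of the partitioning decision; NOT Spark's internal classes.
sealed trait Partitioning
final case class RoundRobin(numPartitions: Int) extends Partitioning
final case class HashOn(exprs: Seq[String], numPartitions: Int) extends Partitioning

object RepartitionSketch {
  // Option 1: reject an empty expression list.
  // `require` throws IllegalArgumentException when the condition is false.
  def strict(numPartitions: Int, exprs: Seq[String]): Partitioning = {
    require(exprs.nonEmpty, "At least one partition-by expression must be specified.")
    HashOn(exprs, numPartitions)
  }

  // Option 2: treat an empty expression list like repartition(numPartitions),
  // i.e. fall back to round-robin instead of hashing on a constant.
  def lenient(numPartitions: Int, exprs: Seq[String]): Partitioning =
    if (exprs.isEmpty) RoundRobin(numPartitions)
    else HashOn(exprs, numPartitions)
}
```

Under either option an empty expression list can no longer silently collapse the data into a single partition: the strict variant surfaces the mistake immediately, while the lenient variant keeps existing callers working.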