Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.1.0
- Fix Version/s: None
- Labels: None
Description
https://github.com/apache/spark/pull/21758#discussion_r204917245
The number of partitions from the input data can be unexpectedly large, e.g. if you do

sc.textFile(...).barrier().mapPartitions()

the number of input partitions is based on the HDFS input splits. We should provide a way in RDDBarrier for users to specify the number of tasks in a barrier stage, maybe something like RDDBarrier.coalesce(numPartitions: Int).
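A minimal sketch of what the proposed API could look like, assuming it simply delegates to RDD.coalesce before the barrier stage is planned. RDDBarrier.coalesce does not exist yet, so both the method body and the usage below are illustrative assumptions, not the actual implementation; the input path is a placeholder.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical extension of org.apache.spark.rdd.RDDBarrier; the real class
// today only exposes mapPartitions over the wrapped RDD.
class RDDBarrier[T: ClassTag](rdd: RDD[T]) {
  // Coalesce the underlying RDD so the barrier stage launches exactly
  // numPartitions tasks instead of one task per HDFS input split.
  def coalesce(numPartitions: Int): RDDBarrier[T] =
    new RDDBarrier(rdd.coalesce(numPartitions))
}

// Intended usage (spark-shell style, where sc is the SparkContext):
// cap the barrier stage at 4 tasks regardless of how many splits HDFS reports.
sc.textFile("hdfs:///path/to/input")   // placeholder path
  .barrier()
  .coalesce(4)                         // the method proposed in this ticket
  .mapPartitions(iter => iter)         // barrier task body goes here

Since RDD.coalesce(numPartitions) without shuffle only merges partitions, this sketch would let users shrink a barrier stage but not grow it, which matches the motivating problem of too many input splits.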