Details
- Type: Story
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 3.4.0
- Labels: None
Description
Today, Spark’s executeTake() code allows the limitScaleUpFactor to be customized, but it does not allow the initial number of partitions to be customized: that value is currently hardcoded to 1.
We should add a configuration so that the initial partition count can be customized. Setting this new configuration to a high value would effectively mitigate the “run multiple jobs” overhead of the take behavior. Setting it to a higher-than-1-but-still-small value (say, 10) would give a middle-ground trade-off.
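For context, the sketch below shows roughly how the scale-up loop behaves today; it is a simplified illustration, not the actual Spark source (takeSketch and runJob are placeholders), meant only to show why a hardcoded initial value of 1 can force several follow-up jobs:

```scala
// Simplified sketch of the take() scale-up loop, NOT the actual Spark source.
// `runJob` stands in for launching a Spark job over the given partition ids.
def takeSketch[T](n: Int, totalParts: Int, scaleUpFactor: Int)(
    runJob: Seq[Int] => Seq[T]): Seq[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var partsScanned = 0
  // Today the first job always scans a single partition; this `1L` is the
  // value the proposal would make configurable.
  var numPartsToTry = 1L
  while (buf.size < n && partsScanned < totalParts) {
    if (partsScanned > 0) {
      // Not enough rows yet: grow the next batch by limitScaleUpFactor
      // (the real code also estimates from the rows seen so far).
      numPartsToTry = numPartsToTry * scaleUpFactor
    }
    val upper = math.min(partsScanned + numPartsToTry, totalParts.toLong).toInt
    val partsToScan = partsScanned until upper
    buf ++= runJob(partsToScan).take(n - buf.size)
    partsScanned += partsToScan.size
  }
  buf.toSeq
}
```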
Essentially, we need to make numPartsToTry = 1L (code) customizable. We should do this via a new SQL conf, similar to the limitScaleUpFactor conf.
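One possible shape for the new entry, modeled on the existing spark.sql.limit.scaleUpFactor definition in SQLConf. The conf name spark.sql.limit.initialNumPartitions, the constant name, and the doc text are illustrative placeholders rather than settled API:

```scala
// Sketch only: this would sit in org.apache.spark.sql.internal.SQLConf,
// next to the existing LIMIT_SCALE_UP_FACTOR entry.
val LIMIT_INITIAL_NUM_PARTITIONS = buildConf("spark.sql.limit.initialNumPartitions")
  .doc("Initial number of partitions scanned by the first job of a take()/limit " +
    "execution. Larger values make follow-up jobs less likely at the cost of " +
    "scanning more partitions up front.")
  .version("3.4.0")
  .intConf
  .checkValue(_ > 0, "initial number of partitions must be positive")
  .createWithDefault(1)
```

executeTake would then seed numPartsToTry from this conf instead of the literal 1L.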
Spark has several near-duplicate versions of this code (see code search) in:
- SparkPlan
- RDD
- pyspark rdd
In addition, limitScaleUpFactor is not supported in pyspark either. For now I will focus on the Scala side, leave the Python side untouched, and sync with the pyspark members in the meantime. Depending on progress, we can either do it all in one PR or land the Scala-side change first and leave the pyspark change as a follow-up.
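For illustration, here is how the setting could be exercised once the Scala-side change lands. spark.sql.limit.scaleUpFactor already exists today; spark.sql.limit.initialNumPartitions is the hypothetical new conf proposed above:

```scala
import org.apache.spark.sql.SparkSession

object TakeInitialPartitionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("take-initial-partitions-demo")
      .master("local[*]")
      // Existing knob: how aggressively take() grows the partition count per retry.
      .config("spark.sql.limit.scaleUpFactor", 4L)
      // Proposed knob (hypothetical name): how many partitions the first job scans.
      .config("spark.sql.limit.initialNumPartitions", 10L)
      .getOrCreate()

    // With a selective filter, the first single-partition job of take() often
    // comes back short and forces extra jobs; a larger initial partition count
    // avoids most of that overhead.
    val rows = spark.range(0L, 1000000L, 1L, 200).filter("id % 97 = 0").take(20)
    println(rows.mkString(", "))

    spark.stop()
  }
}
```

The same value could also be supplied via spark-defaults.conf or --conf on spark-submit, just like the existing limit conf.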
Issue Links
- duplicates
  - SPARK-37605 Support the configuration of the initial number of scan partitions when executing a take on a query (Resolved)
- split to
  - SPARK-40238 support scaleUpFactor and initialNumPartition in pyspark rdd API (In Progress)
- links to