Details
- Type: Story
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 3.4.0
- Labels: None
Description
Today, Spark’s executeTake() code allows the limitScaleUpFactor to be customized, but it does not allow the initial number of partitions to be customized: that value is currently hardcoded to 1.
We should add a configuration so that the initial partition count can be customized. Setting this new configuration to a high value would effectively mitigate the “run multiple jobs” overhead of the take behavior. Setting it to a higher-than-1-but-still-small value (say, 10) would give a middle-ground trade-off.
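For context, the sketch below shows roughly how the scale-up loop behaves today; it is a simplified illustration, not the actual Spark source (takeSketch and runJob are placeholders), meant only to show why a hardcoded initial value of 1 can force several follow-up jobs:

```scala
// Simplified sketch of the take() scale-up loop, NOT the actual Spark source.
// `runJob` stands in for launching a Spark job over the given partition ids.
def takeSketch[T](n: Int, totalParts: Int, scaleUpFactor: Int)(
    runJob: Seq[Int] => Seq[T]): Seq[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var partsScanned = 0
  // Today the first job always scans a single partition; this `1L` is the
  // value the proposal would make configurable.
  var numPartsToTry = 1L
  while (buf.size < n && partsScanned < totalParts) {
    if (partsScanned > 0) {
      // Not enough rows yet: grow the next batch by limitScaleUpFactor
      // (the real code also estimates from the rows seen so far).
      numPartsToTry = numPartsToTry * scaleUpFactor
    }
    val upper = math.min(partsScanned + numPartsToTry, totalParts.toLong).toInt
    val partsToScan = partsScanned until upper
    buf ++= runJob(partsToScan).take(n - buf.size)
    partsScanned += partsToScan.size
  }
  buf.toSeq
}
```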
Essentially, we need to make numPartsToTry = 1L (code) customizable. We should do this via a new SQL conf, similar to the limitScaleUpFactor conf.
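One possible shape for the new entry, modeled on the existing spark.sql.limit.scaleUpFactor definition in SQLConf. The conf name spark.sql.limit.initialNumPartitions, the constant name, and the doc text are illustrative placeholders rather than settled API:

```scala
// Sketch only: this would sit in org.apache.spark.sql.internal.SQLConf,
// next to the existing LIMIT_SCALE_UP_FACTOR entry.
val LIMIT_INITIAL_NUM_PARTITIONS = buildConf("spark.sql.limit.initialNumPartitions")
  .doc("Initial number of partitions scanned by the first job of a take()/limit " +
    "execution. Larger values make follow-up jobs less likely at the cost of " +
    "scanning more partitions up front.")
  .version("3.4.0")
  .intConf
  .checkValue(_ > 0, "initial number of partitions must be positive")
  .createWithDefault(1)
```

executeTake would then seed numPartsToTry from this conf instead of the literal 1L.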
Spark has several near-duplicate versions of this code (see code search) in:
- SparkPlan
- RDD
- pyspark rdd
In addition, limitScaleUpFactor is not supported in pyspark either. For now I will focus on the Scala side, leave the Python side untouched, and sync with the pyspark members in the meantime. Depending on progress, we can either do it all in one PR or land the Scala-side change first and leave the pyspark change as a follow-up.
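For illustration, here is how the setting could be exercised once the Scala-side change lands. spark.sql.limit.scaleUpFactor already exists today; spark.sql.limit.initialNumPartitions is the hypothetical new conf proposed above:

```scala
import org.apache.spark.sql.SparkSession

object TakeInitialPartitionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("take-initial-partitions-demo")
      .master("local[*]")
      // Existing knob: how aggressively take() grows the partition count per retry.
      .config("spark.sql.limit.scaleUpFactor", 4L)
      // Proposed knob (hypothetical name): how many partitions the first job scans.
      .config("spark.sql.limit.initialNumPartitions", 10L)
      .getOrCreate()

    // With a selective filter, the first single-partition job of take() often
    // comes back short and forces extra jobs; a larger initial partition count
    // avoids most of that overhead.
    val rows = spark.range(0L, 1000000L, 1L, 200).filter("id % 97 = 0").take(20)
    println(rows.mkString(", "))

    spark.stop()
  }
}
```

The same value could also be supplied via spark-defaults.conf or --conf on spark-submit, just like the existing limit conf.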
Issue Links
- duplicates
  - SPARK-37605 Support the configuration of the initial number of scan partitions when executing a take on a query (Resolved)
- split to
  - SPARK-40238 support scaleUpFactor and initialNumPartition in pyspark rdd API (In Progress)
- links to