Spark / SPARK-40211

Allow executeTake() / collectLimit's number of starting partitions to be customized


Details

    • Type: Story
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: Spark Core, SQL
    • Labels: None

    Description

      Today, Spark’s executeTake() code allows the limitScaleUpFactor to be customized, but it does not allow the initial number of partitions to be customized: it is currently hardcoded to 1.

      We should add a configuration so that the initial partition count can be customized. Setting this new configuration to a high value would effectively mitigate the “run multiple jobs” overhead of take()’s behavior. We could also set it to higher-than-1-but-still-small values (say, 10) to get a middle-ground trade-off, as sketched below.
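
      For context, the heart of take()/executeTake() is an incremental scale-up loop. The following is a simplified, illustrative sketch (not the actual SparkPlan/RDD code) with the initial partition count pulled out as a parameter named initialNumPartitions; today that value is effectively hardcoded to 1:

      {code:scala}
      import scala.reflect.ClassTag
      import org.apache.spark.rdd.RDD

      // Simplified sketch of the take() scale-up loop. Names and defaults are
      // illustrative; the real logic lives in SparkPlan.executeTake and RDD.take.
      def takeSketch[T: ClassTag](
          rdd: RDD[T],
          num: Int,
          limitScaleUpFactor: Int = 4,
          initialNumPartitions: Int = 1  // hardcoded to 1 today; the value this issue proposes to make configurable
        ): Array[T] = {
        val totalParts = rdd.partitions.length
        val buf = scala.collection.mutable.ArrayBuffer.empty[T]
        var partsScanned = 0
        while (buf.size < num && partsScanned < totalParts) {
          // The first round scans `initialNumPartitions` partitions; later rounds
          // scale up based on how many rows the previous rounds produced.
          var numPartsToTry = initialNumPartitions.toLong
          if (partsScanned > 0) {
            if (buf.isEmpty) {
              numPartsToTry = partsScanned * limitScaleUpFactor
            } else {
              numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
              numPartsToTry = Math.min(numPartsToTry, partsScanned * limitScaleUpFactor)
            }
          }
          val partsToScan = partsScanned until math.min(partsScanned + numPartsToTry, totalParts).toInt
          // Each iteration launches a separate Spark job; with a sparse result and
          // an initial partition count of 1, many small jobs may be needed.
          val res = rdd.sparkContext.runJob(
            rdd, (it: Iterator[T]) => it.take(num - buf.size).toArray, partsToScan)
          res.foreach(buf ++= _)
          partsScanned += partsToScan.size
        }
        buf.take(num).toArray
      }
      {code}

      With the initial count set to, say, 10, the very first job already scans 10 partitions, so far fewer scale-up rounds (and therefore jobs) are needed when matching rows are sparse; the cost is possibly scanning a few more partitions than strictly necessary.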

       

      Essentially, we need to make numPartsToTry = 1L (code) customizable. We should do this via a new SQL conf, similar to the limitScaleUpFactor conf.
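
      As a rough sketch, the new conf could be declared inside SQLConf next to the existing limitScaleUpFactor entry, using the same builder DSL. The entry name, key, doc text, and default below are assumptions for illustration, not a committed API:

      {code:scala}
      // Illustrative only: a possible declaration inside object SQLConf, mirroring
      // the existing spark.sql.limit.scaleUpFactor conf. The name, key, and default
      // value here are assumptions, not the final API.
      val LIMIT_INITIAL_NUM_PARTITIONS = buildConf("spark.sql.limit.initialNumPartitions")
        .doc("Initial number of partitions to scan in the first job launched by " +
          "take()/collectLimit. Larger values launch fewer incremental jobs at the " +
          "cost of possibly scanning more partitions than necessary.")
        .version("3.4.0")
        .intConf
        .checkValue(_ > 0, "initial number of partitions must be positive")
        .createWithDefault(1)

      // executeTake() would then start from this value instead of the hardcoded 1L,
      // e.g. something like:
      // var numPartsToTry = conf.getConf(SQLConf.LIMIT_INITIAL_NUM_PARTITIONS).toLong
      {code}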

       

      Spark has several near-duplicate versions of this code (see code search) in:

      • SparkPlan
      • RDD
      • pyspark rdd

      Also, limitScaleUpFactor is not supported in PySpark either. So for now I will focus on the Scala side first, leave the Python side untouched, and sync with PySpark maintainers in the meantime. Depending on progress, we can either do it all in one PR, or make the Scala-side change first and leave the PySpark change as a follow-up.

       


    People

    • Assignee: Ziqi Liu (liuzq12)
    • Reporter: Ziqi Liu (liuzq12)