SPARK-33574 (sub-task of SPARK-33235 Push-based Shuffle Improvement Tasks, project: Spark)

Improve locality for push-based shuffle, especially for join-like operations

Details

    • Type: Sub-task
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      Currently, with push-based shuffle, we set locality preferences only for ShuffledRDD and ShuffledRowRDD.

      In simple stage DAGs where a ShuffledRDD or ShuffledRowRDD is the only input RDD, Spark handles locality well. However, in a join-like operation, where a stage can consume multiple shuffle inputs or additional non-shuffle inputs, locality suffers because of how DAGScheduler currently determines the preferred location.
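
To illustrate the problem, here is a toy model in plain Python, not Spark code. It assumes each shuffle input picks its own merger hosts independently and that merged partition i is placed on `mergers[i % len(mergers)]`. The host names, both helper functions, and the placement rule are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def merged_partition_hosts(mergers: List[str], num_partitions: int) -> Dict[int, str]:
    """Toy placement rule (an assumption): partition i lives on mergers[i % len(mergers)]."""
    return {i: mergers[i % len(mergers)] for i in range(num_partitions)}


def fully_local_partitions(inputs: List[Dict[int, str]], num_partitions: int) -> int:
    """Count reduce partitions whose merged data for every input sits on one host."""
    return sum(
        1
        for i in range(num_partitions)
        if len({hosts[i] for hosts in inputs}) == 1
    )


num_partitions = 4
# Two sibling shuffle inputs of a join, each with an independently chosen merger set.
left = merged_partition_hosts(["host-a", "host-b"], num_partitions)
right = merged_partition_hosts(["host-c", "host-d"], num_partitions)

independent_local = fully_local_partitions([left, right], num_partitions)
print(independent_local)
```

In this model no reduce partition has both of its merged inputs on the same host, so wherever a reduce task runs, at least one input has to be fetched remotely.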

      With push-based shuffle, we could potentially reuse the same set of merger locations across sibling ShuffleMapStages. This would give much better locality on the reducer stage side, because the corresponding merged shuffle partitions of the multiple shuffle inputs would already be colocated.
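
The proposal can be sketched in a toy model, again in plain Python rather than Spark code. `MergerRegistry`, `mergers_for`, `host_of`, the stage id, and the modulo placement rule are hypothetical names and assumptions made for this illustration. They are not existing Spark APIs and not necessarily how the change would be implemented.

```python
from typing import Dict, List


class MergerRegistry:
    """Hypothetical helper: sibling map stages feeding one reducer stage share mergers."""

    def __init__(self) -> None:
        self._by_reducer_stage: Dict[int, List[str]] = {}

    def mergers_for(self, reducer_stage_id: int, candidates: List[str]) -> List[str]:
        # The first sibling to ask fixes the merger list; later siblings reuse it
        # and their own candidates are ignored.
        return self._by_reducer_stage.setdefault(reducer_stage_id, list(candidates))


def host_of(mergers: List[str], partition: int) -> str:
    """Toy placement rule (an assumption): partition i lives on mergers[i % len(mergers)]."""
    return mergers[partition % len(mergers)]


registry = MergerRegistry()
reducer_stage_id = 7  # arbitrary id for the stage that consumes both shuffles
left_mergers = registry.mergers_for(reducer_stage_id, ["host-a", "host-b"])
right_mergers = registry.mergers_for(reducer_stage_id, ["host-c", "host-d"])

# With a shared merger list, partition p of both inputs lands on the same host.
colocated = all(
    host_of(left_mergers, p) == host_of(right_mergers, p) for p in range(4)
)
print(right_mergers, colocated)
```

Because both sibling stages use the same merger list and the same placement rule, every reduce partition finds all of its merged inputs on a single host, which could then serve as the task's preferred location.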

          People

            Assignee: Unassigned
            Reporter: Min Shen (mshen)
            Votes: 0
            Watchers: 7