Spark / SPARK-33235

Push-based Shuffle Improvement Tasks

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core

    Description

      This is the parent JIRA for follow-up improvement tasks supporting push-based shuffle. Refer to SPARK-30602.

        Issue Links

          1. Limit the number of pending blocks in memory and store blocks that collide (Sub-task, Open, Unassigned)
          2. Pluggable API to fetch shuffle merger locations with Push based shuffle (Sub-task, Open, Unassigned)
          3. Better heuristics to compute number of shuffle mergers required for a ShuffleMapStage (Sub-task, Open, Unassigned)
          4. Improve locality for push-based shuffle especially for join like operations (Sub-task, In Progress, Unassigned)
          5. Enable ShuffleBlockPusher to stop pushing blocks for a particular shuffle partition (Sub-task, Open, Unassigned)
          6. Improve caching of MergeStatus on the executor side to save memory (Sub-task, In Progress, Unassigned)
          7. Improve push based shuffle to work with AQE by fetching partial map indexes for a reduce partition (Sub-task, Open, Unassigned)
          8. When addMergerLocation exceeds the maxRetainedMergerLocations, we should remove the merger based on merged shuffle data size (Sub-task, Open, Unassigned)
          9. Cancel finalizing the shuffle merge if the stage is cancelled while waiting until shuffle merge finalize wait time (Sub-task, Open, Unassigned)
          10. Support push based shuffle when barrier scheduling is enabled (Sub-task, Open, Unassigned)
          11. Register merge status even after shuffle dependency is merge finalized (Sub-task, Open, Unassigned)
          12. Support IO encryption for push-based shuffle (Sub-task, Open, Unassigned)
          13. Child stage using merged output or not should be based on the availability of merged output from parent stage (Sub-task, Open, Unassigned)
          14. Replace usages of slaveTracker with workerTracker in MapOutputTrackerSuite (Sub-task, Open, Unassigned)
          15. Push-based shuffle's internal implementation details should not be exposed as API (Sub-task, Open, Unassigned)
          16. Check if shuffleMergeId is the same as the current stage's shuffleMergeId before registering MergeStatus (Sub-task, Open, Unassigned)
          17. Set shuffleMergeAllowed to false for a determinate stage after the stage is finalized (Sub-task, Open, Unassigned)
          18. JsonProtocol should skip logging of push-based shuffle read metrics when push-based shuffle is disabled (Sub-task, Open, Unassigned)

            People

              Assignee: Unassigned
              Reporter: Chandni Singh (csingh)
              Votes: 1
              Watchers: 23
