  1. Spark
  2. SPARK-33235 Push-based Shuffle Improvement Tasks
  3. SPARK-37313

Child stage using merged output or not should be based on the availability of merged output from parent stage

Details

    • Type: Sub-task
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.2.1
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      As discussed in the thread in SPARK-37023, during a stage retry, if the parent stage has already generated merged output in a previous attempt, the child stage is currently not able to fetch that merged output, because this is controlled by dependency.shuffleMergeEnabled during the stage retry (see current implementation here).

      Instead of using a single variable to control behavior on both the mapper side (push side) and the reducer side (using merged output), whether the child stage uses merged output must be based only on whether merged output is available for it to use (as discussed here).
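
      The difference between the two approaches can be sketched with a toy model. This is an illustration only, not Spark code: ShuffleDependencyModel, its fields, start_retry, and both decision functions are hypothetical names, and the assumption that a retry turns off pushing while leaving finalized merged output in place is a simplification of the scenario described above.

```python
from dataclasses import dataclass


@dataclass
class ShuffleDependencyModel:
    """Toy stand-in for a shuffle dependency; NOT Spark's ShuffleDependency."""
    push_enabled: bool = True       # mapper side: should map tasks push blocks?
    merge_finalized: bool = False   # has merged output been finalized?
    merged_partitions: int = 0      # hypothetical count of merged reduce partitions

    def start_retry(self) -> None:
        # Assumption for illustration: a retry disables pushing for the new
        # attempt but leaves already finalized merged output untouched.
        self.push_enabled = False


def use_merged_output_single_flag(dep: ShuffleDependencyModel) -> bool:
    # Behavior described in the issue: one flag gates both push and fetch.
    return dep.push_enabled


def use_merged_output_by_availability(dep: ShuffleDependencyModel) -> bool:
    # Proposed behavior: the reducer side looks only at what is available.
    return dep.merge_finalized and dep.merged_partitions > 0


# Parent stage produced merged output in its first attempt, then is retried.
dep = ShuffleDependencyModel(push_enabled=True, merge_finalized=True, merged_partitions=8)
dep.start_retry()
print(use_merged_output_single_flag(dep))      # False: merged output is ignored
print(use_merged_output_by_availability(dep))  # True: merged output is still used
```

      In this model the single-flag check discards merged output that still exists, while the availability check keeps the mapper-side and reducer-side decisions independent.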

          People

            Assignee: Unassigned
            Reporter: Minchu Yang (minyang)
