[SPARK-37313] Child stage using merged output or not should be based on the availability of merged output from parent stage - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.2.1
Fix Version/s: None
Component/s: Shuffle, Spark Core
Labels: None

Description

As discussed in the thread in SPARK-37023, during a stage retry, if the parent stage has already generated merged output in a previous attempt, the child stage is currently not able to fetch that merged output, because this is controlled by dependency.shuffleMergeEnabled (see current implementation here) during the stage retry.

Instead of using a single variable to control behavior on both the mapper side (push side) and the reducer side (using merged output), whether a child stage uses merged output must be based only on whether merged output is available for it to use (as discussed here).
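
To make the proposal concrete, here is a minimal toy model of the two decisions, written in Python for brevity. It is a sketch under stated assumptions, not Spark's implementation: the class `ShuffleDependencyModel`, the `merged_partitions` set, and the three functions are hypothetical names invented for illustration. Only the flag name mirrors dependency.shuffleMergeEnabled from the description. The model assumes the flag is off during the retry and that merged output from the earlier attempt is still available.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ShuffleDependencyModel:
    """Toy stand-in for a shuffle dependency (hypothetical, not Spark's class)."""
    # Push-side switch: should mappers push blocks for merging?
    shuffle_merge_enabled: bool
    # Reducer partition ids for which merged output is available (assumed bookkeeping).
    merged_partitions: Set[int] = field(default_factory=set)


def should_push_blocks(dep: ShuffleDependencyModel) -> bool:
    """Mapper side: pushing is governed by the flag in both variants."""
    return dep.shuffle_merge_enabled


def fetch_plan_single_flag(dep: ShuffleDependencyModel, reduce_id: int) -> str:
    """Behavior described in the issue: one flag gates the reducer side too."""
    if dep.shuffle_merge_enabled and reduce_id in dep.merged_partitions:
        return "merged"
    return "unmerged"


def fetch_plan_by_availability(dep: ShuffleDependencyModel, reduce_id: int) -> str:
    """Proposed behavior: the reducer side looks only at availability."""
    return "merged" if reduce_id in dep.merged_partitions else "unmerged"


# Stage retry scenario: merged output exists from the earlier attempt,
# but the flag is off for the retry (an assumption of this sketch).
retry_dep = ShuffleDependencyModel(shuffle_merge_enabled=False,
                                   merged_partitions={0, 1})
```

In this model, `fetch_plan_single_flag(retry_dep, 0)` falls back to unmerged blocks even though merged output for partition 0 exists, while `fetch_plan_by_availability(retry_dep, 0)` uses it. The push-side decision is unchanged in both cases.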