[SPARK-19560] Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor - ASF JIRA

Details

Type: Test
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.1.1
Fix Version/s: None
Component/s: Scheduler, Spark Core
Labels: None
Target Version/s: 2.2.0

Description

There's some tricky code around the case where the DAGScheduler learns of a ShuffleMapTask that completed successfully but ran on an executor that failed sometime after the task was launched. This case is tricky because the TaskSetManager (i.e., the lower-level scheduler) thinks the task completed successfully, but the DAGScheduler considers the output it generated to be no longer valid (because it was probably lost when the executor was lost). As a result, the DAGScheduler needs to re-submit the stage so that the task can be re-run. Some existing tests cover this case, but it is not clearly documented, so we should improve the tests to prevent future bugs (markhamstra ran into this while attempting to find a better fix for SPARK-19263).
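
The scenario above can be sketched with a toy model. This is not Spark's Scala code: the class, method, and field names below are hypothetical, and the epoch comparison is only an assumed way of telling that a task was launched before its executor was lost.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDagScheduler:
    """Toy stand-in for the DAGScheduler side of the scenario (hypothetical names)."""

    num_partitions: int
    epoch: int = 0  # bumped each time an executor is lost
    failed_epoch: dict = field(default_factory=dict)  # executor id -> epoch at which it was lost
    outputs: dict = field(default_factory=dict)  # partition -> executor holding its map output

    def launch_task(self) -> int:
        # A task remembers the epoch at which it was launched.
        return self.epoch

    def executor_lost(self, exec_id: str) -> None:
        self.epoch += 1
        self.failed_epoch[exec_id] = self.epoch
        # Map outputs stored on the lost executor are treated as gone.
        self.outputs = {p: e for p, e in self.outputs.items() if e != exec_id}

    def task_succeeded(self, partition: int, exec_id: str, task_epoch: int) -> bool:
        """Return True if the map output was registered, False if it was ignored."""
        lost_at = self.failed_epoch.get(exec_id)
        if lost_at is not None and task_epoch < lost_at:
            # The task was launched before its executor was lost, so its
            # output is assumed to be invalid even though the task "succeeded".
            return False
        self.outputs[partition] = exec_id
        return True

    def partitions_to_resubmit(self) -> list:
        # Partitions with no valid map output must be re-run in a new stage attempt.
        return [p for p in range(self.num_partitions) if p not in self.outputs]


sched = ToyDagScheduler(num_partitions=2)
epoch_a = sched.launch_task()  # task for partition 0, runs on exec-A
epoch_b = sched.launch_task()  # task for partition 1, runs on exec-B
registered_a = sched.task_succeeded(0, "exec-A", epoch_a)
sched.executor_lost("exec-B")  # exec-B fails after its task was launched
registered_b = sched.task_succeeded(1, "exec-B", epoch_b)  # late "success" arrives
missing = sched.partitions_to_resubmit()
```

In this sketch the late success from `exec-B` is ignored, so partition 1 stays missing and is the only one that needs to be re-run. That mirrors the mismatch described above: the lower-level scheduler sees a completed task while the stage still has to be re-submitted.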

Issue Links

links to

[Github] Pull Request #16892 (kayousterhout)

People

Assignee: Kay Ousterhout
Reporter: Kay Ousterhout
Votes: 0
Watchers: 2

Dates

Created: 11/Feb/17 05:45
Updated: 17/May/20 17:47
Resolved: 24/Feb/17 19:44