[SPARK-14649] DagScheduler re-starts all running tasks on fetch failure - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: Scheduler, Spark Core
Labels:
- bulk-closed

Description

When a fetch failure occurs, the DAGScheduler re-launches the previous stage (to regenerate the missing output) and then re-launches every task in the stage with the fetch failure that had not completed when the failure occurred. (The DAGScheduler re-launches all tasks whose output data is not available, which is equivalent to the set of tasks that had not yet completed.)
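The resubmission rule described above can be modelled with a toy example. This is a minimal sketch and not Spark's actual DAGScheduler code: the `Task` class, the `Status` values, and the `tasks_to_resubmit` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    FETCH_FAILED = "fetch_failed"
    COMPLETED = "completed"


@dataclass(frozen=True)
class Task:
    partition: int   # hypothetical partition index within the stage
    status: Status


def tasks_to_resubmit(tasks):
    """Return the partitions re-launched after a fetch failure.

    Models the behaviour described in this issue: every task whose output
    is not yet available (anything not COMPLETED) is re-launched, including
    tasks that are still running normally.
    """
    return [t.partition for t in tasks if t.status is not Status.COMPLETED]


stage = [
    Task(0, Status.COMPLETED),
    Task(1, Status.RUNNING),
    Task(2, Status.FETCH_FAILED),
    Task(3, Status.RUNNING),
]
resubmitted = tasks_to_resubmit(stage)
```

In this model only partition 2 actually hit the fetch failure, yet the running partitions 1 and 3 are selected for re-launch as well.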

When this code was originally written, the assumption was that a fetch failure meant the output from at least one task in the previous stage was no longer available, so every task in the current stage would eventually fail because it could not access that output. This assumption does not hold for some large-scale, long-running workloads. For example, in one use case a job has ~100k tasks that each run for about 1 hour, of which only the first 5-10 minutes are spent fetching data. Because of the large number of tasks, it is very common for a few tasks to fail in the fetch phase, and it is wasteful to re-run other tasks that have already finished fetching data, are therefore unaffected by the fetch failure, and may be most of the way through their hour-long execution. The DAGScheduler should not re-start these tasks.
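One way to picture the requested behaviour is sketched below. It is an illustrative model only, not the change proposed in the linked pull requests: the `RunningTask` type, its `fetch_done` and `fetch_failed` flags, and `split_on_fetch_failure` are all hypothetical. The sketch also assumes that tasks still in their fetch phase are re-launched as before, which this issue does not specify.

```python
from typing import NamedTuple


class RunningTask(NamedTuple):
    partition: int
    fetch_done: bool     # hypothetical flag: all shuffle input already fetched
    fetch_failed: bool   # this task reported the fetch failure


def split_on_fetch_failure(tasks):
    """Split unfinished tasks into (resubmit, keep_running).

    A task is kept running only if it has finished fetching and did not
    itself fail; everything else is re-launched (an assumption of this sketch).
    """
    resubmit, keep_running = [], []
    for t in tasks:
        if t.fetch_failed or not t.fetch_done:
            resubmit.append(t.partition)
        else:
            keep_running.append(t.partition)
    return resubmit, keep_running


unfinished = [
    RunningTask(1, fetch_done=True, fetch_failed=False),
    RunningTask(2, fetch_done=False, fetch_failed=True),
    RunningTask(3, fetch_done=True, fetch_failed=False),
    RunningTask(4, fetch_done=False, fetch_failed=False),
]
resubmit, keep_running = split_on_fetch_failure(unfinished)
```

Under these assumptions, partitions 1 and 3 keep the work they have already done, while partitions 2 and 4 are re-launched once the missing output has been regenerated.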

Issue Links

is related to: SPARK-20178 Improve Scheduler fetch failures (Resolved)

links to

[Github] Pull Request #12436 (sitalkedia)

[Github] Pull Request #17297 (sitalkedia)

People

Assignee: Unassigned

Reporter: Sital Kedia

Votes: 1

Watchers: 12

Dates

Created: 15/Apr/16 01:56

Updated: 17/May/20 17:48

Resolved: 21/May/19 04:14