[SPARK-20178] Improve Scheduler fetch failures - ASF JIRA

Details

Type: Epic
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.1.0
Fix Version/s: None
Component/s: Scheduler, Spark Core
Labels:
- bulk-closed

Epic Name: Fetch Failure Improvements

Description

We have had many discussions about improving the handling of fetch failures. There are currently four JIRA issues related to this.

We should try to get a list of things we want to improve and come up with one cohesive design.

~~SPARK-20163~~, ~~SPARK-20091~~, ~~SPARK-14649~~, and ~~SPARK-19753~~

I will put my initial thoughts in a follow-up comment.

Issue Links

is related to

SPARK-2666 Always try to cancel running tasks when a stage is marked as zombie

Resolved

SPARK-20230 FetchFailedExceptions should invalidate file caches in MapOutputTracker even if newer stages are launched

Resolved

SPARK-20832 Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs

Resolved

SPARK-2387 Remove the stage barrier for better resource utilization

Resolved

relates to

SPARK-13669 Job will always fail in the external shuffle service unavailable situation

Resolved

SPARK-14649 DagScheduler re-starts all running tasks on fetch failure

Resolved

SPARK-19753 Remove all shuffle files on a host in case of slave lost or fetch failure

Resolved

SPARK-20115 Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable

Resolved

SPARK-20163 Kill all running tasks in a stage in case of fetch failure

Closed

SPARK-20091 DagScheduler should allow running concurrent attempts of a stage in case of multiple fetch failure

Resolved

links to

design doc

People

Assignee: Unassigned

Reporter: Thomas Graves

Votes: 3

Watchers: 35

Dates

Created: 31/Mar/17 13:25

Updated: 15/Jan/22 23:39

Resolved: 21/May/19 04:15