Details
- Epic
- Status: Resolved
- Major
- Resolution: Incomplete
- 2.1.0
- None
- Fetch Failure Improvements
Description
We have had many discussions about improving the handling of fetch failures. There are currently four JIRA issues related to this: SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753.
We should compile a list of the things we want to improve and come up with one cohesive design.
I will put my initial thoughts in a follow-on comment.
Issue Links
- is related to
  - SPARK-2666 Always try to cancel running tasks when a stage is marked as zombie (Resolved)
  - SPARK-20230 FetchFailedExceptions should invalidate file caches in MapOutputTracker even if newer stages are launched (Resolved)
  - SPARK-20832 Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs (Resolved)
  - SPARK-2387 Remove the stage barrier for better resource utilization (Resolved)
- relates to
  - SPARK-13669 Job will always fail in the external shuffle service unavailable situation (Resolved)
  - SPARK-14649 DagScheduler re-starts all running tasks on fetch failure (Resolved)
  - SPARK-19753 Remove all shuffle files on a host in case of slave lost of fetch failure (Resolved)
  - SPARK-20115 Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable (Resolved)
  - SPARK-20163 Kill all running tasks in a stage in case of fetch failure (Closed)
  - SPARK-20091 DagScheduler should allow running concurrent attempts of a stage in case of multiple fetch failure (Resolved)
- links to