[SPARK-19753] Remove all shuffle files on a host in case of slave lost or fetch failure - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.1
Fix Version/s: 2.3.0
Component/s: Scheduler, Spark Core
Labels: None

Description

Currently, when we detect a fetch failure, we only remove the shuffle files produced by that one executor, even though the host itself might be down and none of the shuffle files on it are accessible. When multiple executors run on one host, a single host going down therefore results in multiple fetch failures and multiple retries of the stage, which is very inefficient. If we instead remove all the shuffle files on that host on the first fetch failure, we can rerun all the affected tasks in a single stage retry.
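The retry behavior described above can be illustrated with a toy model. This is a minimal sketch, not Spark's scheduler code: `Location`, `count_stage_retries`, and the host and executor names are all hypothetical, and the model assumes the pessimistic case in which each stage attempt surfaces a fetch failure for only one executor before the stage is retried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Where a map output lives (hypothetical stand-in for a block manager id)."""
    host: str
    executor_id: str


def count_stage_retries(outputs, dead_host, healthy, per_host):
    """Count stage retries until no map output points at the dead host.

    outputs:  dict of map_id -> Location
    per_host: if True, unregister every output on the failed host at once;
              if False, unregister only the outputs of the failed executor.
    Assumption: one fetch failure is handled per stage attempt.
    """
    outputs = dict(outputs)  # work on a copy
    retries = 0
    while True:
        # First unreachable location a reducer would hit, if any.
        failed = next(
            (loc for loc in outputs.values() if loc.host == dead_host), None
        )
        if failed is None:
            return retries
        if per_host:
            lost = [m for m, loc in outputs.items() if loc.host == failed.host]
        else:
            lost = [m for m, loc in outputs.items() if loc == failed]
        # Rerun the lost map tasks on a healthy executor, then retry the stage.
        for m in lost:
            outputs[m] = healthy
        retries += 1


# Three executors on host-a with two map outputs each; host-a goes down.
healthy = Location("host-b", "exec-9")
outputs = {
    map_id: Location("host-a", f"exec-{map_id // 2}") for map_id in range(6)
}

per_executor = count_stage_retries(outputs, "host-a", healthy, per_host=False)
whole_host = count_stage_retries(outputs, "host-a", healthy, per_host=True)
print(per_executor, whole_host)
```

Under these assumptions, per-executor cleanup needs one retry per executor on the lost host, while per-host cleanup needs a single retry regardless of how many executors were running there.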

Issue Links

is related to

SPARK-20178 Improve Scheduler fetch failures (Resolved)

links to

[Github] Pull Request #17088 (sitalkedia)

[Github] Pull Request #18150 (sitalkedia)

People

Assignee: Sital Kedia

Reporter: Sital Kedia

Votes: 0

Watchers: 14

Dates

Created: 27/Feb/17 20:31

Updated: 17/May/20 17:46

Resolved: 14/Jun/17 03:34