[SPARK-13669] Job will always fail in the external shuffle service unavailable situation - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.3.0
Component/s: Scheduler, Spark Core, YARN
Labels: None

Description

We are currently running into an issue when YARN work-preserving restart is enabled together with the external shuffle service.

With work-preserving restart enabled, the failure of a NodeManager (NM) does not cause its executors to exit, so those executors can still accept and run tasks. However, while the NM is down, the external shuffle service is inaccessible, so reduce tasks always fail with a “Fetch failure”, and the failed reduce stage causes its parent stage (the map stage) to be rerun. The tricky part is that the Spark scheduler is not aware that the external shuffle service is unavailable, so it reschedules the map tasks on the executors whose NM has failed, and the reduce stage fails with a “Fetch failure” again. After 4 retries, the job fails.
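The retry loop described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Spark's scheduler code: `Executor`, `pick_executor_naive`, and `run_job` are hypothetical names, and the "always pick the first executor" policy is a stand-in for a scheduler that keeps choosing the same executor because it has no knowledge of shuffle service health.

```python
from dataclasses import dataclass

MAX_STAGE_ATTEMPTS = 4  # retry limit quoted in the description


@dataclass(frozen=True)
class Executor:
    executor_id: str
    host: str
    shuffle_service_up: bool  # False while the NodeManager on this host is down


def pick_executor_naive(executors):
    """Hypothetical scheduler policy that ignores shuffle service health.

    It always returns the first executor, mimicking a scheduler that keeps
    placing map tasks on the same executor.
    """
    return executors[0]


def run_job(executors, pick_executor, max_attempts=MAX_STAGE_ATTEMPTS):
    """Return (succeeded, attempts_used) for a one-map, one-reduce job."""
    for attempt in range(1, max_attempts + 1):
        map_executor = pick_executor(executors)
        # The map task itself succeeds: the executor is alive even if its NM is down.
        # The reduce task then fetches the map output through the external
        # shuffle service on the map executor's host.
        if map_executor.shuffle_service_up:
            return True, attempt
        # Fetch failure: the reduce stage fails and the map stage is rerun.
    return False, max_attempts


executors = [
    Executor("1", "host-a", shuffle_service_up=False),  # NM failed on host-a
    Executor("2", "host-b", shuffle_service_up=True),
]
result = run_job(executors, pick_executor_naive)
```

Under these assumptions the job exhausts all 4 attempts even though a healthy executor on `host-b` is available, which is the failure mode this issue reports.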

The main problem, therefore, is that tasks should not be assigned to these bad executors (those whose shuffle service is unavailable). Spark's current blacklist mechanism can blacklist executors/nodes based on task failures, but it does not handle this specific fetch failure scenario. This issue proposes improving the current application-level blacklist mechanism to handle fetch failures (especially those caused by an unavailable external shuffle service) by blacklisting the executors/nodes from which shuffle data cannot be fetched.
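The proposal can be sketched as a small tracker that records fetch failures and filters resource offers. This is a minimal illustration, not the implementation from the linked pull request: `FetchFailureBlacklist` and its methods are hypothetical, and the rule "blacklist the whole host when the external shuffle service is in use, otherwise only the executor" is an assumption of this sketch, based on the shuffle service being shared by every executor on a node.

```python
class FetchFailureBlacklist:
    """Hypothetical application-level blacklist driven by fetch failures."""

    def __init__(self, external_shuffle_service_enabled):
        self.external_shuffle_service_enabled = external_shuffle_service_enabled
        self.blacklisted_executors = set()
        self.blacklisted_hosts = set()

    def on_fetch_failure(self, executor_id, host):
        """Record that shuffle data served for this executor could not be fetched."""
        if self.external_shuffle_service_enabled:
            # Assumption: the shuffle service is per node, so every executor
            # on the host is affected when it is unavailable.
            self.blacklisted_hosts.add(host)
        else:
            self.blacklisted_executors.add(executor_id)

    def is_schedulable(self, executor_id, host):
        return (
            host not in self.blacklisted_hosts
            and executor_id not in self.blacklisted_executors
        )

    def filter_offers(self, offers):
        """Keep only (executor_id, host) offers that are not blacklisted."""
        return [(e, h) for (e, h) in offers if self.is_schedulable(e, h)]


offers = [("1", "host-a"), ("2", "host-a"), ("3", "host-b")]

with_service = FetchFailureBlacklist(external_shuffle_service_enabled=True)
with_service.on_fetch_failure("1", "host-a")
node_level = with_service.filter_offers(offers)

without_service = FetchFailureBlacklist(external_shuffle_service_enabled=False)
without_service.on_fetch_failure("1", "host-a")
executor_level = without_service.filter_offers(offers)
```

With the external shuffle service enabled, a single fetch failure removes both executors on `host-a` from consideration, so rerun map tasks land on `host-b`. A real mechanism would probably also need an expiry or recovery path so that a node becomes schedulable again once its NM restarts; that is left out of this sketch.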

Issue Links

is duplicated by

SPARK-22426 Spark AM launching containers on node where External spark shuffle service failed to initialize

Resolved

is related to

SPARK-20178 Improve Scheduler fetch failures

Resolved

links to

[Github] Pull Request #17113 (jerryshao)

People

Assignee: Saisai Shao

Reporter: Saisai Shao

Votes: 0

Watchers: 7

Dates

Created: 04/Mar/16 06:56

Updated: 10/Nov/17 03:50

Resolved: 26/Jun/17 16:17