[SPARK-8167] Tasks that fail due to YARN preemption can cause job failure - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.3.1
Fix Version/s: 1.6.0
Component/s: Scheduler, Spark Core, YARN
Labels: None
Target Version/s: 1.6.0

Description

Tasks running on preempted executors are marked FAILED with an ExecutorLostFailure, so every preemption counts toward the task's failure limit. During a large resource shift this can quickly spiral out of control: the retried tasks get scheduled onto executors that are themselves preempted almost immediately, and the job ends up failing.

The current workaround is to set spark.task.maxFailures to a very high value, but that delays the detection of true failures. Ideally, we should distinguish preemption-induced task failures from true failures so that they don't count towards the failure limit.
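
The proposed behaviour can be illustrated with a small, self-contained model. This is only a sketch of the idea and not Spark's scheduler code: `TaskFailure`, `FailureTracker`, and the `caused_by_preemption` flag are hypothetical names invented for illustration, and the limit of 4 is assumed to mirror the default value of spark.task.maxFailures.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumption: mirrors the default of spark.task.maxFailures.
MAX_FAILURES = 4


@dataclass
class TaskFailure:
    """Hypothetical record of one failed task attempt."""
    task_id: int
    reason: str                 # e.g. "ExecutorLostFailure"
    caused_by_preemption: bool  # hypothetical flag, not a real Spark field


class FailureTracker:
    """Counts failures per task, ignoring those caused by preemption."""

    def __init__(self, max_failures: int = MAX_FAILURES) -> None:
        self.max_failures = max_failures
        self.counted = defaultdict(int)  # task_id -> counted failures

    def record(self, failure: TaskFailure) -> bool:
        """Record a failure; return True if the job should be aborted."""
        if failure.caused_by_preemption:
            # The task is still retried, but the failure is not counted.
            return False
        self.counted[failure.task_id] += 1
        return self.counted[failure.task_id] >= self.max_failures


tracker = FailureTracker()

# Ten preemptions in a row never trigger an abort.
preempted = TaskFailure(0, "ExecutorLostFailure", caused_by_preemption=True)
aborts_on_preemption = [tracker.record(preempted) for _ in range(10)]

# True failures still hit the limit on the fourth attempt.
real = TaskFailure(0, "ExecutorLostFailure", caused_by_preemption=False)
aborts_on_real = [tracker.record(real) for _ in range(4)]
```

With this separation the failure limit can stay low, so true failures still abort the job promptly while a wave of preemptions does not.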

Issue Links

links to

[Github] Pull Request #8007 (mccheah)

People

Assignee: Matt Cheah

Reporter: Patrick Woody

Votes: 1

Watchers: 17

Dates

Created: 08/Jun/15 19:26

Updated: 17/May/20 17:48

Resolved: 10/Sep/15 18:59