SPARK-21219

Task retry occurs on same executor due to race condition with blacklisting


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

      When a task fails, it is (1) added back into the pending task list and then (2) the corresponding blacklist policy is enforced (i.e., recording whether it can or cannot run on a particular node/executor/etc.). Unfortunately, with this ordering the retried task can be assigned to the same executor, which, incidentally, may be shutting down and will immediately fail the retry. Instead, the order should be: (1) update the blacklist state and then (2) assign the task, ensuring that the blacklist policy is properly enforced.
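
      To make the ordering concrete, here is a minimal, self-contained Scala sketch of the two orderings. All names (BlacklistOrderingSketch, pendingTasks, failedOn, offer, etc.) are illustrative stand-ins, not the actual TaskSetManager or blacklist internals:

      import scala.collection.mutable

      // Minimal sketch of a per-task-set blacklist keyed by (taskIndex, executorId).
      // All names here are hypothetical; this is not the real scheduler code.
      object BlacklistOrderingSketch {
        val pendingTasks = mutable.Queue[Int]()          // tasks waiting to be (re)scheduled
        val failedOn     = mutable.Set[(Int, String)]()  // (taskIndex, executorId) failures seen so far

        // Buggy ordering described in this issue: the task is re-queued (1)
        // before the failure is recorded (2), so an offer from the same,
        // possibly dying, executor can grab the retry in between.
        def handleFailedTaskBuggy(taskIndex: Int, execId: String): Unit = {
          pendingTasks.enqueue(taskIndex)    // (1) task becomes schedulable again
          failedOn += ((taskIndex, execId))  // (2) blacklist updated too late
        }

        // Fixed ordering: record the failure first, then re-queue, so any
        // later offer from the failed executor is rejected.
        def handleFailedTaskFixed(taskIndex: Int, execId: String): Unit = {
          failedOn += ((taskIndex, execId))
          pendingTasks.enqueue(taskIndex)
        }

        // Offer-side check: hand out the first pending task that has not
        // already failed on the offering executor.
        def offer(execId: String): Option[Int] =
          pendingTasks.dequeueFirst(idx => !failedOn.contains((idx, execId)))
      }

      With the buggy ordering, an offer that arrives between steps (1) and (2) sees no blacklist entry for the failed executor and hands the retry right back to it.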

      The attached logs demonstrate the race condition.

      See spark_executor.log.anon:

      1. Task 55.2 fails on the executor

      17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575)
      java.lang.OutOfMemoryError: Java heap space

      2. Immediately the same executor is assigned the retry task:

      17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
      17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)

      3. The retry task of course fails, since the executor is shutting down due to the original task 55.2 OOM failure; the sketch-based replay below walks through this interleaving.
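
      Replaying this sequence against the hypothetical sketch from the description (again with made-up names and executor ids) shows how the buggy ordering lets the offer win the race, while the fixed ordering rejects it:

      // Illustrative replay of the race; depends on the BlacklistOrderingSketch
      // object defined above, and on hypothetical names throughout.
      object RaceDemo extends App {
        import BlacklistOrderingSketch._

        // Buggy ordering: the offer from the failing executor arrives between
        // re-queue (1) and blacklist update (2), mirroring the executor log.
        pendingTasks.enqueue(55)                 // (1) task 55 re-queued after its OOM
        println(offer("dying-executor"))         // Some(55): retry lands on the same executor
        failedOn += ((55, "dying-executor"))     // (2) failure recorded too late

        // Fixed ordering: the failure is recorded before the task becomes
        // schedulable, so the same offer now comes back empty.
        pendingTasks.clear(); failedOn.clear()
        handleFailedTaskFixed(55, "dying-executor")
        println(offer("dying-executor"))         // None: retry must wait for a different executor
      }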

      See spark_driver.log.anon:

      The driver processes the lost task 55.2:

      17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar####.masked-server.com, executor attempt_foobar####.masked-server.com-################.masked-server.com-################_0): java.lang.OutOfMemoryError: Java heap space

      The driver then receives the ExecutorLostFailure for the retry task 55.3 (although obfuscated in these logs, the server info is the same):

      17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar####.masked-server.com, executor attempt_foobar####.masked-server.com-################.masked-server.com-################0): ExecutorLostFailure (executor attempt_foobar####.masked-server.com-################.masked-server.com-############_####_0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
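
      For reference, the blacklist behavior that this ordering is meant to honor is driven by the spark.blacklist.* configuration keys in the affected releases. The sketch below uses key names from the Spark configuration docs; defaults and exact semantics vary by version, so treat it as illustrative:

      import org.apache.spark.SparkConf

      object BlacklistConfSketch {
        // Configuration sketch: spark.blacklist.* keys as documented for Spark 2.1+.
        // Exact defaults and semantics vary by release; check the docs for your version.
        val conf: SparkConf = new SparkConf()
          .setAppName("blacklist-enabled-app")
          .set("spark.blacklist.enabled", "true")                      // blacklisting is off by default
          .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1") // retries must go to a different executor
          .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")     // and, after enough failures, to a different node
      }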

      Attachments

        1. spark_executor.log.anon (8 kB, Eric Vandenberg)
        2. spark_driver.log.anon (4 kB, Eric Vandenberg)


          People

            Assignee: Eric Vandenberg (ericvandenbergfb)
            Reporter: Eric Vandenberg (ericvandenbergfb)
            Votes: 0
            Watchers: 5
