Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.1.1
Fix Version/s: None
Description
When a task fails, it is (1) added back to the pending task list and then (2) the corresponding blacklist policy is enforced (i.e., recording whether it can or cannot run on a particular node, executor, etc.). Unfortunately, with this ordering the retry can be assigned to the same executor, which may itself be shutting down and will immediately fail the retry. Instead, the order should be (1) update the blacklist state and then (2) make the task available for assignment, so that the blacklist policy is properly enforced. A sketch of the two orderings is shown below.
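To make the race concrete, here is a minimal Scala sketch of the two orderings. It mirrors the scheduler's behavior but is not the actual TaskSetManager code; the class and member names (SimpleTaskSetManager, pendingTasks, blacklistedExecs, the resourceOffer-style method) are simplified stand-ins for illustration only.

{code:scala}
import scala.collection.mutable

// Minimal sketch of the ordering bug and its fix; not Spark's real API.
class SimpleTaskSetManager {
  // Tasks waiting to be (re)scheduled.
  private val pendingTasks = mutable.Queue[Int]()
  // Executors this task set may no longer use.
  private val blacklistedExecs = mutable.Set[String]()

  // Buggy ordering: the retry becomes schedulable before the blacklist is
  // updated, so an offer from the failing executor can win the race and
  // pick the retry up again.
  def handleFailedTaskBuggy(taskId: Int, execId: String): Unit = {
    pendingTasks.enqueue(taskId)   // (1) retry is schedulable...
    blacklistedExecs += execId     // (2) ...before the executor is excluded
  }

  // Fixed ordering: exclude the failing executor first, then re-enqueue the
  // retry, so any subsequent offer from execId is rejected.
  def handleFailedTaskFixed(taskId: Int, execId: String): Unit = {
    blacklistedExecs += execId     // (1) update blacklist state
    pendingTasks.enqueue(taskId)   // (2) then make the task available again
  }

  // Offer handling: hand out a pending task only if the offering executor
  // is not blacklisted for this task set.
  def resourceOffer(execId: String): Option[Int] = {
    if (blacklistedExecs.contains(execId) || pendingTasks.isEmpty) None
    else Some(pendingTasks.dequeue())
  }
}
{code}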
The attached logs demonstrate the race condition.
See spark_executor.log.anon:
1. Task 55.2 fails on the executor
17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575)
java.lang.OutOfMemoryError: Java heap space
2. Immediately the same executor is assigned the retry task:
17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)
3. The retry task of course fails, since the executor is shutting down because of the original task 55.2 OOM failure.
See the spark_driver.log.anon:
The driver processes the lost task 55.2:
17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar####.masked-server.com, executor attempt_foobar####.masked-server.com-################.masked-server.com-################_0): java.lang.OutOfMemoryError: Java heap space
The driver then receives the ExecutorLostFailure for the retry task 55.3 (although it is obfuscated in these logs, the server info is the same):
17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar####.masked-server.com, executor attempt_foobar####.masked-server.com-################.masked-server.com-################0): ExecutorLostFailure (executor attempt_foobar####.masked-server.com-################.masked-server.com-############_####_0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.