[SPARK-17696] Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.6.3
Component/s: Spark Core, YARN
Labels:
None

Description

There's a race in the shutdown path of CoarseGrainedExecutorBackend that may lead to the process exiting with the wrong status. When the race triggers, you can see things like this in the driver logs in yarn-cluster mode:

14:38:20,114 [Driver] INFO  org.apache.spark.SparkContext - Successfully stopped SparkContext

And later:

14:38:22,455 [Reporter] WARN  org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: container_1470951093505_0001_01_000002 on host: xxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1470951093505_0001_01_000002
Exit code: 1

This happens because the user class is still running after the SparkContext is shut down, so the YarnAllocator instance is alive for long enough to fetch the exit status of the container. If the race is triggered, the container exits with the wrong status. In this case, enough containers hit the race that the application ended up failing due to too many container failures, even though the app would probably succeed otherwise.

The race is as follows:

CoarseGrainedExecutorBackend receives a StopExecutor
Before it can enqueue a "Shutdown" message, the socket is disconnected and NettyRpcEnv enqueues a "RemoteProcessDisconnected" message
"RemoteProcessDisconnected" is processed first, and calls "System.exit" with wrong exit code for this case.

You can see that in the executor logs: both messages are being processed.

14:38:20,093 [dispatcher-event-loop-9] INFO  org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver commanded a shutdown
14:38:20,286 [dispatcher-event-loop-9] ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver xxx:40988 disassociated! Shutting down.

The code needs to avoid this situation by ignoring the disconnect event if it's already shutting down.

Attachments

Issue Links

links to

[Github] Pull Request #15270 (vanzin)

Activity

People

Assignee:: Charles R Allen

Reporter:: Marcelo Masiero Vanzin

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 27/Sep/16 22:15

Updated:: 28/Sep/16 21:50

Resolved:: 28/Sep/16 21:40