Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Incomplete
Environment: master branch, commit d6dc12ef0146ae409834c78737c116050961f350
Description
I'm fairly certain that https://github.com/apache/spark/pull/11031 introduced a deadlock in executor shutdown. As a result, executor shutdown hangs indefinitely. On Mesos at least, the hang lasts until spark.mesos.coarse.shutdownTimeout (default 10s) expires, at which point the driver stops, which force-kills the executors.
The deadlock is as follows:
- CoarseGrainedExecutorBackend receives a Shutdown message, which now blocks on rpcEnv.awaitTermination() https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkEnv.scala#L95
- rpcEnv.awaitTermination() blocks on dispatcher.awaitTermination(), which blocks until all dispatcher threads (MessageLoop threads) terminate
- However, the initial Shutdown message is itself handled by a Dispatcher MessageLoop thread, so that thread ends up waiting for its own termination. This circular dependency results in a deadlock. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala#L216
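The same pattern can be reproduced outside Spark with a plain java.util.concurrent thread pool. This is only a minimal sketch and not Spark code: the class and method names (SelfAwaitDemo, awaitFromInsidePool, awaitFromOutsidePool) are hypothetical, and it assumes the dispatcher's threads behave like an ordinary ExecutorService. A finite timeout stands in for the unbounded wait so that the example terminates instead of hanging.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Spark.
class SelfAwaitDemo {

    // A task running ON the pool shuts the pool down and then waits for the
    // pool to terminate. The pool cannot terminate while this task is still
    // running, so the wait can only end by timing out.
    public static boolean awaitFromInsidePool(long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Boolean> result = pool.submit(() -> {
            pool.shutdown();
            // With an unbounded wait, this call would never return.
            return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        });
        // The already-submitted task still completes after shutdown().
        return result.get();
    }

    // The same shutdown sequence driven from a thread that does not belong
    // to the pool: the worker finishes its task and the pool terminates.
    public static boolean awaitFromOutsidePool(long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> { /* trivial work */ });
        pool.shutdown();
        return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("terminated when awaited from inside:  " + awaitFromInsidePool(200));
        System.out.println("terminated when awaited from outside: " + awaitFromOutsidePool(5000));
    }
}
```

In this sketch the inside wait times out while the outside wait succeeds. That suggests the wait for dispatcher termination needs to happen on a thread that is not one of the MessageLoop threads, though how best to arrange that in Spark is not something this example settles.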