[SPARK-14180] Deadlock in CoarseGrainedExecutorBackend Shutdown - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- bulk-closed
Environment:

master branch. commit d6dc12ef0146ae409834c78737c116050961f350

Description

I'm fairly certain that https://github.com/apache/spark/pull/11031 introduced a deadlock in executor shutdown. As a result, executor shutdown hangs indefinitely. On Mesos at least, the hang lasts until spark.mesos.coarse.shutdownTimeout (default 10s) elapses, at which point the driver stops and force-kills the executors.

The deadlock is as follows:

1. CoarseGrainedExecutorBackend receives a Shutdown message, and handling it now blocks on rpcEnv.awaitTermination(): https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkEnv.scala#L95
2. rpcEnv.awaitTermination() blocks on dispatcher.awaitTermination(), which blocks until all dispatcher threads (MessageLoop threads) terminate.
3. However, the initial Shutdown message is itself handled by a Dispatcher MessageLoop thread, so that thread ends up waiting for its own termination. This mutual dependence results in a deadlock: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala#L216
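
The dependency cycle described above can be reproduced outside Spark with a plain java.util.concurrent thread pool. The following is only a minimal sketch under stated assumptions, not Spark's actual code: the class and method names (ShutdownDeadlockSketch, awaitFromInsidePool, awaitFromOutsidePool) are hypothetical, a single-thread ExecutorService stands in for the Dispatcher's MessageLoop threads, and a timeout is passed to awaitTermination so the demo returns instead of hanging forever.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the Dispatcher thread pool; not Spark code.
final class ShutdownDeadlockSketch {

    // A task running ON the pool shuts the pool down and then waits for the
    // pool to terminate. The pool cannot terminate until this task returns,
    // so the wait can only end by timing out.
    static boolean awaitFromInsidePool(long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Future<Boolean> terminated = pool.submit(() -> {
            pool.shutdown(); // running tasks are allowed to finish
            return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        });
        return terminated.get();
    }

    // The same shutdown sequence issued from a thread that does not belong
    // to the pool: no worker is waiting on itself, so termination completes.
    static boolean awaitFromOutsidePool(long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.submit(() -> { }).get(); // let one trivial task run first
        pool.shutdown();
        return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("terminated when awaited from inside the pool:  "
                + awaitFromInsidePool(200));
        System.out.println("terminated when awaited from outside the pool: "
                + awaitFromOutsidePool(2000));
    }
}
```

In this sketch the first call reports false because the wait times out, while the second reports true. With an unbounded wait, as in the shutdown path described in this issue, the first case would never return.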

Issue Links

links to

[Github] Pull Request #12012 (zsxwing)

People

Assignee: Unassigned
Reporter: Michael Gummelt
Votes: 0
Watchers: 3

Dates

Created: 26/Mar/16 20:14
Updated: 21/May/19 04:35
Resolved: 21/May/19 04:35