Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.0.0, 2.1.0, 2.2.0
-
None
Description
During worker startup, it starts many threads including executor threads, and then registers two shutdown hooks, first hook will invoke worker::shutdown, the second hook will sleep for 3 seconds before force halting the whole process.
We have seen a case where a deadlock happened. Imaging there is a bolt in a topology consistently fails at prepare stage and throws RuntimeException. This thread will eventually invoke Runtime.getRuntime().exit. The code path is:
There are three scenarios here.
Scenario 1
If two shutdown hooks are registered before this bolt's prepare method is invoked, when this bolt throws RuntimeException, it eventually invokes Runtime.getRuntime().exit, which triggers two shutdown hooks to start. And then this bolt executor thread waits for these two shutdown hooks to finish.
https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L104-L111 (we use this openJDK8u242 version)
However, what the first shutdown hooks does is to invoke worker::shutdown method
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L467-L469
which will interrupt all executor threads and then wait for them to finish
However, this bolt executor thread ignores InterruptedException and continues to wait for the first hook to finish. Hence there is a dead lock between the first shutdown hook and the bolt executor thread. After 3 seconds, the second shutdown hook force halting the worker process. So in the log, you will see "Forcing Halt..."
Scenario 2
If the bolt executor thread invokes prepare method earlier than the main thread registers these two shutdown hooks, because the bolt executor thread already triggers shutdown, the main thread will receive an IllegalStateException("Shutdown in progress")
2020-06-18 11:57:29.159 o.a.s.u.Utils main [ERROR] Received error in thread main.. terminating server... java.lang.Error: java.lang.IllegalStateException: Shutdown in progress
Since there is no shutdown hook registered, Runtime.getRuntime.exit invoked by the bolt executor continues; in the meantime, the main thread also invokes Runtime.getRuntime.exit because of IllegalStateException. Eventually the process dies.
Scenario 3
This is the worse case. In this scenario, after the first shutdown hook is registered, and before the second hook is registered, the bolt executor thread invokes prepare method and throws a RuntimeException.
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L355-L356
In this case, we know have the deadlock between the first shutdown hook and the bolt executor thread (like in scenario 1). The main thread will also have IllegalStateException when it tries to register the second shutdown hook, so main thread also invokes Runtime.getRuntime.exit (like in scenario 2). But this time, since the bolt executor thread already obtained the shutdown lock (because it invokes Runtime.getRuntime.exit earlier),
https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/Shutdown.java#L208-L214
the main thread has to wait for the bolt executor thread to release this Shutdown.class lock. However, there is a deadlock between bolt executor thread and the first shutdown hook(thread), the main thread, the bolt executor thread and the shutdown hook are all BLOCKED.
And in this case, since the second shutdown hook (sleepKill) is not registered, and every other threads like heartbeat timers still work (not yet closed), no one (not worker itself, not nimbus or supervisor) will kill this worker.
So this worker stays alive but it does not function. And since this executor bolt is blocked, it doesn't consume tuples from the receiveQ, so the receiveQ can be quickly fill up by upstreams, the credential refresh thread is also blocked because it won't give up until the credential tuple is published to the receiveQ.
How to produce the deadlock
It can be produced by modifying the WordCountTopology to add
@Override public void prepare(Map<String, Object> topoConf, TopologyContext context) { throw new RuntimeException("Runtime exception at WordCountBolt prepare"); }
in the WordCount bolt so everytime the prepare method is called, this bolt throws a RuntimeException.
Optionally, add a delay in between registering two shutdown hooks can help reproduce scenario 3.
Runtime.getRuntime().addShutdownHook(wrappedFunc); try { Thread.sleep(100); } catch (InterruptedException e) { LOG.info("Sleep for 100ms between hooks", e); } LOG.info("Wait for 100ms"); Runtime.getRuntime().addShutdownHook(sleepKill);
Solution
There are better ways to deal with this issue. But a simple way is to register shutdown hooks before any executor threads are started to avoid IllegalStateException; And in
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L98-L101
change it be a timed join so it won't wait indefinitely. This can prevent the deadlock from happening.