Uploaded image for project: 'Apache Storm'
  1. Apache Storm
  2. STORM-3658

Problematic worker stays alive because of a deadlock and race condition caused by ShutdownHooks

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.1.0, 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: storm-client
    • Labels:
      None

      Description

      During worker startup, it starts many threads including executor threads, and then registers two shutdown hooks, first hook will invoke worker::shutdown, the second hook will sleep for 3 seconds before force halting the whole process.

      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L141-L147

      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L253-L255

      We have seen a case where a deadlock happened. Imaging there is a bolt in a topology consistently fails at prepare stage and throws RuntimeException. This thread will eventually invoke Runtime.getRuntime().exit. The code path is:

      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L406-L407

      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/error/ReportErrorAndDie.java#L41

      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L510-L514

      There are three scenarios here.

      Scenario 1
      If two shutdown hooks are registered before this bolt's prepare method is invoked, when this bolt throws RuntimeException, it eventually invokes Runtime.getRuntime().exit, which triggers two shutdown hooks to start. And then this bolt executor thread waits for these two shutdown hooks to finish.

      https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L104-L111 (we use this openJDK8u242 version)

      However, what the first shutdown hooks does is to invoke worker::shutdown method
      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L467-L469
      which will interrupt all executor threads and then wait for them to finish

      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L95-L100

      However, this bolt executor thread ignores InterruptedException and continues to wait for the first hook to finish. Hence there is a dead lock between the first shutdown hook and the bolt executor thread. After 3 seconds, the second shutdown hook force halting the worker process. So in the log, you will see "Forcing Halt..."

      Scenario 2
      If the bolt executor thread invokes prepare method earlier than the main thread registers these two shutdown hooks, because the bolt executor thread already triggers shutdown, the main thread will receive an IllegalStateException("Shutdown in progress")

      2020-06-18 11:57:29.159 o.a.s.u.Utils main [ERROR] Received error in thread main.. terminating server...
      java.lang.Error: java.lang.IllegalStateException: Shutdown in progress
      

      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L1011-L1018

      https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L63-L67

      Since there is no shutdown hook registered, Runtime.getRuntime.exit invoked by the bolt executor continues; in the meantime, the main thread also invokes Runtime.getRuntime.exit because of IllegalStateException. Eventually the process dies.

      Scenario 3
      This is the worse case. In this scenario, after the first shutdown hook is registered, and before the second hook is registered, the bolt executor thread invokes prepare method and throws a RuntimeException.
      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L355-L356

      In this case, we know have the deadlock between the first shutdown hook and the bolt executor thread (like in scenario 1). The main thread will also have IllegalStateException when it tries to register the second shutdown hook, so main thread also invokes Runtime.getRuntime.exit (like in scenario 2). But this time, since the bolt executor thread already obtained the shutdown lock (because it invokes Runtime.getRuntime.exit earlier),
      https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/Shutdown.java#L208-L214

      the main thread has to wait for the bolt executor thread to release this Shutdown.class lock. However, there is a deadlock between bolt executor thread and the first shutdown hook(thread), the main thread, the bolt executor thread and the shutdown hook are all BLOCKED.

      And in this case, since the second shutdown hook (sleepKill) is not registered, and every other threads like heartbeat timers still work (not yet closed), no one (not worker itself, not nimbus or supervisor) will kill this worker.

      So this worker stays alive but it does not function. And since this executor bolt is blocked, it doesn't consume tuples from the receiveQ, so the receiveQ can be quickly fill up by upstreams, the credential refresh thread is also blocked because it won't give up until the credential tuple is published to the receiveQ.

      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L68-L69

      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/JCQueue.java#L300-L313

      How to produce the deadlock
      It can be produced by modifying the WordCountTopology to add

      @Override
          public void prepare(Map<String, Object> topoConf, TopologyContext context) {
              throw new RuntimeException("Runtime exception at WordCountBolt prepare");
          }
      

      in the WordCount bolt so everytime the prepare method is called, this bolt throws a RuntimeException.

      Optionally, add a delay in between registering two shutdown hooks can help reproduce scenario 3.

      Runtime.getRuntime().addShutdownHook(wrappedFunc);
              try {
                  Thread.sleep(100);
              } catch (InterruptedException e) {
                  LOG.info("Sleep for 100ms between hooks", e);
              }
              LOG.info("Wait for 100ms");
              Runtime.getRuntime().addShutdownHook(sleepKill);
      

      Solution
      There are better ways to deal with this issue. But a simple way is to register shutdown hooks before any executor threads are started to avoid IllegalStateException; And in
      https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L98-L101
      change it be a timed join so it won't wait indefinitely. This can prevent the deadlock from happening.

        Attachments

          Activity

            People

            • Assignee:
              ethanli Ethan Li
              Reporter:
              ethanli Ethan Li
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 0.5h
                0.5h