Apache Storm / STORM-2194

ReportErrorAndDie doesn't always die


Details

    Description

      For some time I've been trying to track down why certain exceptions leave Storm workers in a zombified state. I believe I've isolated the bug to the behaviour of :report-error-and-die/reportErrorAndDie in the executor. Essentially:

           :report-error-and-die (fn [error]
                                   (try
                                     ((:report-error <>) error)
                                     (catch Exception e
                                       (log-message "Error while reporting error to cluster, proceeding with shutdown")))
                                   (if (or
                                          (exception-cause? InterruptedException error)
                                          (exception-cause? java.io.InterruptedIOException error))
                                     (log-message "Got interrupted excpetion shutting thread down...")
                                     ((:suicide-fn <>))))
      

      has the grouping of the if form slightly wrong. As written, the suicide call is the else branch, so an InterruptedException/InterruptedIOException is logged and the worker never dies. It shouldn't either log or die; it should log under that condition and ALWAYS die.

      Basically:

           :report-error-and-die (fn [error]
                                   (try
                                     ((:report-error <>) error)
                                     (catch Exception e
                                       (log-message "Error while reporting error to cluster, proceeding with shutdown")))
                                   (if (or
                                          (exception-cause? InterruptedException error)
                                          (exception-cause? java.io.InterruptedIOException error))
                                     (log-message "Got interrupted excpetion shutting thread down..."))
                                   ((:suicide-fn <>)))
      

      After digging into the Java port of this code, it looks like a different bug was introduced while porting:

              if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
                      || Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
                  LOG.info("Got interrupted exception shutting thread down...");
                  suicideFn.run();
              }
      

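      That is how it was initially ported, which dies only when the error is an interrupt. STORM-2142 changed it to the following, which matches the original Clojure behaviour: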

              if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
                      || Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
                  LOG.info("Got interrupted exception shutting thread down...");
              } else {
                  suicideFn.run();
              }
      

      However, I believe the correct port is as described above:

              if (Utils.exceptionCauseIsInstanceOf(InterruptedException.class, e)
                      || Utils.exceptionCauseIsInstanceOf(java.io.InterruptedIOException.class, e)) {
                  LOG.info("Got interrupted exception shutting thread down...");
              }
              suicideFn.run();
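
      To make the difference between the three variants concrete, here is a minimal, self-contained sketch (not the actual Storm code). The class name, the handler method names and the causeIsInstanceOf helper are hypothetical stand-ins; in particular, I'm assuming Utils.exceptionCauseIsInstanceOf walks the cause chain in the way the helper does. Logging is reduced to comments so only the suicide behaviour is visible.

```java
import java.io.InterruptedIOException;
import java.util.concurrent.atomic.AtomicInteger;

class ReportErrorAndDieSketch {

    // Hypothetical handler shape: takes the error and the suicide callback.
    interface Handler {
        void handle(Throwable e, Runnable suicideFn);
    }

    // Assumed stand-in for Utils.exceptionCauseIsInstanceOf:
    // true if the throwable or any of its causes is an instance of klass.
    static boolean causeIsInstanceOf(Class<? extends Throwable> klass, Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (klass.isInstance(c)) {
                return true;
            }
        }
        return false;
    }

    static boolean isInterrupt(Throwable e) {
        return causeIsInstanceOf(InterruptedException.class, e)
                || causeIsInstanceOf(InterruptedIOException.class, e);
    }

    // Initial Java port: the suicide call sits inside the if, so it only runs for interrupts.
    static void initialPort(Throwable e, Runnable suicideFn) {
        if (isInterrupt(e)) {
            // log "Got interrupted exception shutting thread down..."
            suicideFn.run();
        }
    }

    // After STORM-2142 (same shape as the Clojure if): interrupts are logged, everything else dies.
    static void afterStorm2142(Throwable e, Runnable suicideFn) {
        if (isInterrupt(e)) {
            // log only
        } else {
            suicideFn.run();
        }
    }

    // Proposed port: log for interrupts, then always die.
    static void proposedPort(Throwable e, Runnable suicideFn) {
        if (isInterrupt(e)) {
            // log only
        }
        suicideFn.run();
    }

    // Runs a handler once and reports whether it invoked the suicide callback.
    static boolean dies(Handler h, Throwable e) {
        AtomicInteger calls = new AtomicInteger();
        h.handle(e, calls::incrementAndGet);
        return calls.get() > 0;
    }

    public static void main(String[] args) {
        Throwable interrupted = new RuntimeException(new InterruptedException());
        Throwable other = new IllegalStateException("boom");

        System.out.println("variant          interrupt  other");
        System.out.println("initialPort      " + dies(ReportErrorAndDieSketch::initialPort, interrupted)
                + "  " + dies(ReportErrorAndDieSketch::initialPort, other));
        System.out.println("afterStorm2142   " + dies(ReportErrorAndDieSketch::afterStorm2142, interrupted)
                + "  " + dies(ReportErrorAndDieSketch::afterStorm2142, other));
        System.out.println("proposedPort     " + dies(ReportErrorAndDieSketch::proposedPort, interrupted)
                + "  " + dies(ReportErrorAndDieSketch::proposedPort, other));
    }
}
```

      Under these assumptions, only the proposed variant calls the suicide function for both kinds of error. The other two each skip it for one kind, which is the zombie case.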
      

      I'll look into providing patches for the 1.x and 2.x branches shortly.


          People

            ppoulosk Paul Poulosky
            chawco Craig Hawco

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 2h 50m
