Apache Storm / STORM-4104

Pacemaker server stability issues - e.g. shuts down when topology killed

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.7.1
    • Component/s: storm-server
    • Labels: None

    Description

      StormServerHandler, which is used by Pacemaker Server (and by the Netty Server in each Worker), is fragile when handling exceptions derived from IOException.

      In Storm1 the same handler ignored exceptions and terminated only for serious JVM errors such as OutOfMemoryError.

      In Storm2 the handler behaves similarly but, instead of ignoring all 'regular' exceptions, it has a set of ALLOWED_EXCEPTIONS that may be ignored. This set currently contains only IOException.

      The code, as it currently stands, ignores only exceptions whose class is exactly IOException; exceptions derived from IOException are not matched. All other exceptions cause the runtime to terminate after logging "Received error in netty thread.. terminating server..."
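
      To illustrate why an exact-class check is fragile, here is a minimal standalone sketch. It assumes the check is a set-membership test on the exception's runtime class; the names ExactMatchDemo and isAllowedExact are hypothetical and are not taken from Storm's source.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.Collections;
import java.util.Set;

public class ExactMatchDemo {
    // Assumed shape of the allow-list: exception classes that may be swallowed.
    static final Set<Class<?>> ALLOWED_EXCEPTIONS =
            Collections.<Class<?>>singleton(IOException.class);

    // Hypothetical exact-class check: true only when the runtime class
    // of the throwable is itself a member of the set.
    static boolean isAllowedExact(Throwable t) {
        return ALLOWED_EXCEPTIONS.contains(t.getClass());
    }

    public static void main(String[] args) {
        // A plain IOException matches the allow-list.
        System.out.println(isAllowedExact(new IOException("plain")));      // true
        // SocketException extends IOException, but its runtime class is
        // SocketException.class, which is not in the set.
        System.out.println(isAllowedExact(new SocketException("reset")));  // false
    }
}
```

      Under this assumption, a SocketException raised when a peer disconnects falls through to the terminating branch even though it is an IOException by inheritance.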

      When a connection from a worker to the Pacemaker Server terminates, whether expected (e.g. killing a topology) or unexpected (e.g. a node in the cluster rebooting), Pacemaker Server is likely to see a SocketException, which causes it to terminate.

      Since SocketException is derived from IOException, a more robust approach, and one that achieves stability similar to Storm1, is for Pacemaker Server to 'swallow' not only IOException itself but any exception derived from it (which of course includes SocketException).

      Modifying handleUncaughtException to use Utils.exceptionCauseIsInstanceOf would greatly improve the stability of Pacemaker. Because StormServerHandler is also used in the Worker's Netty Server, Workers would become more resilient to networking exceptions as well. For example, a Worker receiving a transfer from a remote that reboots should no longer restart; we do sometimes see a cascade of worker restarts in such scenarios.
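
      The proposed behaviour can be sketched as follows. This is a standalone illustration, not the actual patch: causeIsInstanceOf is a hypothetical stand-in written on the assumption that Utils.exceptionCauseIsInstanceOf walks the cause chain and performs an instance-of test, and shouldSwallow is likewise an invented name.

```java
import java.io.IOException;
import java.net.SocketException;

public class CauseChainDemo {
    // Hypothetical stand-in for a helper such as Utils.exceptionCauseIsInstanceOf:
    // returns true if the throwable, or any throwable in its cause chain,
    // is an instance of the given class (subclasses included).
    static boolean causeIsInstanceOf(Class<? extends Throwable> klass, Throwable t) {
        while (t != null) {
            if (klass.isInstance(t)) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    // Hypothetical decision used by the handler: swallow anything that is,
    // or was caused by, an IOException; everything else remains fatal.
    static boolean shouldSwallow(Throwable t) {
        return causeIsInstanceOf(IOException.class, t);
    }

    public static void main(String[] args) {
        Throwable reset = new SocketException("Connection reset");
        Throwable wrapped = new RuntimeException("netty pipeline failure", reset);

        System.out.println(shouldSwallow(reset));                            // true
        System.out.println(shouldSwallow(wrapped));                          // true
        System.out.println(shouldSwallow(new IllegalStateException("bug"))); // false
        System.out.println(shouldSwallow(new OutOfMemoryError()));           // false
    }
}
```

      With a check of this shape, a dropped connection is logged and ignored, while serious JVM errors such as OutOfMemoryError still reach the terminating path.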

      I have modified a build with such a change and can indeed see greater stability from Pacemaker Server.

      I will have a pull request for the changes I have made linked to this issue soon.

          People

            Assignee: Scott Moore (scomo)
            Reporter: Scott Moore (scomo)
            Votes: 0
            Watchers: 1

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 0.5h