Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
2.0.0
None
Description
StormServerHandler used by Pacemaker Server (and by the Netty Server in each Worker) is fragile when handling certain Exceptions derived from IOException.
In Storm1 the same handler ignored ordinary Exceptions and terminated only for serious JVM errors such as OutOfMemoryError.
In Storm2 the handler does something similar but, instead of ignoring all 'regular' Exceptions, it keeps a set of ALLOWED_EXCEPTIONS that may be ignored. This set currently contains just IOException.
As the code currently stands, it ignores only exceptions whose class is exactly IOException. Any other exception, including subclasses of IOException, causes the runtime to terminate after logging "Received error in netty thread.. terminating server..."
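To illustrate the difference, here is a minimal, self-contained sketch. It does not reproduce the Storm source: the ALLOWED set and both method names are hypothetical stand-ins, and it assumes the allow-list is consulted by exact class, as described above.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.Collections;
import java.util.Set;

public class ExactMatchDemo {
    // Hypothetical stand-in for the allow-list; not the actual Storm field.
    static final Set<Class<?>> ALLOWED =
            Collections.<Class<?>>singleton(IOException.class);

    // Exact-class lookup: matches only a Throwable whose runtime class
    // is IOException itself, so subclasses fall through.
    static boolean allowedByExactClass(Throwable t) {
        return ALLOWED.contains(t.getClass());
    }

    // Subclass-aware lookup: matches IOException and anything derived from it.
    static boolean allowedBySubclass(Throwable t) {
        for (Class<?> c : ALLOWED) {
            if (c.isInstance(t)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable plain = new IOException("broken pipe");
        Throwable socket = new SocketException("Connection reset");

        System.out.println(allowedByExactClass(plain));  // true
        System.out.println(allowedByExactClass(socket)); // false
        System.out.println(allowedBySubclass(socket));   // true
    }
}
```

With the exact-class lookup, the SocketException is not recognised as allowed even though it is an IOException in the type hierarchy.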
When a connection from a worker to the Pacemaker Server terminates, whether expectedly (e.g. when a topology is killed) or unexpectedly (e.g. when a node in the cluster reboots), Pacemaker Server is likely to see a SocketException. This causes it to terminate.
Now, as SocketException is derived from IOException, I would say a more robust way for Pacemaker Server to handle this, and to achieve stability similar to that seen with Storm1, is to 'swallow' not only IOException itself but also any exception derived from it (which of course includes SocketException).
Modifying handleUncaughtException to make use of Utils.exceptionCauseIsInstanceOf would greatly enhance the stability of Pacemaker. As StormServerHandler is also used in the Worker's Netty Server, the Workers would become more resilient to networking exceptions too. For example, a Worker receiving a transfer from a remote that reboots should no longer restart; we do sometimes see a cascade of worker restarts under such scenarios.
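A rough sketch of the idea, under stated assumptions: causeIsInstanceOf below is a hypothetical helper written for this example in the spirit of Utils.exceptionCauseIsInstanceOf (it is not the Storm implementation), and shouldIgnore is a hypothetical stand-in for the decision the handler would make.

```java
import java.io.IOException;
import java.net.SocketException;

public class CauseChainDemo {
    // Hypothetical helper: true if t, or any Throwable in its cause chain,
    // is an instance of klass (including subclasses of klass).
    static boolean causeIsInstanceOf(Class<? extends Throwable> klass, Throwable t) {
        Throwable current = t;
        while (current != null) {
            if (klass.isInstance(current)) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    // Hypothetical decision: true means "log and carry on",
    // false means "treat as fatal and terminate".
    static boolean shouldIgnore(Throwable t) {
        return causeIsInstanceOf(IOException.class, t);
    }

    public static void main(String[] args) {
        // A dropped connection surfaces as a SocketException: ignored.
        System.out.println(shouldIgnore(new SocketException("Connection reset")));
        // The same exception wrapped by another layer: still ignored.
        System.out.println(shouldIgnore(
                new RuntimeException(new SocketException("Connection reset"))));
        // Unrelated failures are still treated as fatal.
        System.out.println(shouldIgnore(new IllegalStateException("bad state")));
    }
}
```

Walking the cause chain means that a SocketException wrapped by an intermediate layer is also tolerated, while exceptions unrelated to I/O still lead to termination.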
I have modified a build with such a change and can indeed see greater stability from Pacemaker Server.
I will have a pull request for the changes I have made linked to this issue soon.