When running in local mode, Spark can react to certain types of Error throwables in developer code or third-party libraries by calling System.exit(), potentially killing a long-running JVM/service. The caller should decide how the exception is handled and whether the error should be deemed fatal.
Consider this scenario:
- Spark is being used in local master mode within a long-running JVM microservice, e.g. a Jetty instance.
- A task is run. The task errors with particular types of unchecked throwables:
- a) there is some bad code and/or bad data exposing a bug with unterminated recursion, leading to a StackOverflowError, or
- b) an infrequently used function is called; owing to a packaging error in the service, a third-party library is missing some dependencies and a NoClassDefFoundError is thrown.
Expected behaviour: Since we are running in local mode, we might expect an unchecked exception to be thrown, to be optionally handled by the Spark caller. In the case of Jetty, a request thread or some other background worker thread might handle the exception or not; the thread might exit or record an error. Either way, the caller should decide how the error is handled.
Actual behaviour: System.exit() is called, the JVM exits and the Jetty microservice is down and must be restarted.
Consequence: Any local code or third-party library can cause Spark to take down the long-running JVM/microservice, so Spark is a liability in this architecture. I have now seen this on three separate occasions, each leading to a service-down bug report.
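To make the expected behaviour concrete: outside Spark, an Error on a worker thread is the caller's to handle. The sketch below (plain Java, no Spark; `CallerDecides` and `runWorker` are illustrative names, not anything from Spark) shows a thread dying with a StackOverflowError while the enclosing JVM keeps running, because the caller installed its own uncaught-exception handler:

```java
import java.util.concurrent.atomic.AtomicReference;

public class CallerDecides {

    // Run a worker that hits unterminated recursion and return what it
    // died with. The enclosing service, not a framework, decides what a
    // worker-thread Error means: here we record it and keep the JVM alive.
    public static Throwable runWorker() throws InterruptedException {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread worker = new Thread(CallerDecides::recurse);
        worker.setUncaughtExceptionHandler((t, e) -> seen.set(e));
        worker.start();
        worker.join();
        return seen.get();
    }

    // Simulate a task bug: unterminated recursion.
    private static void recurse() { recurse(); }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("JVM still running; worker died with: "
                + runWorker().getClass().getSimpleName());
    }
}
```

This is exactly the decision point the issue argues Spark should leave to the caller in local mode, instead of exiting the whole JVM.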
The line of code that seems to be the problem is: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L405
Utils.isFatalError() first tests against Scala's NonFatal, which matches everything except VirtualMachineError, ThreadDeath, InterruptedException, LinkageError and ControlThrowable. Utils.isFatalError() then additionally excludes InterruptedException, NotImplementedError and ControlThrowable from the fatal set.
That leaves Errors such as StackOverflowError (a VirtualMachineError) or NoClassDefFoundError (a LinkageError), which occur in the aforementioned scenarios. For these, SparkUncaughtExceptionHandler.uncaughtException() proceeds to call System.exit().
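The combined effect of the two filters can be sketched as follows. This is a simplified restatement in plain Java, not Spark's actual code; `isFatal` and `isScalaFatal` are illustrative names, and ControlThrowable/NotImplementedError are Scala types we only stub out here to avoid a Scala dependency:

```java
public class FatalErrorCheck {

    // Scala's NonFatal matches everything EXCEPT these categories.
    public static boolean isScalaFatal(Throwable t) {
        return t instanceof VirtualMachineError   // includes StackOverflowError
            || t instanceof ThreadDeath
            || t instanceof InterruptedException
            || t instanceof LinkageError          // includes NoClassDefFoundError
            || isControlThrowable(t);
    }

    // Utils.isFatalError() then carves InterruptedException (and Scala's
    // NotImplementedError / ControlThrowable) back out of the fatal set.
    public static boolean isFatal(Throwable t) {
        return isScalaFatal(t)
            && !(t instanceof InterruptedException)
            && !isControlThrowable(t);
    }

    // Stub: scala.util.control.ControlThrowable, omitted in this sketch.
    static boolean isControlThrowable(Throwable t) { return false; }

    public static void main(String[] args) {
        System.out.println(isFatal(new StackOverflowError()));   // true
        System.out.println(isFatal(new NoClassDefFoundError())); // true
        System.out.println(isFatal(new RuntimeException()));     // false
        System.out.println(isFatal(new InterruptedException())); // false
    }
}
```

As the sketch shows, the two throwables from the scenarios above land squarely in the "fatal" bucket, which is what triggers the exit.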
Further up in Executor we can see that registering SparkUncaughtExceptionHandler is already excluded when running in local mode.
The same exclusion must be applied in local mode for "fatal" errors: we cannot afford to shut down the enclosing JVM (e.g. Jetty); the caller should decide.
A minimal test-case is supplied. It installs a logging SecurityManager to confirm that System.exit() was called from SparkUncaughtExceptionHandler.uncaughtException() via Executor. It also hints at a workaround: install your own SecurityManager and inspect the current stack in checkExit() to prevent Spark from exiting the JVM.
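That workaround could be sketched roughly as below. This is illustrative only: matching on the class name "SparkUncaughtExceptionHandler" in the stack is an assumption about where the exit originates, and the SecurityManager mechanism is deprecated from Java 17 onward, so this only suits the JVM versions contemporary with this issue:

```java
import java.security.Permission;

public class SparkExitGuard extends SecurityManager {

    @Override
    public void checkPermission(Permission perm) {
        // Permit everything else; we only intercept exit attempts.
    }

    @Override
    public void checkExit(int status) {
        for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
            if (frame.getClassName().contains("SparkUncaughtExceptionHandler")) {
                // Veto the exit: the JVM stays up and the caller decides.
                throw new SecurityException(
                        "Blocked System.exit(" + status + ") from Spark");
            }
        }
        // Exits not originating from Spark's handler proceed normally.
    }

    public static void main(String[] args) {
        // A checkExit() on a stack with no Spark handler frame passes silently.
        new SparkExitGuard().checkExit(1);
        System.out.println("exit permitted for non-Spark callers");
    }
}
```

At service startup one would call System.setSecurityManager(new SparkExitGuard()); System.exit() attempts from Spark's handler then fail with a SecurityException instead of taking down the service.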
Test-case: https://github.com/javabrett/SPARK-15685 .