Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 4.0.0
Description
Today, it's possible for an exception within a thread in the maintenance pool to cause the entire executor to crash. Here's how:
- An error occurs in a maintenance pool thread
- It gets passed to the maintenance task thread, which `throw`s it
- That gets caught by `onError`, which `.stop()`s the maintenance thread pool
- If any of the maintenance pool threads are waiting on a lock, they will receive an `InterruptedException` (this happens if they are verifying whether their state store instance is active)
- This `InterruptedException` is not `NonFatal`, so it is not caught
- This uncaught exception bubbles all the way to the `SparkUncaughtExceptionHandler`, causing the executor to exit
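The reason the `InterruptedException` escapes can be shown in isolation: Scala's `scala.util.control.NonFatal` extractor deliberately does not match `InterruptedException`, so a handler written as `case NonFatal(e)` lets it through. The snippet below is only a minimal sketch of that behavior; `NonFatalDemo` and `caughtByNonFatal` are hypothetical names and not part of Spark.

```scala
import scala.util.control.NonFatal

object NonFatalDemo {
  // Returns true when a `case NonFatal(_)` handler would catch `t`,
  // false when `t` would propagate past such a handler.
  def caughtByNonFatal(t: Throwable): Boolean =
    try {
      throw t
    } catch {
      case NonFatal(_)  => true
      case _: Throwable => false
    }
}
```

Under this sketch, `NonFatalDemo.caughtByNonFatal(new RuntimeException("x"))` is `true`, while `NonFatalDemo.caughtByNonFatal(new InterruptedException("x"))` is `false`.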
A better fix is to modify the maintenance thread pool so that it only `unload`s the providers that experience errors, rather than stopping the entire thread pool.
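One possible shape for this per-provider isolation is sketched below. It is not Spark's actual implementation: the `Provider` trait, its `doMaintenance` and `unload` methods, and `MaintenanceSketch.runMaintenance` are assumptions made up for illustration. The idea is that each maintenance task handles its own error and unloads only its own provider, so the pool is never stopped and no other thread is interrupted.

```scala
import java.util.concurrent.{Callable, ExecutorService}
import scala.util.control.NonFatal

// Hypothetical stand-in for a state store provider (not Spark's interface).
trait Provider {
  def id: String
  def doMaintenance(): Unit
  def unload(): Unit
}

object MaintenanceSketch {
  // Runs maintenance for each provider on the given pool and returns the ids
  // of the providers that failed and were unloaded. The pool itself is left
  // running, so tasks for healthy providers are never interrupted.
  def runMaintenance(pool: ExecutorService, providers: Seq[Provider]): Set[String] = {
    val futures = providers.map { p =>
      pool.submit(new Callable[Option[String]] {
        def call(): Option[String] =
          try {
            p.doMaintenance()
            None
          } catch {
            case NonFatal(_) =>
              p.unload() // unload only the provider that failed
              Some(p.id)
          }
      })
    }
    // Wait for every task and collect the ids of unloaded providers.
    futures.flatMap(_.get().toList).toSet
  }
}
```

Because the error is handled inside the task, nothing is rethrown to a shared error handler, which is the step that led to the pool being stopped in the sequence described above.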