Spark / SPARK-48997

Maintenance thread pool error should not cause the entire executor to crash

Details

    Description

      Today, it's possible for an exception within a thread in the maintenance pool to cause the entire executor to crash. Here's how:

      1. An error occurs in a maintenance pool thread
      2. It gets passed to the maintenance task thread, which `throw`s it
      3. That gets caught by `onError`, which `.stop()`s the maintenance thread pool
      4. If any of the maintenance pool threads are waiting on a lock, they will receive an `InterruptedException` (this happens if they are verifying whether their state store instance is active)
      5. This `InterruptedException` is not `NonFatal`, so it is not caught
      6. This uncaught exception bubbles all the way to the `SparkUncaughtExceptionHandler`, causing the executor to exit
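
Step 5 depends on how `scala.util.control.NonFatal` classifies throwables. The following is a minimal sketch, not code from Spark; `NonFatalDemo` and `caughtByNonFatalHandler` are hypothetical names used only for illustration. It shows that a handler written as `case NonFatal(e)` catches an ordinary `RuntimeException` but lets an `InterruptedException` through.

```scala
import scala.util.control.NonFatal

object NonFatalDemo {
  // Hypothetical helper: returns true if a `case NonFatal(e)` handler
  // would catch `t`, and false if `t` would escape that handler.
  def caughtByNonFatalHandler(t: Throwable): Boolean =
    try {
      throw t
    } catch {
      case NonFatal(_)  => true   // ordinary errors are matched here
      case _: Throwable => false  // InterruptedException is not NonFatal
    }
}
```

Calling the helper with `new InterruptedException("interrupted")` takes the second branch, which corresponds to the exception escaping a `NonFatal`-only handler and reaching the thread's uncaught exception handler.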

      A better fix is to modify the maintenance thread pool to `unload` only the providers that experience errors, rather than stopping the entire thread pool.
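
The idea of isolating failures per provider could look roughly like the sketch below. This is not the actual Spark change: `MaintainableProvider`, `MaintenanceSketch`, `loadedProviders`, and `runMaintenance` are hypothetical names, and the loop runs sequentially instead of on a thread pool to keep the example small. The point it illustrates is that an error removes only the failing provider and leaves the others loaded.

```scala
import scala.collection.concurrent.TrieMap
import scala.util.control.NonFatal

// Hypothetical stand-in for a state store provider; not Spark's interface.
trait MaintainableProvider {
  def doMaintenance(): Unit
  def close(): Unit
}

object MaintenanceSketch {
  // Hypothetical registry of loaded providers, keyed by an id.
  val loadedProviders = TrieMap.empty[String, MaintainableProvider]

  // Unload a single provider; all other providers are left untouched.
  def unload(id: String): Unit =
    loadedProviders.remove(id).foreach(_.close())

  // Run maintenance for every provider. A failure unloads only the
  // provider that failed, and the loop continues with the rest.
  def runMaintenance(): Unit =
    loadedProviders.foreach { case (id, provider) =>
      try provider.doMaintenance()
      catch {
        case NonFatal(_) => unload(id)
      }
    }
}
```

Because nothing here stops a shared pool, no other maintenance thread would be interrupted while waiting on a lock, which is the trigger for steps 4 through 6 above.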

            People

              neilramaswamy Neil Ramaswamy