Spark / SPARK-48997

Maintenance thread pool error should not cause the entire executor to crash

Details

    Description

      Today, it's possible for an exception within a thread in the maintenance pool to cause the entire executor to crash. Here's how:

      1. An error occurs in a maintenance pool thread
      2. It gets passed to the maintenance task thread, which `throw`s it
      3. That gets caught by `onError`, which `.stop()`s the maintenance thread pool
      4. If any of the maintenance pool threads are waiting on a lock, they receive an `InterruptedException` (this happens when they are verifying whether their state store instance is active)
      5. This `InterruptedException` is not `NonFatal`, so it is not caught
      6. This uncaught exception bubbles all the way to the `SparkUncaughtExceptionHandler`, causing the executor to exit
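
Step 5 depends on how Scala's `scala.util.control.NonFatal` extractor behaves: it deliberately does not match `InterruptedException`, so a handler written as `case NonFatal(e)` lets it through. The following is a minimal, standalone sketch of that behavior. The `NonFatalDemo` object and `caughtByNonFatal` helper are hypothetical names used only for illustration and are not part of Spark.

```scala
import scala.util.control.NonFatal

// Hypothetical demo object; not Spark code.
object NonFatalDemo {
  // Returns true if a `case NonFatal(e)` handler would catch `t`.
  def caughtByNonFatal(t: Throwable): Boolean = t match {
    case NonFatal(_) => true
    case _           => false
  }

  def main(args: Array[String]): Unit = {
    // An ordinary runtime error is matched by NonFatal.
    println(caughtByNonFatal(new RuntimeException("maintenance failed"))) // true
    // InterruptedException is excluded by NonFatal, so it would propagate.
    println(caughtByNonFatal(new InterruptedException("pool stopped")))   // false
  }
}
```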

      A better fix is to modify the maintenance thread pool so that it only `unload`s the providers that experience errors, rather than stopping the entire thread pool.
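
One way to picture the proposed behavior is the sketch below. It is not Spark's implementation: `Provider`, `MaintenanceSketch`, `loaded`, `unload`, and `runMaintenanceRound` are all hypothetical names, and the pool size and timeout are arbitrary. It only illustrates the idea that each task handles its own failure by unloading its provider, so the pool keeps running and no other thread is interrupted.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}
import scala.util.control.NonFatal

// Hypothetical stand-in for a state store provider; not Spark's actual trait.
trait Provider {
  def id: String
  def doMaintenance(): Unit
}

// Hypothetical sketch of per-provider error handling; not Spark code.
object MaintenanceSketch {
  // Registry of currently loaded providers, keyed by id.
  val loaded = new ConcurrentHashMap[String, Provider]()

  // Long-lived pool with daemon threads so it never blocks JVM exit.
  private val pool = Executors.newFixedThreadPool(2, (r: Runnable) => {
    val t = new Thread(r)
    t.setDaemon(true)
    t
  })

  // Drops a single provider from the registry.
  def unload(id: String): Unit = loaded.remove(id)

  // Runs maintenance once for every loaded provider and waits (up to 10s).
  def runMaintenanceRound(): Unit = {
    val providers = loaded.values().toArray(new Array[Provider](0))
    val done = new CountDownLatch(providers.length)
    providers.foreach { p =>
      pool.execute(() => {
        try p.doMaintenance()
        catch {
          // Only the failing provider is unloaded; the pool is never stopped.
          case NonFatal(_) => unload(p.id)
        } finally done.countDown()
      })
    }
    done.await(10, TimeUnit.SECONDS)
  }
}
```

Because the failure is handled inside the task, a later round can still run against the remaining providers on the same pool.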

          People

            neilramaswamy Neil Ramaswamy