Spark / SPARK-24413

Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and no active executors


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

      Currently, with executor blacklisting enabled, dynamic allocation on, and only 1 active executor (the spark.blacklist.killBlacklistedExecutors setting doesn't matter in this case; it can be on or off), a task failure that gets that single executor blacklisted will fail the entire application.  The error you get is something like:

      Aborting TaskSet 0.0 because task 9 (partition 9)
      cannot run anywhere due to node and executor blacklist.

      This is very undesirable behavior: you may have a huge job where one long-tail task happens to hit a bad node, the node gets blacklisted, and the entire job fails.

      Ideally, since dynamic allocation is on, the scheduler should not fail immediately; it should let dynamic allocation try to acquire more executors.
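      The failure mode above can be reproduced with a configuration along these lines (a minimal sketch; the master URL and application jar are placeholders, and an external shuffle service is assumed since dynamic allocation requires one in Spark 2.3):

      ```shell
      # Hypothetical spark-submit invocation for the scenario described:
      # blacklisting on, dynamic allocation on, executor count allowed to sit at 1.
      spark-submit \
        --master yarn \
        --conf spark.blacklist.enabled=true \
        --conf spark.dynamicAllocation.enabled=true \
        --conf spark.dynamicAllocation.minExecutors=0 \
        --conf spark.dynamicAllocation.maxExecutors=1 \
        --conf spark.shuffle.service.enabled=true \
        your-app.jar
      ```

      With maxExecutors=1, a single task failure that blacklists the lone executor leaves no executor the task set can run on, and the scheduler aborts the stage rather than waiting for dynamic allocation to replace it.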


      Attachments

      Issue Links

      Activity

      People

      Assignee: Unassigned
      Reporter: Thomas Graves (tgraves)
      Votes: 0
      Watchers: 2
