Spark / SPARK-19755

Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As a result, the scheduler cannot create an executor after some time.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: Mesos, Scheduler, Spark Core
    • Environment: Mesos, Marathon, Docker (driver and executors are dockerized)

    Description

      When a task fails for any reason, MesosCoarseGrainedSchedulerBackend increments a failure counter for the slave on which that task was running.
      When the counter is >= 2 (MAX_SLAVE_FAILURES), the Mesos slave is excluded.
      Over time the scheduler can no longer create a new executor, because every slave ends up in the blacklist. A task failure is not necessarily related to host health, especially for long-running streaming apps.
      If accepted as a bug: a possible solution is to use spark.blacklist.enabled to make this functionality optional and, if it makes sense, to make MAX_SLAVE_FAILURES configurable as well.
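
      The behaviour described above can be illustrated with a small, self-contained model. This is a sketch only, not Spark's actual code: the class SlaveFailureTracker, its methods, and the simulate helper are hypothetical names, and the way spark.blacklist.enabled would be wired into the Mesos backend is an assumption. Only the threshold of 2 (MAX_SLAVE_FAILURES) and the idea of making the check optional and configurable come from the report.

```python
# Toy model of per-slave failure counting (hypothetical, not Spark source code).

MAX_SLAVE_FAILURES = 2  # threshold quoted in the issue


class SlaveFailureTracker:
    """Counts task failures per slave and decides whether a slave is excluded."""

    def __init__(self, blacklist_enabled=True, max_failures=MAX_SLAVE_FAILURES):
        # blacklist_enabled models the proposed spark.blacklist.enabled gate (assumption).
        self.blacklist_enabled = blacklist_enabled
        # max_failures models a configurable MAX_SLAVE_FAILURES (assumption).
        self.max_failures = max_failures
        self.failures = {}

    def record_task_failure(self, slave_id):
        # Every failure counts against the slave, whatever its cause.
        self.failures[slave_id] = self.failures.get(slave_id, 0) + 1

    def is_excluded(self, slave_id):
        if not self.blacklist_enabled:
            return False
        # Counters are never reset, so an excluded slave stays excluded.
        return self.failures.get(slave_id, 0) >= self.max_failures

    def usable_slaves(self, slave_ids):
        return [s for s in slave_ids if not self.is_excluded(s)]


def simulate(tracker, slave_ids, failure_events):
    """Replay a sequence of task failures and return the slaves still usable."""
    for slave_id in failure_events:
        tracker.record_task_failure(slave_id)
    return tracker.usable_slaves(slave_ids)


if __name__ == "__main__":
    slaves = ["s1", "s2", "s3"]
    # A long-running app eventually sees two unrelated failures on every slave.
    events = ["s1", "s2", "s3", "s1", "s2", "s3"]

    print(simulate(SlaveFailureTracker(), slaves, events))
    print(simulate(SlaveFailureTracker(blacklist_enabled=False), slaves, events))
    print(simulate(SlaveFailureTracker(max_failures=3), slaves, events))
```

      In this model the always-on tracker leaves no usable slaves once each one has failed twice, while disabling the check or raising the threshold keeps all three available. A real fix would probably also need to consider expiring old failures, which the report does not discuss.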


            People

              Assignee: Unassigned
              Reporter: Timur Abakumov (timout)
              Votes: 2
              Watchers: 9
