When a task fails for any reason, MesosCoarseGrainedSchedulerBackend increments the failure counter for the slave on which that task was running.
When the counter is >= 2 (MAX_SLAVE_FAILURES), the Mesos slave is excluded.
Over time, the scheduler cannot create a new executor because every slave is in the blacklist. A task failure is not necessarily related to host health, especially for long-running streaming apps.
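
To illustrate the behavior described above, here is a minimal simulation in plain Python. It is not Spark's code: the class SlaveFailureTracker, its methods, and the slave IDs are hypothetical, and it assumes the counters are never reset, which is what "over time" suggests. Only the threshold name MAX_SLAVE_FAILURES and its value of 2 come from this report.

```python
from collections import defaultdict

MAX_SLAVE_FAILURES = 2  # threshold named in this report


class SlaveFailureTracker:
    """Hypothetical stand-in for the per-slave failure bookkeeping."""

    def __init__(self, slave_ids):
        self.slave_ids = list(slave_ids)
        self.task_failures = defaultdict(int)

    def record_task_failure(self, slave_id):
        # Every failure counts against the slave, whatever its cause.
        # Assumption: the counter is never reset.
        self.task_failures[slave_id] += 1

    def is_excluded(self, slave_id):
        return self.task_failures[slave_id] >= MAX_SLAVE_FAILURES

    def usable_slaves(self):
        return [s for s in self.slave_ids if not self.is_excluded(s)]


tracker = SlaveFailureTracker(["slave-1", "slave-2", "slave-3"])

# A long-running app sees occasional failures unrelated to host health.
for slave_id in ["slave-1", "slave-2", "slave-1", "slave-3", "slave-2", "slave-3"]:
    tracker.record_task_failure(slave_id)

# Each slave has now failed twice, so none is left to host a new executor.
print(tracker.usable_slaves())
```

With only two failures per slave, the list of usable slaves is empty, which matches the symptom reported here.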
If this is accepted as a bug, a possible solution is to use spark.blacklist.enabled to make this functionality optional and, if it makes sense, to make MAX_SLAVE_FAILURES configurable as well.
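
A rough sketch of how the proposed check could look, again in plain Python rather than Spark's code. The function is_slave_excluded and the plain dict standing in for the Spark configuration are hypothetical. The key spark.mesos.maxSlaveFailures is an invented name for illustration only, and treating an unset spark.blacklist.enabled as "false" is an assumption of this sketch.

```python
DEFAULT_MAX_SLAVE_FAILURES = 2  # current hard-coded value per this report


def is_slave_excluded(task_failures, conf):
    """Decide whether a slave should be excluded.

    `conf` is a plain dict standing in for Spark configuration (hypothetical).
    """
    # Assumption: exclusion is skipped unless explicitly enabled.
    enabled = str(conf.get("spark.blacklist.enabled", "false")).lower() == "true"
    if not enabled:
        return False
    # "spark.mesos.maxSlaveFailures" is an invented key, not an existing setting.
    max_failures = int(
        conf.get("spark.mesos.maxSlaveFailures", DEFAULT_MAX_SLAVE_FAILURES)
    )
    return task_failures >= max_failures


# Exclusion disabled: failures never exclude a slave.
print(is_slave_excluded(5, {}))
# Exclusion enabled with the default threshold.
print(is_slave_excluded(2, {"spark.blacklist.enabled": "true"}))
```

Gating the check on one flag keeps the current behavior available for users who want it, while long-running apps can opt out or raise the threshold.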