I ran into a deadlock dealing with blacklisting where the application was stuck in a state that it thought it had executors to schedule on but it really didn't because those executors were dead. So all executors were blacklisted and it couldn't run any tasks.
Note for this to happen the application has to be in a state where many tasks are going to fail (at least enough tasks fail to cause all executors to be blacklisted) and an executor had to get a heartbeat timeout.
Small disclaimer in that I ran into this on spark 2.1 with the blacklisting feature backported. Looking at the code for master branch I don't see this fixed though.
Normally there is code TaskSetManager. abortIfCompletelyBlacklisted that prevents the app from becoming deadlocked when no executors are available. Unfortunately this did not take affect because in this case a dead executor was in TaskSchedulerImpl.hostToExecutors and it wasn't a blacklisted host because it had a heartbeat timeout so it thought it still had executors to run on.
Now how it got into this situation:
The HeartBeatReceiver was called because it timed out a based on no heartbeat. This called TaskSchedulerImpl.executorLost, this called TaskSchedulerImpl.removeExecutor which removed it from the executorIdToRunningTaskIds and hostToExecutors. It does other handling here and sends a CompletionEvent in DAG.taskEnded. TaskSchedulerImpl.removeExecutor returns and TaskSchedulerImpl.executorLost calls into DAG.executorLost and then CoarseGrainedSchedulerBackend.reviveOffers which both send messages async. This returns and HeartBeatReceiver then calls sc.killAndReplaceExecutor asynchronously.
Now there are a bunch of messages sent which are being processed by other threads. You can see the order in which things ran from the logs below. In this situation, the CoarseGrainedSchedulerBakcend.makeOffers is called before sc.killAndReplaceExecutor can add the host to executorsPendingToRemove. CoarseGrainedSchedulerBackend.makeOffers uses CoarseGrainedSchedulerBackend.executorDataMap and executorsPendingToRemove to check to see if the executor exists and it is alive. At this point its in the executorDataMap and its not Pending to remove. Since its still there it calls into TaskSchedulerImpl.resourceOffers which adds the executor back into executorIdToRunningTaskIds and hostToExecutors. There is nothing else after this to remove it from hostToExecutors so the TaskSetManager. abortIfCompletelyBlacklisted routine always sees an executor that is actually dead.
While looking at this I also found a bug where it doesn't remove the executor from the CoarseGrainedSchedulerBackend.executorDataMap as well. Normally the call to CoarseGrainedSchedulerBackend.removeExecutor would remove that but since it was a heartbeat timeout this is never called. The only way that would have helped in this situation is if it was removed from executorDataMap before the called to TaskSchedulerImpl.resourceOffers -> CoarseGrainedSchedulerBackend.makeOffers, in which case it would have never been added back to executorIdToRunningTaskIds and hostToExecutors. Note other stats like totalCoreCount and totalRegisteredExecutors will also be off since removeExecutor is not called.
Below are the logs that show the race. Notice how the "Requesting to kill executor(s) 137" came in the middle of the removal which is in CoarseGrainedSchedulerBackend.killExecutors and but it doesn't actually run through to add it to executorsPendingToRemove "Actual list of executor(s) to be killed is 137" until after it was added back to the hostToExecutors list ("Host added was in lost list earlier: foo1.com")
2017-11-14 15:56:44,376 [task-result-getter-0] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 3203.0 in stage 7.0 (TID 11011) in 1396594 ms on foo3.com (executor 8) (2971/4004)
2017-11-14 15:56:50,729 [dispatcher-event-loop-23] WARN org.apache.spark.HeartbeatReceiver - Removing executor 137 with no recent heartbeats: 639438 ms exceeds timeout 600000 ms
2017-11-14 15:56:50,731 [dispatcher-event-loop-23] ERROR org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost executor 137 on foo1.com: Executor heartbeat timed out after 639438 ms
2017-11-14 15:56:50,773 [dispatcher-event-loop-23] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 2696.0 in stage 7.0 (TID 11061, foo1.com, executor 137): ExecutorLostFailure (executor 137 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 639438 ms
2017-11-14 15:56:50,778 [dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Executor lost: 137 (epoch 0)
2017-11-14 15:56:50,779 [dispatcher-event-loop-20] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove executor 137 from BlockManagerMaster.
2017-11-14 15:56:50,779 [kill-executor-thread] INFO org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Requesting to kill executor(s) 137
2017-11-14 15:56:50,780 [dispatcher-event-loop-20] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager BlockManagerId(137, foo1.com, 42569, None)
2017-11-14 15:56:50,780 [dag-scheduler-event-loop] INFO org.apache.spark.storage.BlockManagerMaster - Removed 137 successfully in removeExecutor
2017-11-14 15:56:50,781 [dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Host added was in lost list earlier: foo1.com
2017-11-14 15:56:50,782 [kill-executor-thread] INFO org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Actual list of executor(s) to be killed is 137
2017-11-14 15:56:50,827 [dispatcher-event-loop-6] INFO org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver requested to kill executor(s) 137.
2017-11-14 15:56:51,616 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor container(s), each with 10 core(s) and 15360 MB memory (including 8192 MB of overhead)
Note there may be another bug here with speculation as well because some time later I see this same host in a log message that makes it sound like it thinks the task is still running on that dead host:
2017-11-14 16:46:40,970 [task-scheduler-speculation] INFO org.apache.spark.scheduler.TaskSetManager - Marking task 2696 in stage 7.0 (on gsbl560n05.blue.ygrid.yahoo.com) as speculatable because it ran more than 2254168 ms
We can split this off into another jira if we find its actually a bug. It looks like the issue there is the TaskSetManager.copiesRunning is not decremented because we are using external shuffle service but needs further investigation.