Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 2.4.5
- Fix Version/s: None
- Component/s: None
Description
Background:
I have a standalone-mode cluster (2.4.5) orchestrated by Nomad jobs running
on EC2. We deploy a Scala web server as a long-running JAR via spark-submit
in client mode. Sometimes the application ends up with 0 cores because our
in-house autoscaler scales down and kills workers without checking whether
any of a worker's cores are allocated to existing applications. The
application is then left with 0 cores even though there are healthy workers
in the cluster.
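(For context, we are also fixing the autoscaler itself. The check it was
missing can be made against the standalone master's web UI, which reports
per-worker core usage at its /json endpoint. Below is a rough sketch, not
our actual implementation: the /json endpoint and its workers/coresused
fields exist in 2.4.x, but WorkerGuard, its method, and the URL value are
hypothetical, and it assumes json4s on the classpath.)

{code:scala}
import scala.io.Source

import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse

// Sketch of the missing guard: before terminating an EC2 instance, ask the
// standalone master whether the worker on it still has cores allocated.
object WorkerGuard {
  implicit val formats: DefaultFormats.type = DefaultFormats

  // Subset of the per-worker fields exposed by the master's /json endpoint.
  case class WorkerSummary(id: String, coresused: Int)

  // masterUiUrl is e.g. "http://<master-host>:8080" (hypothetical value).
  def hasAllocatedCores(masterUiUrl: String, workerId: String): Boolean = {
    val body = Source.fromURL(s"$masterUiUrl/json").mkString
    val workers = (parse(body) \ "workers").extract[List[WorkerSummary]]
    workers.exists(w => w.id == workerId && w.coresused > 0)
  }
}
{code}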
However, even with healthy workers available, the master only reallocates
cores to the application if I submit a new application or register a new
worker in the cluster. This is problematic because the long-running 0-core
application is stuck until then.
Could this be related to the fact that schedule() is only triggered by new
workers / new applications, as the comment here suggests?
https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L721-L724
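For reference, the comment on schedule() reads as follows, and the
RegisterApplication handler does end in a schedule() call, which would
explain why submitting a new application un-sticks the 0-core one (abridged
from the linked Master.scala; logging and a TODO trimmed):

{code:scala}
/**
 * Schedule the currently available resources among waiting apps. This method will be called
 * every time a new app joins or resource availability changes.
 */
private def schedule(): Unit = { ... }

// One of the paths that triggers it: registering a new application.
case RegisterApplication(description, driver) =>
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    val app = createApplication(description, driver)
    registerApplication(app)
    persistenceEngine.addApplication(app)
    driver.send(RegisteredApplication(app.id, self))
    schedule()
  }
{code}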
If that is the case, should the master also be calling schedule() after
removing workers in timeOutDeadWorkers()?
https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L417
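Something along these lines is what I have in mind: the body of
timeOutDeadWorkers() as it stands in 2.4.5 (abridged), with a proposed
schedule() call appended once any workers have actually been removed. This
is a sketch, not a tested patch:

{code:scala}
private def timeOutDeadWorkers() {
  // Copy the workers into an array so we don't modify the hashset while iterating through it
  val currentTime = System.currentTimeMillis()
  val toRemove = workers.filter(_.lastHeartbeat < currentTime - workerTimeoutMs).toArray
  var removedAny = false
  for (worker <- toRemove) {
    if (worker.state != WorkerState.DEAD) {
      logWarning("Removing %s because we got no heartbeat in %d seconds".format(
        worker.id, workerTimeoutMs / 1000))
      removeWorker(worker, s"Not receiving heartbeat for ${workerTimeoutMs / 1000} seconds")
      removedAny = true
    } else {
      if (worker.lastHeartbeat < currentTime - ((reaperIterations + 1) * workerTimeoutMs)) {
        workers -= worker // we've seen this DEAD worker long enough; cull it
      }
    }
  }
  // Proposed addition: resource availability just changed, so give waiting
  // apps (including ones stuck at 0 cores) a chance to pick up cores on the
  // remaining healthy workers.
  if (removedAny) {
    schedule()
  }
}
{code}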
The downscaling produces the log lines below, so I am fairly certain
timeOutDeadWorkers() is being called:
20/06/08 11:40:56 INFO Master: Application app-20200608114056-0006 requested to set total executors to 1.
20/06/08 11:40:56 INFO Master: Launching executor app-20200608114056-0006/0 on worker worker-20200608113523-<IP_ADDRESS>-7077
20/06/08 11:41:44 WARN Master: Removing worker-20200608113523-<IP_ADDRESS>-7077 because we got no heartbeat in 60 seconds
20/06/08 11:41:44 INFO Master: Removing worker worker-20200608113523-<IP_ADDRESS>-7077 on <IP_ADDRESS>:7077
20/06/08 11:41:44 INFO Master: Telling app of lost executor: 0
20/06/08 11:41:44 INFO Master: Telling app of lost worker: worker-20200608113523-10.158.242.213-7077