Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.2.0, 3.2.1, 3.3.0
Labels: None
Description
Since the current timedOutExecutors implementation returns its result in a non-deterministic order, executors are killed in a random order under the Dynamic Allocation setting.
spark/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
private val executors = new ConcurrentHashMap[String, Tracker]()
...
timedOutExecs = executors.asScala
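
For illustration, a minimal, self-contained sketch of why iterating over a ConcurrentHashMap gives no guaranteed order (the Tracker case class and IterationOrderDemo object below are simplified stand-ins, not the actual ExecutorMonitor internals):

import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

object IterationOrderDemo {
  // Simplified stand-in for the per-executor state kept by ExecutorMonitor.
  final case class Tracker(idleStart: Long)

  def main(args: Array[String]): Unit = {
    val executors = new ConcurrentHashMap[String, Tracker]()
    executors.put("1", Tracker(idleStart = 100L))
    executors.put("2", Tracker(idleStart = 200L))

    // ConcurrentHashMap makes no ordering guarantee, so iterating over the
    // wrapped map may yield ("1", "2") or ("2", "1") depending on hashing
    // and the internal table layout.
    val timedOutExecs = executors.asScala.keys.toSeq
    println(timedOutExecs)
  }
}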
This random behavior not only confuses users but also makes the K8s decommission tests flaky, as in the following case with Java 17 on an Apple Silicon environment. The K8s test expects executor 1 to be decommissioned, but executor 2 is chosen this time.
22/01/25 06:11:16 DEBUG ExecutorMonitor: Executors 1,2 do not have active shuffle data after job 0 finished.
22/01/25 06:11:16 DEBUG ExecutorAllocationManager: max needed for rpId: 0 numpending: 0, tasksperexecutor: 1
22/01/25 06:11:16 DEBUG ExecutorAllocationManager: No change in number of executors
22/01/25 06:11:16 DEBUG ExecutorAllocationManager: Request to remove executorIds: (2,0), (1,0)
22/01/25 06:11:16 DEBUG ExecutorAllocationManager: Not removing idle executor 1 because there are only 1 executor(s) left (minimum number of executor limit 1)
22/01/25 06:11:16 INFO KubernetesClusterSchedulerBackend: Decommission executors: 2
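
For reference, a minimal sketch of one way to make the result deterministic, by sorting the timed-out executor ids before they are handed to the allocation manager. This is only an illustration of the idea (the DeterministicTimedOut object and its timedOutExecutors helper are hypothetical), not necessarily the change that resolved this ticket:

import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

object DeterministicTimedOut {
  // Simplified stand-in for ExecutorMonitor's Tracker.
  final case class Tracker(timedOut: Boolean)

  // Hypothetical helper: collect timed-out executor ids and sort them so
  // Dynamic Allocation always sees the candidates in the same order.
  def timedOutExecutors(executors: ConcurrentHashMap[String, Tracker]): Seq[String] =
    executors.asScala
      .collect { case (id, tracker) if tracker.timedOut => id }
      .toSeq
      .sorted
}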