Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.1.0
-
None
Description
The ExecutorAllocationManager.reset method is called when re-registering AM, which sets the ExecutorAllocationManager.initializing field true. When this field is true, the Driver does not start a new executor from the AM request. The following two cases will cause the field to False
1. A executor idle for some time.
2. There are new stages to be submitted
After the a stage was submitted, the AM was killed and restart ,the above two cases will not appear.
1. When AM is killed, the yarn will kill all running containers. All execuotr will be lost and no executor will be idle.
2. No surviving executor, resulting in the current stage will never be completed, DAG will not submit a new stage.
Reproduction steps:
1. Start cluster
echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | ./bin/spark-shell --master yarn-client --executor-cores 1 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=2
2. Kill the AM process when a stage is scheduled.