Status: Resolved
Resolution: Fixed
The YARN Capacity Scheduler does not trigger preemption in the scenario below.
Two queues, A and B, each have 50% capacity, 100% maximum capacity, and a user limit factor of 2. The minimum container size is 1536MB and the total cluster resource is 40GB. Submit a first job to queue A that needs 1536MB for the AM and 9 task containers of 4.5GB each. The job gets 8 containers in total (AM 1536MB + 7 * 4.5GB = 33GB) and reserves one more 4.5GB container. Cluster usage is reported as 93.8%, which matches the 33GB allocated plus the 4.5GB reservation (37.5GB of 40GB).
Now submit a second job (1536MB for the AM and 9 task containers of 4.5GB each) to queue B. The job stays in the ACCEPTED state forever and the RM scheduler never triggers preemption. (RM UI Image 2 attached)
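The figures in the scenario can be checked with a little arithmetic. The sketch below is only an illustration: the variable names are made up for this example, and reading the 93.8% figure as "allocated plus reserved" is an assumption based on the numbers matching, not something taken from scheduler output.

```python
# Scenario figures from the description, all in MB.
CLUSTER_MB = 40 * 1024
AM_MB = 1536
TASK_MB = 4608              # 4.5GB per task container
RUNNING_TASKS = 7           # task containers actually allocated to the first job
RESERVED_TASKS = 1          # one more task container is reserved, not allocated

allocated_mb = AM_MB + RUNNING_TASKS * TASK_MB
reserved_mb = RESERVED_TASKS * TASK_MB
counted_mb = allocated_mb + reserved_mb      # assumption: usage includes the reservation

# Each queue is configured with 50% capacity.
queue_guarantee_mb = CLUSTER_MB // 2
over_guarantee_mb = allocated_mb - queue_guarantee_mb

print(f"allocated to queue A:   {allocated_mb / 1024:.1f} GB")
print(f"allocated + reserved:   {counted_mb / 1024:.1f} GB "
      f"({100 * counted_mb / CLUSTER_MB:.2f}% of the cluster)")
print(f"queue A over guarantee: {over_guarantee_mb / 1024:.1f} GB")
```

Under these numbers queue A holds well over its 50% share while queue B holds nothing, which is the situation in which preemption would be expected to start.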
Test Case:
./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --queue A --executor-memory 4G --executor-cores 4 --num-executors 9 ../lib/spark-examples*.jar 1000000
After a minute:
./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --queue B --executor-memory 4G --executor-cores 4 --num-executors 9 ../lib/spark-examples*.jar 1000000
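The commands above request 4G executors, while the scenario speaks of 4.5GB task containers and a 1536MB AM. A rough sketch of how those sizes could arise is shown below. It assumes a 384MB Spark-on-YARN memory overhead, a 512MB AM in yarn-client mode, and that requests are rounded up to a multiple of the minimum container size; none of these values is stated in this report, and `container_mb` is a hypothetical helper, not a YARN or Spark API.

```python
import math

MIN_ALLOCATION_MB = 1536     # minimum container size from the scenario
ASSUMED_OVERHEAD_MB = 384    # assumption: per-container memory overhead added by Spark
ASSUMED_AM_MB = 512          # assumption: AM memory in yarn-client mode


def container_mb(requested_mb, overhead_mb=ASSUMED_OVERHEAD_MB,
                 step_mb=MIN_ALLOCATION_MB):
    """Round (request + overhead) up to the next multiple of the minimum allocation."""
    return math.ceil((requested_mb + overhead_mb) / step_mb) * step_mb


executor_container_mb = container_mb(4 * 1024)   # --executor-memory 4G
am_container_mb = container_mb(ASSUMED_AM_MB)

print(f"executor container: {executor_container_mb} MB "
      f"({executor_container_mb / 1024:.1f} GB)")
print(f"AM container:       {am_container_mb} MB")
```

With these assumptions the sizes come out as 4608MB (4.5GB) per executor and 1536MB for the AM, in line with the figures in the scenario.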
Credit to: [~Prabhu Joseph] for bug investigation and troubleshooting.