root (cluster: 1000GB, 1000 vcores)
- q1 (maxResources: 10GB, 10 vcores)
  - q1.1 (weight: 1)
  - q1.2 (weight: 9)
- q2 (no maxResources)
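For context, the hierarchy above corresponds roughly to a fair-scheduler.xml allocation file like the following (a sketch only; the exact queue names are illustrative, since "." is the queue-path separator and may not be usable inside a queue name):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- q1 is capped well below cluster capacity -->
  <queue name="q1">
    <maxResources>10240 mb, 10 vcores</maxResources>
    <queue name="q1_1">
      <weight>1.0</weight>
    </queue>
    <queue name="q1_2">
      <weight>9.0</weight>
    </queue>
  </queue>
  <!-- q2 has no max, so it can absorb the rest of the cluster -->
  <queue name="q2"/>
</allocations>
```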
- app1 with a demand of 100GB/100 vcores is submitted to q1.1 => it gets 10GB/10 vcores
- q1 reaches its max
- app2 with a demand of 1000GB/1000 vcores is submitted to q2 => it gets 990GB/990 vcores
- the cluster now runs at 100% capacity
- app3 with a demand of 100GB/100 vcores is submitted to q1.2 => ...
Expected behavior: fair-share preemption preempts containers from app1 (q1.1) so that app3 (q1.2) gets 9GB/9 vcores according to the weights.
Observed behavior: app3 is starving.
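The expected 9GB/9 vcores figure is just q1's max split by the configured weights; a quick sketch (a hypothetical helper, not YARN code):

```python
# Weighted fair share of a capped parent queue among its children.
def fair_shares(parent_cap, weights):
    total = sum(weights)
    return [parent_cap * w / total for w in weights]

# q1 is capped at 10 (GB or vcores); q1.1 has weight 1, q1.2 has weight 9.
shares = fair_shares(10, [1, 9])
print(shares)  # [1.0, 9.0] -> app3 in q1.2 should end up with 9GB/9 vcores
```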
- We do see some preemption from app2 (q2) that matches app3's starvation (9GB/9 vcores in this case). This may suggest containers are preempted from app2 on behalf of app3, but app3 can't use the preempted containers due to this check. Also, if the container to preempt is chosen at random, it is far more likely to be preempted from app2 than from app1 because of app2's much larger allocation.
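To illustrate the last point (an assumption: uniform container sizes of 1GB/1 vcore and a uniformly random victim, which is a simplification of the actual selection logic):

```python
# app1 holds 10GB/10 vcores in q1.1; app2 holds 990GB/990 vcores in q2.
app1_containers = 10
app2_containers = 990

# Probability a uniformly random victim container belongs to app2.
p_app2 = app2_containers / (app1_containers + app2_containers)
print(f"{p_app2:.0%}")  # 99%
```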
- Removing the max on q1 resolves the issue, but we need to keep the max in place.
- This is an oversimplified version of our production setup. I can provide more details if needed.
- I have a heap dump of the issue that I can't share because of our policy, but I can look up specific state if needed.
- My co-worker reported YARN-11171 for the same issue; please feel free to close it as a duplicate.