- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Component/s: resourcemanager
- Labels: None
- Target Version/s:
We have seen two situations in production that can cause deadlocks and bring the scheduling of new containers to a halt, particularly for applications with a large number of live containers:
- When these applications release their containers in bulk.
- When these applications terminate abruptly due to some failure, and the scheduler releases all of their live containers in a loop.
To handle both situations, we run a patch in production that makes ALL container releases happen asynchronously, and it has served us well.
Opening this JIRA to gather feedback on whether this is a good idea in general (cc Wangda Tan, Jason Darrell Lowe, Carlo Curino, Karthik Kambatla, Subramaniam Krishnan, Roni Burd)
BTW, in YARN-6251 we already added an asyncReleaseContainer() to the AbstractYarnScheduler, along with a corresponding scheduler event. It is currently used only for the container-update code paths (where the scheduler releases the temporary containers it creates for the update).
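
For discussion, here is a minimal, self-contained sketch of the general idea: callers only enqueue a release event, and a dispatcher handles each event later while holding the scheduler lock briefly. This is not the production patch and not the YARN-6251 code. Every name in it (AsyncReleaseSketch, Scheduler, ReleaseEvent, releaseAllAsync, drain) is hypothetical, and the real AbstractYarnScheduler API may look quite different.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: hypothetical names, not actual YARN classes.
class AsyncReleaseSketch {

    /** Hypothetical stand-in for a scheduler event that releases one container. */
    static final class ReleaseEvent {
        final String containerId;
        ReleaseEvent(String containerId) { this.containerId = containerId; }
    }

    /** Toy scheduler that tracks live containers by id. */
    static final class Scheduler {
        private final Object lock = new Object();
        private final Set<String> liveContainers = new HashSet<>();
        private final BlockingQueue<ReleaseEvent> events = new LinkedBlockingQueue<>();

        void addContainer(String id) {
            synchronized (lock) { liveContainers.add(id); }
        }

        // The caller never takes the scheduler lock; it only enqueues an event.
        void asyncRelease(String id) {
            events.add(new ReleaseEvent(id));
        }

        // Bulk release (or app failure): one event per container,
        // instead of releasing all of them in a loop under the lock.
        void releaseAllAsync(List<String> ids) {
            for (String id : ids) {
                asyncRelease(id);
            }
        }

        // In a real system a dispatcher thread would run this.
        // The lock is taken per event, so other work can interleave.
        int drain() {
            int handled = 0;
            ReleaseEvent e;
            while ((e = events.poll()) != null) {
                synchronized (lock) { liveContainers.remove(e.containerId); }
                handled++;
            }
            return handled;
        }

        int liveCount() {
            synchronized (lock) { return liveContainers.size(); }
        }

        int pendingCount() {
            return events.size();
        }
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        List<String> ids = Arrays.asList("c1", "c2", "c3");
        for (String id : ids) {
            scheduler.addContainer(id);
        }
        scheduler.releaseAllAsync(ids);   // returns immediately
        System.out.println("pending=" + scheduler.pendingCount()
            + " live=" + scheduler.liveCount());
        int handled = scheduler.drain();  // dispatcher catches up later
        System.out.println("handled=" + handled + " live=" + scheduler.liveCount());
    }
}
```

The sketch drains the queue on the calling thread to keep it deterministic. Whether per-event locking is the right granularity for the real scheduler is exactly the kind of feedback this JIRA is asking for.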