Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.1.2
- Fix Version/s: None
Description
YARN-10467 fixed a ContainerIdPBImpl object leak in RMNodeImpl.completedContainers.
After applying the YARN-10467 patch and operating a cluster with a large number of nodes, we found that a similar heap leak still exists.
In a heap dump taken after failover (so the dumped RM was no longer the active RM), about 4.5 GB was held by ContainerIdPBImpl objects in RMNodeImpl.completedContainers.
There are two cases.
1. Containers of apps with 'KeepContainersAcrossApplicationAttempts' are not cleared when the app fails
Even when 'KeepContainersAcrossApplicationAttempts' is set, we should clear RMAppAttemptImpl.justFinishedContainers.
If an app attempt fails and is retried by a next attempt, we may not need to clear RMAppAttemptImpl.justFinishedContainers, because the related ContainerIdPBImpl objects are handed over to the next attempt and eventually cleared.
However, when the app itself fails, there is no next attempt and the heap leak occurs.
(We found this case when a YARN Service application failed over multiple attempts because of an OOM in the AM.)
2. App is killed explicitly by a user
When an app is killed by a user through the 'yarn application -kill' CLI or the Web UI, RMAppAttemptImpl.amContainerFinished is not called because the app and app attempt states have already changed.
To handle this, we added a sendFinishedContainersToNMs call for both RMAppAttemptImpl.finishedContainersSentToAM and RMAppAttemptImpl.justFinishedContainers when the attempt is set to 'KILLED'.
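The two cases can be illustrated with a minimal, self-contained sketch. It is not the attached patch and does not use the real YARN classes: Node, Attempt, and every method name below are hypothetical stand-ins for RMNodeImpl and RMAppAttemptImpl, and container IDs are plain strings. It only assumes that a node keeps a completed container ID until some attempt tells it to drop it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the leak; not the real ResourceManager code.
final class CompletedContainersLeakSketch {

  /** Stand-in for RMNodeImpl: keeps completed container IDs until they are released. */
  static final class Node {
    final Set<String> completedContainers = new HashSet<>();
  }

  /** Stand-in for RMAppAttemptImpl with its two tracking maps. */
  static final class Attempt {
    final Map<Node, List<String>> justFinishedContainers = new HashMap<>();
    final Map<Node, List<String>> finishedContainersSentToAM = new HashMap<>();

    /** A container finished: the node and the attempt both remember its ID. */
    void containerFinished(Node node, String containerId) {
      node.completedContainers.add(containerId);
      justFinishedContainers.computeIfAbsent(node, n -> new ArrayList<>()).add(containerId);
    }

    /** The AM pulled finished containers: release the earlier batch, track the new one. */
    void amPulledFinishedContainers() {
      releaseOnNodes(finishedContainersSentToAM);
      finishedContainersSentToAM.putAll(justFinishedContainers);
      justFinishedContainers.clear();
    }

    /** Case 1: hand over to the next attempt if there is one, otherwise release. */
    void failed(Attempt nextAttempt) {
      if (nextAttempt != null) {
        moveInto(finishedContainersSentToAM, nextAttempt.finishedContainersSentToAM);
        moveInto(justFinishedContainers, nextAttempt.justFinishedContainers);
      } else {
        releaseAll(); // no next attempt will ever clear these IDs
      }
    }

    /** Case 2: a killed attempt releases everything it still tracks. */
    void killed() {
      releaseAll();
    }

    private void releaseAll() {
      releaseOnNodes(finishedContainersSentToAM);
      releaseOnNodes(justFinishedContainers);
    }

    /** Tells each node to drop the given container IDs, then forgets them. */
    private static void releaseOnNodes(Map<Node, List<String>> tracked) {
      for (Map.Entry<Node, List<String>> e : tracked.entrySet()) {
        e.getKey().completedContainers.removeAll(e.getValue());
      }
      tracked.clear();
    }

    /** Moves all tracked IDs from one map into another. */
    private static void moveInto(Map<Node, List<String>> from, Map<Node, List<String>> to) {
      for (Map.Entry<Node, List<String>> e : from.entrySet()) {
        to.computeIfAbsent(e.getKey(), n -> new ArrayList<>()).addAll(e.getValue());
      }
      from.clear();
    }
  }
}
```

In this model, removing the releaseAll() call from either failed(null) or killed() leaves the container IDs in Node.completedContainers indefinitely, which is the kind of growth seen in the heap dump.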
We found and patched this on our 3.1.2 cluster, but trunk appears to have the same problem.
I have attached a patch based on trunk.
Thanks!
Attachments
Issue Links
- links to