[~billie.rinaldi] raised a very important point on a potential memory issue in
I wanted to capture her point and my initial thoughts on it. Let's use this JIRA to discuss the topic further and find the best solution.
Billie's question: Do you think this will cause memory issues for long-lived AMs?
Gour's initial thoughts: I agree with you that any list that only grows over time is a concern for possible memory issues. However, I checked the size of a single container diagnostics payload and it is typically between 4 and 5 KB. So for about 100,000 containers it will end up consuming roughly 400-500 MB. This is at the borderline of acceptability for a 1 GB AM container. However, in most production clusters I have seen, the minimum container size is set to 4 GB or higher. Either way, 100K containers for a single app (even one running for years) is very unlikely, but not impossible.
We can do a couple of things here:
1) Provide an API that can be triggered to drop the container diagnostics of all old/dead containers except the n most recent ones (n can be passed as a parameter to the API).
2) Add logic so that the AM caps the number of old/dead containers it tracks at a limit of, say, 10,000 (configurable per application).
Nevertheless, if an app is created with 100K+ containers we can still be hosed, but here we are stretching our imaginations too much. Anyway, I don't think we should use this patch to solve this. I am going to create a new sub-task for this possible memory issue.
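To make the two options concrete, here is a rough sketch of how they could look. It is only an illustration under assumptions: the class name CompletedContainerDiagnostics, its methods, and the use of plain strings for container IDs and payloads are all hypothetical and do not correspond to anything in the existing AM code. At 4-5 KB per payload, a cap of 10,000 would bound the retained diagnostics at roughly 40-50 MB.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

/**
 * Hypothetical store for diagnostics of old/dead containers.
 * Illustrative only; none of these names exist in the AM today.
 */
class CompletedContainerDiagnostics {
  // Assumed per-application limit (option 2), e.g. 10,000.
  private final int maxRetained;

  // Insertion-ordered map: the first entry is the oldest recorded container.
  private final LinkedHashMap<String, String> diagnostics = new LinkedHashMap<>();

  CompletedContainerDiagnostics(int maxRetained) {
    if (maxRetained < 0) {
      throw new IllegalArgumentException("maxRetained must be >= 0");
    }
    this.maxRetained = maxRetained;
  }

  /** Option 2: record a payload and evict the oldest entries beyond the cap. */
  synchronized void record(String containerId, String payload) {
    // Remove first so that a re-recorded container moves to the "most recent" end.
    diagnostics.remove(containerId);
    diagnostics.put(containerId, payload);
    trimTo(maxRetained);
  }

  /** Option 1: drop everything except the n most recent entries; returns how many were dropped. */
  synchronized int retainMostRecent(int n) {
    if (n < 0) {
      throw new IllegalArgumentException("n must be >= 0");
    }
    return trimTo(n);
  }

  synchronized int size() {
    return diagnostics.size();
  }

  synchronized boolean contains(String containerId) {
    return diagnostics.containsKey(containerId);
  }

  // Removes entries from the oldest end until at most 'limit' remain.
  private int trimTo(int limit) {
    int dropped = 0;
    Iterator<String> it = diagnostics.keySet().iterator();
    while (diagnostics.size() > limit && it.hasNext()) {
      it.next();
      it.remove();
      dropped++;
    }
    return dropped;
  }
}
```

With something like this, the cap would keep memory bounded automatically, and retainMostRecent(n) could back an on-demand API for operators who want to free memory sooner. Note that neither option helps if a single app has 100K+ live containers, since only old/dead containers are trimmed.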