My concern is that if we don't fix the root cause, then although we've protected ourselves from crashes, we'd just be queueing up a lot of aggregation processes and causing long waiting times.
Agreed. We do see the NM log aggregation service launch many active threads, which hold a large number of TCP connections to the DN and exhaust our system's file-descriptor limit. We can bound the shared thread pool here, but the TCP connection problem may not be solved by this patch.
Upon restart, the NM will try to recover all applications and submit a log aggregation task to the thread pool for each application recovered. Therefore, a large number of recovered applications, plus concurrently running applications, can cause the thread pool to grow without bound.
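For context, the unbounded growth happens when one aggregator task is submitted per recovered application to a cached-style pool, which spawns a new thread whenever no idle one is available. Below is a minimal standalone sketch of the difference between that and a bounded fixed-size pool; all names are hypothetical and this is not the actual LogAggregationService code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedAggregationPoolSketch {

  // Unbounded: a cached pool creates a new thread for every submitted
  // task when no idle thread exists, so N recovered applications can
  // mean N concurrent aggregator threads (and their TCP connections).
  static ExecutorService unboundedPool() {
    return Executors.newCachedThreadPool();
  }

  // Bounded: a fixed-size pool runs at most poolSize aggregators at
  // once; excess tasks wait in the queue instead of spawning threads.
  static ExecutorService boundedPool(int poolSize) {
    return new ThreadPoolExecutor(
        poolSize, poolSize,
        0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = boundedPool(30); // hypothetical limit
    // Simulate NM recovery submitting one task per recovered app.
    for (int app = 0; app < 10_000; app++) {
      pool.execute(() -> {
        // stand-in for one application's log upload work
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }
}
```

With the bounded pool, the 10,000 simulated tasks still queue up (hence the waiting-time concern above), but thread count and file descriptors stay capped.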
Are all these applications active ones, or already finished? I suspect we are leaking finished applications in the NM state store during the recovery process. I noticed this issue when filing YARN-4325, but I lost my progress as the previous long-running cluster is gone. Haibo Chen, could you check if your case is the same here?
In general, I think the fix on this JIRA is OK. But I agree with Vinod that we should dig further into the root cause, or there could be other holes (like the TCP connection leak mentioned above).