Affects Version/s: 3.1.1
Fix Version/s: None
We sometimes experience an issue in which a single application, usually a Spark job, causes at least one node in a YARN cluster to become unhealthy by filling up the local dir(s) on that node past the threshold at which the node is considered unhealthy.
When this happens, the impact is potentially large depending on what else is running on that node, as all containers on that node are lost. Sometimes not much else is running on the node and it's fine, but other times we lose AM containers from other apps and/or non-AM containers with long-running tasks.
I think it would be helpful to add an option (default false) whereby, if a node is about to become unhealthy due to full local disk(s), the node instead identifies the application that is using the most local disk space on that node and kills that application. (Roughly analogous to how the OOM killer in Linux picks one process to kill rather than letting the machine crash.)
The benefit is that only one application is impacted, and no other application loses any containers. This prevents one user's poorly written code that shuffles/spills huge amounts of data from negatively impacting other users.
The downside is that we're killing the entire application, not just the task(s) responsible for the local disk usage. I believe it's necessary to kill the whole application rather than only the container running the relevant task(s), because identifying that container would require more knowledge of the internal state of the aux services responsible for shuffling than YARN has, as far as I understand.
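As a rough illustration of the selection step, the sketch below sums on-disk usage per application across a node's local dirs and returns the largest consumer. It assumes the usual NodeManager layout of <local-dir>/usercache/<user>/appcache/<application id>. The class and method names (LocalDiskVictimPicker, pickVictim) are hypothetical and not part of YARN, and the sketch only picks a candidate; it does not kill anything.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Stream;

/** Hypothetical helper for illustration only; not an existing YARN class. */
final class LocalDiskVictimPicker {

    /** Sums the sizes of all regular files under dir (symlinks are not followed). */
    static long sizeOf(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files
                .filter(p -> Files.isRegularFile(p, LinkOption.NOFOLLOW_LINKS))
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        return 0L; // file disappeared during the scan; ignore it
                    }
                })
                .sum();
        }
    }

    /** Adds the per-application usage found under one local dir into totals. */
    static void addUsage(Path localDir, Map<String, Long> totals) throws IOException {
        // Assumed layout: <local-dir>/usercache/<user>/appcache/<application id>
        Path userCache = localDir.resolve("usercache");
        if (!Files.isDirectory(userCache)) {
            return;
        }
        try (Stream<Path> users = Files.list(userCache)) {
            for (Path user : (Iterable<Path>) users::iterator) {
                Path appCache = user.resolve("appcache");
                if (!Files.isDirectory(appCache)) {
                    continue;
                }
                try (Stream<Path> apps = Files.list(appCache)) {
                    for (Path app : (Iterable<Path>) apps::iterator) {
                        // The same application may have data in several local dirs.
                        totals.merge(app.getFileName().toString(), sizeOf(app), Long::sum);
                    }
                }
            }
        }
    }

    /** Returns the id of the application using the most local disk, if any. */
    static Optional<String> pickVictim(List<Path> localDirs) throws IOException {
        Map<String, Long> totals = new HashMap<>();
        for (Path dir : localDirs) {
            addUsage(dir, totals);
        }
        return totals.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey);
    }
}
```

Because usage is aggregated at the application directory level, this matches the granularity proposed above: it can name the application to kill without knowing which container or shuffle service wrote the data. A full directory walk may be expensive on a nearly full disk, so a real implementation would probably want to bound or cache the scan.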
If this seems reasonable, I can work on the implementation.