Steve Loughran Sunil G, if you have one or two nodes and the AM container of an app fails, yarn.am.blacklisting.disable-failure-threshold ensures that the entire cluster cannot be blacklisted for that app: once the blacklist would exceed the threshold, it is cleared and all nodes become available again. Again, this is per-app behavior; other apps are not affected by this decision whatsoever.
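To make that concrete, here is a minimal sketch (hypothetical, not the actual RM code) of how a per-app disable-failure-threshold keeps AM blacklisting from starving an app; the class and method names are mine, only the config property name comes from YARN:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: per-app AM blacklist with a disable threshold, assuming the
// threshold is a fraction of the cluster size (e.g. 0.8).
public class AmBlacklistSketch {
    private final Set<String> blacklistedNodes = new HashSet<>();
    private final double disableFailureThreshold; // per app

    public AmBlacklistSketch(double disableFailureThreshold) {
        this.disableFailureThreshold = disableFailureThreshold;
    }

    public void addFailedNode(String node) {
        blacklistedNodes.add(node);
    }

    /** Nodes the scheduler should avoid for this app's next AM attempt. */
    public Set<String> effectiveBlacklist(int clusterNodeCount) {
        // If the blacklist would cover more than the threshold fraction of
        // the cluster, ignore it entirely for this app rather than leaving
        // the app with no (or too few) nodes to run on.
        if (blacklistedNodes.size() > disableFailureThreshold * clusterNodeCount) {
            return new HashSet<>();
        }
        return new HashSet<>(blacklistedNodes);
    }
}
```

On a two-node cluster with the default-style threshold of 0.8, one blacklisted node is already above 0.8 * 2, so the blacklist is ignored and both nodes stay usable.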
As for the condition for applying blacklisting, I think we can add PREEMPTED to the list of exit statuses that do not trigger blacklisting. I'm less sure about KILLED_BY_RESOURCEMANAGER: an AM container can be killed by the ResourceManager because of a node issue. Any failure to heartbeat properly will cause the RM to kill the AM container, but that heartbeat failure can have many causes. Just because the container was killed by the RM doesn't definitively mean it was purely an app problem. What do you think?
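A sketch of the condition being debated, using the real org.apache.hadoop.yarn.api.records.ContainerExitStatus constants but a hypothetical method of my own (this is not YARN's actual implementation):

```java
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;

public final class AmBlacklistConditionSketch {
    private AmBlacklistConditionSketch() {}

    /** True if this AM container exit should count towards blacklisting the node. */
    public static boolean countsTowardsNodeBlacklisting(int exitStatus) {
        switch (exitStatus) {
            case ContainerExitStatus.PREEMPTED:
                // Proposed addition: preemption is a scheduling decision,
                // not a node problem, so don't penalize the node.
                return false;
            case ContainerExitStatus.DISKS_FAILED:
                // Clearly a node problem; blacklisting is appropriate.
                return true;
            case ContainerExitStatus.KILLED_BY_RESOURCEMANAGER:
                // Ambiguous, as argued above: the RM kill may stem from a
                // node-side heartbeat problem, so the conservative choice
                // is to keep counting it towards blacklisting.
                return true;
            default:
                return true;
        }
    }
}
```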
I think we may want to approach this from the point of view of anti-affinity. Currently there is an inherent affinity to nodes when assigning AM containers. In my view, anti-affinity is a better default. Even in the worst case, where the AM container failures were caused purely by the app, running subsequent attempts on different nodes only makes it clearer that the failures were unrelated to the nodes. This helps troubleshooting a great deal: today, when all AM containers land on the same node, we sometimes spend a fair amount of time convincing our users that the failure had nothing to do with the node. A rough sketch of the idea is below.
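This is a minimal sketch of the anti-affinity placement I have in mind, not existing scheduler code; all names here are hypothetical:

```java
import java.util.List;
import java.util.Set;

public final class AmAntiAffinitySketch {
    private AmAntiAffinitySketch() {}

    /** Prefer a node that no previous AM attempt of this app ran on. */
    public static String pickAmNode(List<String> candidateNodes,
                                    Set<String> previousAttemptNodes) {
        for (String node : candidateNodes) {
            if (!previousAttemptNodes.contains(node)) {
                return node; // spread attempts across distinct nodes
            }
        }
        // Every candidate already hosted an attempt (e.g. a one- or
        // two-node cluster): fall back rather than refusing to schedule.
        return candidateNodes.isEmpty() ? null : candidateNodes.get(0);
    }
}
```

Note the fallback keeps small clusters working: anti-affinity here is a soft preference, not a hard constraint.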
Thoughts and comments are welcome. Thanks!