Is that different users, app names, or something else?
Yes. App name is the first point that came to my mind. As you mentioned, the challenge here is to identify the truly buggy application when it arrives as part of a workflow. There can also be genuine cases where a workflow of jobs failed because of a node problem.
To overcome this, multiple inputs can be considered, such as app name, user, queue, etc.
Suppose an app from "user1" with name "job1" failed on node1. If the same app name "job1" fails again on the same node, the immediate history or the AMs currently running on that node can be cross-checked. This may give a better idea of the behavior of that node.
In simple words, a sample rate of 2 or more (different applications, categorized by name/user etc.) always has to be considered before taking a decision on a node.
If an app from "user1" with name "job2" then fails on node1, it is very much appropriate to try its second attempt on a different node.
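The sample-rate idea above can be sketched roughly as follows. This is only an illustration, not an actual YARN implementation; the class and threshold names are hypothetical. The key point is that only *distinct* (user, app name) pairs count toward flagging a node, so one buggy application failing repeatedly cannot get a node marked as suspect on its own.

```python
from collections import defaultdict

MIN_DISTINCT_FAILURES = 2  # the "sample rate of 2 or more" discussed above


class NodeFailureTracker:
    """Hypothetical sketch: flag a node only after AM failures
    from at least two distinct applications (keyed by user + app name)."""

    def __init__(self, min_distinct=MIN_DISTINCT_FAILURES):
        self.min_distinct = min_distinct
        # node -> set of (user, app_name) whose AM attempts failed there
        self.failures = defaultdict(set)

    def record_failure(self, node, user, app_name):
        self.failures[node].add((user, app_name))

    def is_suspect(self, node):
        # Repeated failures of the same app collapse into one entry,
        # so the count below is the number of distinct failing apps.
        return len(self.failures[node]) >= self.min_distinct


tracker = NodeFailureTracker()
tracker.record_failure("node1", "user1", "job1")
tracker.record_failure("node1", "user1", "job1")  # same app again: still 1 distinct
assert not tracker.is_suspect("node1")
tracker.record_failure("node1", "user1", "job2")  # a second distinct app fails
assert tracker.is_suspect("node1")
```

Under this scheme, "job1" failing twice on node1 does not flag the node, but "job1" plus "job2" both failing there does, matching the scenarios described above.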
However, we need a "don't even go there" level to avoid rapid rescheduling of failed AM attempts on the same node in a busy cluster.
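One way such a "don't even go there" level could work is a time-based cooldown: once a node accumulates several AM failures within a short window, skip it for a fixed period instead of immediately retrying there. This is a minimal sketch under assumed values; the threshold, window, and names are illustrative, not from any actual scheduler.

```python
import time

FAILURE_THRESHOLD = 3       # failures within the window that trip the block (assumed)
COOLDOWN_SECONDS = 10 * 60  # how long to avoid the node afterwards (assumed)


class NodeCooldown:
    """Hypothetical sketch: block a node for a cooldown window after
    rapid repeated AM failures, so attempts are not rescheduled there."""

    def __init__(self, now=time.monotonic):
        self.now = now                 # injectable clock for testing
        self.failure_times = {}        # node -> recent failure timestamps
        self.blocked_until = {}        # node -> time when it becomes usable again

    def record_failure(self, node):
        t = self.now()
        # keep only failures inside the sliding window
        recent = [x for x in self.failure_times.get(node, [])
                  if t - x < COOLDOWN_SECONDS]
        recent.append(t)
        self.failure_times[node] = recent
        if len(recent) >= FAILURE_THRESHOLD:
            self.blocked_until[node] = t + COOLDOWN_SECONDS

    def can_schedule(self, node):
        return self.now() >= self.blocked_until.get(node, 0.0)
```

Time being a decision-making input, as noted below, fits naturally here: the block expires on its own, so a node is not penalized forever for a transient problem.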
This is one of my real intentions as well. Continuous monitoring of the cluster along with its historical data will play a pivotal role here, and time also has to be one of the decision-making inputs. I will jot down a few points and share them as a doc, and we can see whether this adds value to the system without opening up a way to game it.