There's an ambiguity between sick nodes (typically due to failing hardware: hard drive, memory, or occasionally NIC/network switch) and nodes that have been rendered unresponsive by user abuse. The existing blacklist heuristics touch on this, but they're a bit ad hoc, and there's not much visibility into the internal state at any given time.
One improvement would be to track the per-node, per-job blacklisting history in a sliding window that's divided into buckets of some suitable granularity. Bad hardware would tend to show up as an elevated fault level on one node (or a few nodes) for an extended period, i.e., multiple buckets, while abusive jobs would tend to show up as a spike (ideally), or at least a limited-duration jump in faults (one or a few buckets), across many nodes.
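As a rough sketch of what such a tracker might look like, assuming hourly buckets over a 24-hour window; the class name NodeFaultWindow, the thresholds, and the bucket sizes are invented here for illustration, not taken from the existing blacklist code:

    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Hypothetical per-node fault history kept in a sliding window
     *  of fixed-width buckets. All names/thresholds are assumptions. */
    public class NodeFaultWindow {
        private static final int NUM_BUCKETS = 24;           // window size
        private static final long BUCKET_MILLIS = 3_600_000L; // one hour

        /** One bucket: its aligned start time plus a fault count. */
        private static final class Bucket {
            final long startMillis;
            int faults;
            Bucket(long startMillis) { this.startMillis = startMillis; }
        }

        private final Deque<Bucket> buckets = new ArrayDeque<>();

        /** Record one blacklisting fault against this node at time 'now'. */
        public void recordFault(long now) {
            // Evict buckets that have slid out of the window.
            long cutoff = now - (long) NUM_BUCKETS * BUCKET_MILLIS;
            while (!buckets.isEmpty() && buckets.peekFirst().startMillis < cutoff) {
                buckets.removeFirst();
            }
            // Open a new bucket if 'now' falls past the current one.
            long bucketStart = now - (now % BUCKET_MILLIS);
            Bucket head = buckets.peekLast();
            if (head == null || head.startMillis < bucketStart) {
                head = new Bucket(bucketStart);
                buckets.addLast(head);
            }
            head.faults++;
        }

        /** Number of buckets in the window with at least one fault. */
        public int faultyBuckets() {
            int n = 0;
            for (Bucket b : buckets) {
                if (b.faults > 0) {
                    n++;
                }
            }
            return n;
        }

        /** Sustained elevation (faults spread across several buckets)
         *  suggests sick hardware; the threshold is an assumption. */
        public boolean looksLikeBadHardware() {
            return faultyBuckets() >= 3;
        }
    }

A parallel per-job map of the same windows would capture the other signature described above: an abusive job shows up as faults landing on many nodes within one or two buckets.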
Because the heuristics are open to argument even among experts (which would not include me), and because automatic, hardcoded blacklisting has the potential to wipe out a good fraction of a cluster for the wrong reasons, it would seem best to convert the heuristic form of blacklisting to an advisory mode (i.e., "graylisting") until the behavior is better understood.
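To make the advisory mode concrete, a hypothetical graylisting check could wrap the per-node windows sketched above: it reports a node as graylisted but leaves scheduling untouched, so operators (or later tooling) decide whether to act. GraylistAdvisor and its Health states are invented names, not an existing API:

    import java.util.HashMap;
    import java.util.Map;

    /** Hypothetical advisory check: the heuristic that used to
     *  blacklist a node now only graylists it, i.e. reports the
     *  state without removing the node from scheduling. */
    public class GraylistAdvisor {
        public enum Health { OK, GRAYLISTED }

        private final Map<String, NodeFaultWindow> windows = new HashMap<>();

        /** Feed each blacklisting fault into the node's window. */
        public void recordFault(String node, long now) {
            windows.computeIfAbsent(node, n -> new NodeFaultWindow())
                   .recordFault(now);
        }

        /** Advisory only: callers may log or export this, but the
         *  scheduler keeps assigning work to GRAYLISTED nodes. */
        public Health check(String node) {
            NodeFaultWindow w = windows.get(node);
            return (w != null && w.looksLikeBadHardware())
                    ? Health.GRAYLISTED : Health.OK;
        }
    }

Keeping the decision out of the scheduler means a bad threshold costs only noise in the advisory output, not capacity, which is the point of graylisting until the behavior is better understood.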