We have a cluster in which one node has a misconfigured Linux Container Executor (LCE). Every time an AM or regular container is launched on that node, it fails. The node still reports available resources, so it keeps failing apps until the administrator notices the issue and decommissions the node. AM blacklisting only helps if the application is already running.
As a possible improvement, when the LCE is used on the cluster and an NM gets certain errors back from the LCE, such as error 24 (configuration not found), we should either stop allocating anything on that node or shut the node down entirely. That kind of problem normally does not fix itself, and it means that nothing can really run on that node.
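To make the proposal concrete, here is a minimal sketch of the bookkeeping the NM could do. It is not taken from the YARN code base: the class `NodeLaunchGuard` and its methods are hypothetical names, and the set of "fatal" exit codes is an assumption that would have to be agreed on (only error 24 is mentioned above).

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names exist in YARN.
// Tracks whether a node should keep receiving container allocations,
// based on the exit codes returned by the container executor.
final class NodeLaunchGuard {
    // Exit codes treated as unrecoverable without admin action (assumption).
    private final Set<Integer> fatalExitCodes;
    private boolean schedulable = true;
    private int lastFatalExitCode = -1;

    NodeLaunchGuard(Set<Integer> fatalExitCodes) {
        this.fatalExitCodes =
            Collections.unmodifiableSet(new HashSet<>(fatalExitCodes));
    }

    // Called after a container launch fails with the given exit code.
    // A fatal code latches the node into the unschedulable state;
    // any other code leaves the state unchanged.
    synchronized void onLaunchFailure(int exitCode) {
        if (fatalExitCodes.contains(exitCode)) {
            schedulable = false;
            lastFatalExitCode = exitCode;
        }
    }

    // True while no fatal executor error has been seen on this node.
    synchronized boolean isSchedulable() {
        return schedulable;
    }

    // Short message that could be surfaced to the administrator.
    synchronized String healthReport() {
        return schedulable
            ? "OK"
            : "Container executor returned fatal exit code " + lastFatalExitCode;
    }
}
```

In this sketch the state is latched: once a fatal code is seen, the node stays unschedulable until the guard is recreated (for example after an NM restart), which matches the observation that this kind of problem does not fix itself. Whether the NM should report itself unhealthy or shut down entirely is left open.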