We occasionally see hiccups sending assignments to supervisors, and these are usually transient. But we have seen a more persistent failure when a supervisor's disk became read-only. The supervisor remained up but was unable to start workers. Nimbus kept trying to send it assignments and kept failing, but it just swallowed the exception and carried on.
We should be able to report these failures to the blacklist scheduler and add the node to the blacklist once the failures pass some threshold.
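
A rough sketch of the kind of bookkeeping this would need, to make the proposal concrete. This is not existing Storm code: the class `AssignmentFailureTracker`, its methods, and the choice of counting consecutive failures (rather than failures within a time window) are all assumptions for illustration. The idea is that Nimbus would call `recordFailure` where it currently swallows the exception, and hand the node to the blacklist scheduler when the call returns true.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Storm: counts consecutive
// assignment-send failures per supervisor node.
class AssignmentFailureTracker {
    private final int threshold;
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();

    AssignmentFailureTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    // Record one failed send. Returns true once the node has reached the
    // threshold, i.e. the caller should ask for it to be blacklisted.
    synchronized boolean recordFailure(String nodeId) {
        int count = consecutiveFailures.merge(nodeId, 1, Integer::sum);
        return count >= threshold;
    }

    // A successful send clears the count, so transient hiccups never
    // accumulate toward the threshold.
    synchronized void recordSuccess(String nodeId) {
        consecutiveFailures.remove(nodeId);
    }

    // Current consecutive failure count for a node (0 if none recorded).
    synchronized int failureCount(String nodeId) {
        return consecutiveFailures.getOrDefault(nodeId, 0);
    }
}
```

Resetting on success is what separates the transient case from the read-only-disk case: a supervisor that fails once and then recovers never gets near the threshold, while one that fails every time reaches it after `threshold` attempts.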