My streaming job will fail due to executor failures all available nodes are blacklisted. This exception is thrown only when all node is blacklisted:
After diving into the code, I found some critical conditions not be handled properly:
- unchecked `excludeNodes`: it comes from user config. If not set properly, it may lead to "currentBlacklistedYarnNodes.size >= numClusterNodes". For example, we may set some nodes not in Yarn cluster.
- `numClusterNodes` may equals 0: When HA Yarn failover, it will take some time for all NodeManagers to register ResourceManager again. In this case, `numClusterNode` may equals 0 or some other number, and Spark driver failed.
- too strong condition check: Spark driver will fail as long as "currentBlacklistedYarnNodes.size >= numClusterNodes". This condition should not indicate a unrecovered fatal. For example, there are some NodeManagers restarting. So we can give some waiting time before job failed.