Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 3.0.0
- Labels: None
Description
My streaming job fails with "Due to executor failures all available nodes are blacklisted". This exception is thrown only when every node is blacklisted:
```scala
def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ allocatorBlacklist.keySet
```
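For reference, here is a tiny standalone sketch of how these two pieces interact, with made-up values (the real fields live in Spark's YarnAllocatorBlacklistTracker):

```scala
// Standalone sketch of the check above; values are invented for illustration.
object BlacklistCheckSketch extends App {
  val excludeNodes       = Set.empty[String]       // user-configured excluded nodes
  val schedulerBlacklist = Set("node1", "node2")   // blacklisted after executor failures
  val allocatorBlacklist = Map.empty[String, Long] // node -> blacklist expiry

  // Mirrors allBlacklistedNodes from the snippet above.
  val currentBlacklistedYarnNodes =
    excludeNodes ++ schedulerBlacklist ++ allocatorBlacklist.keySet

  val numClusterNodes = 2 // NodeManagers registered with the ResourceManager

  // Mirrors isAllNodeBlacklisted: fires once every registered node is blacklisted.
  println(currentBlacklistedYarnNodes.size >= numClusterNodes) // true: 2 >= 2
}
```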
After diving into the code, I found several critical conditions that are not handled properly:
- unchecked `excludeNodes`: this set comes from user configuration and is never validated. If it is set improperly, it can make `currentBlacklistedYarnNodes.size >= numClusterNodes` hold even though healthy nodes exist, e.g. when the configured nodes are not part of the YARN cluster at all:
  excludeNodes = (invalid1, invalid2, invalid3), clusterNodes = (valid1, valid2)
- `numClusterNodes` may equal 0: during a YARN ResourceManager HA failover, it takes some time for all NodeManagers to re-register with the ResourceManager. In that window `numClusterNodes` may be 0 (or some other number below the real cluster size), and the Spark driver fails.
- overly strict condition check: the Spark driver fails as soon as `currentBlacklistedYarnNodes.size >= numClusterNodes` holds. This condition alone does not indicate an unrecoverable fatal state; for example, some NodeManagers may just be restarting. We could allow some waiting time before failing the job (see the defensive sketch after this list).
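Putting the three points together, here is a minimal sketch of what a more defensive check could look like. This is only an illustration, not the actual patch merged into Spark; all names (`DefensiveBlacklistTracker`, `registeredClusterNodes`, `graceMillis`, `shouldFailJob`) are hypothetical stand-ins for the real allocator state.

```scala
import java.util.concurrent.TimeUnit

// Hypothetical defensive variant of the check; not the actual Spark fix.
class DefensiveBlacklistTracker(clock: () => Long = () => System.currentTimeMillis()) {
  var excludeNodes: Set[String] = Set.empty             // user config, unvalidated
  var schedulerBlacklist: Set[String] = Set.empty       // nodes blacklisted by the scheduler
  var allocatorBlacklist: Map[String, Long] = Map.empty // node -> blacklist expiry
  var registeredClusterNodes: Set[String] = Set.empty   // from the RM; can shrink to 0 on failover

  private val graceMillis = TimeUnit.MINUTES.toMillis(5) // waiting time before giving up
  private var allBlacklistedSince: Option[Long] = None

  def shouldFailJob: Boolean = {
    // (1) Count only excluded nodes that actually exist in the cluster,
    //     so invalid entries in the user config cannot trip the threshold.
    val effectiveExcludes = excludeNodes.intersect(registeredClusterNodes)
    val blacklisted = effectiveExcludes ++ schedulerBlacklist ++ allocatorBlacklist.keySet

    // (2) During an RM failover numClusterNodes can be 0; never fail then.
    val numClusterNodes = registeredClusterNodes.size
    if (numClusterNodes == 0) {
      allBlacklistedSince = None
      return false
    }

    // (3) Require the condition to hold for a grace period before failing,
    //     so briefly restarting NodeManagers do not kill the job.
    if (blacklisted.size >= numClusterNodes) {
      val now = clock()
      if (allBlacklistedSince.isEmpty) allBlacklistedSince = Some(now)
      now - allBlacklistedSince.get >= graceMillis
    } else {
      allBlacklistedSince = None
      false
    }
  }
}
```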
Issue Links
- is caused by SPARK-16630: Blacklist a node if executors won't launch on it. (Resolved)