[SPARK-29683] Job failed due to executor failures all available nodes are blacklisted - ASF JIRA

Attach files

Attach Screenshot

Voters

Watch issue

Watchers

Create sub-task

Link

Clone

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 3.1.1
Component/s: Spark Core, YARN
Labels:
None

Description

My streaming job will fail due to executor failures all available nodes are blacklisted. This exception is thrown only when all node is blacklisted:

def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes

val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ allocatorBlacklist.keySet

After diving into the code, I found some critical conditions not be handled properly:

unchecked `excludeNodes`: it comes from user config. If not set properly, it may lead to "currentBlacklistedYarnNodes.size >= numClusterNodes". For example, we may set some nodes not in Yarn cluster.
```
excludeNodes = (invalid1, invalid2, invalid3)
clusterNodes = (valid1, valid2)
```

`numClusterNodes` may equals 0: When HA Yarn failover, it will take some time for all NodeManagers to register ResourceManager again. In this case, `numClusterNode` may equals 0 or some other number, and Spark driver failed.
too strong condition check: Spark driver will fail as long as "currentBlacklistedYarnNodes.size >= numClusterNodes". This condition should not indicate a unrecovered fatal. For example, there are some NodeManagers restarting. So we can give some waiting time before job failed.