Details
Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version: 3.3.2
Description
In a production environment, we hit the following issue:
Suppose we request 10 containers from nodeA and nodeB. The first response from YARN returns 5 containers on nodeA and nodeB, and then nodeA is blacklisted. The second response from YARN may still return some containers on nodeA, and those containers are launched. When such a container (executor) finishes setup and sends its register request to the driver, the request is rejected. This failure is counted toward spark.yarn.max.executor.failures and can cause the application to fail with the error:
Max number of executor failures ($maxNumExecutorFailures) reached
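
The sequence above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's actual code: the FailureTracker class, its method names, the threshold of 3, and the skip_excluded switch are all hypothetical and exist only to contrast "every rejected registration is counted" with "rejections from an already blacklisted node are ignored".

```python
from dataclasses import dataclass, field


@dataclass
class FailureTracker:
    """Toy executor-failure counter (hypothetical, not a Spark class)."""
    max_failures: int
    excluded_nodes: set = field(default_factory=set)
    failures: int = 0

    def on_register_rejected(self, node: str, skip_excluded: bool) -> None:
        # skip_excluded=False models the reported behaviour: every rejected
        # registration counts as an executor failure.
        # skip_excluded=True models a hypothetical alternative: rejections
        # from a node that is already blacklisted are not counted.
        if skip_excluded and node in self.excluded_nodes:
            return
        self.failures += 1

    def app_failed(self) -> bool:
        # Assumed threshold check; the exact comparison in Spark may differ.
        return self.failures >= self.max_failures


def simulate(late_containers: int, skip_excluded: bool) -> FailureTracker:
    """Replay late-arriving containers on the blacklisted node nodeA."""
    tracker = FailureTracker(max_failures=3, excluded_nodes={"nodeA"})
    for _ in range(late_containers):
        # Each late container starts an executor whose registration
        # is rejected by the driver because nodeA is blacklisted.
        tracker.on_register_rejected("nodeA", skip_excluded)
    return tracker


if __name__ == "__main__":
    counted = simulate(late_containers=5, skip_excluded=False)
    ignored = simulate(late_containers=5, skip_excluded=True)
    print("counted:", counted.failures, "app failed:", counted.app_failed())
    print("ignored:", ignored.failures, "app failed:", ignored.app_failed())
```

With counting enabled, the five late containers on nodeA alone exceed the assumed limit, even though no executor failed for a reason related to the application itself.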