Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 1.0.0
- None
Description
When standalone Executors trying to run a particular Application fail a cumulative ApplicationState.MAX_NUM_RETRY times, the Master removes the Application, even if a number of other Executors are successfully running it. This makes long-running standalone-mode Applications in particular unnecessarily vulnerable to limited failures in the cluster. For example, a single bad node on which Executors repeatedly fail for any reason can prevent an Application from starting, or can cause a running Application to be removed even though it could continue to run successfully (just without making use of all potential Workers and Executors).
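The failure mode can be illustrated with a small toy model. This is only a sketch and not the Master's actual code: `AppInfo`, `on_executor_failed`, `simulate`, the `require_no_running` flag, and the value chosen for `MAX_NUM_RETRY` are all hypothetical names and assumptions introduced for illustration. The flag switches between the behavior described here (remove the Application once the cumulative failure count reaches the limit) and one possible mitigation (remove it only if no Executors are still running).

```python
from dataclasses import dataclass, field

MAX_NUM_RETRY = 10  # assumed value, for illustration only


@dataclass
class AppInfo:
    """Toy stand-in for the Master's per-application bookkeeping (hypothetical)."""
    running_executors: set = field(default_factory=set)
    retry_count: int = 0
    removed: bool = False


def on_executor_failed(app, executor_id, require_no_running=False):
    """Record one abnormal executor exit and decide whether to remove the app.

    require_no_running=False models the behavior described in this issue;
    require_no_running=True models one possible mitigation.
    """
    app.running_executors.discard(executor_id)
    app.retry_count += 1  # cumulative across all executors of the app
    if app.retry_count < MAX_NUM_RETRY:
        return  # under the limit: the app stays and may be rescheduled
    if require_no_running and app.running_executors:
        return  # healthy executors remain: keep the app alive
    app.removed = True


def simulate(require_no_running):
    """Two healthy executors plus one bad node that fails every executor it launches."""
    app = AppInfo(running_executors={"exec-good-1", "exec-good-2"})
    for i in range(MAX_NUM_RETRY):
        bad_id = f"exec-bad-{i}"
        app.running_executors.add(bad_id)
        on_executor_failed(app, bad_id, require_no_running)
    return app
```

In this model, `simulate(False)` ends with the Application removed while both healthy Executors are still listed as running, whereas `simulate(True)` keeps the Application alive after the same sequence of failures.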
Issue Links
- is related to
  - SPARK-2424 ApplicationState.MAX_NUM_RETRY should be configurable (Resolved)
- relates to
  - SPARK-3289 Avoid job failures due to rescheduling of failing tasks on buggy machines (Resolved)
- links to