Details
- Sub-task
- Status: Resolved
- Major
- Resolution: Fixed
- 2.6.0
- None
- Reviewed
Description
When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully when given more retries (attempts), so it is better to set 'yarn.resourcemanager.am.max-attempts' to a larger value. However, doing so makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes the RM recovery process much slower. It might be better to limit the maximum number of attempts stored in RMStateStore.
BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small value, the number of retried attempts might be very large, so we need to delete some of the attempts stored in RMStateStore.
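To illustrate the idea of capping stored attempts, here is a minimal sketch in Python. It is not the actual RMStateStore code or the patch for this issue: the class `AttemptStore`, its `max_stored_attempts` parameter, and the attempt IDs are all hypothetical names made up for the example. It only shows one possible policy, which is to drop the oldest attempt records once a per-application limit is exceeded, so that recovery would have fewer records to load.

```python
from collections import OrderedDict


class AttemptStore:
    """Toy stand-in for the attempt records kept for one application.

    Hypothetical illustration only; this is not the YARN RMStateStore API.
    """

    def __init__(self, max_stored_attempts):
        if max_stored_attempts < 1:
            raise ValueError("max_stored_attempts must be at least 1")
        self.max_stored_attempts = max_stored_attempts
        # Insertion order doubles as age order: the oldest attempt comes first.
        self._attempts = OrderedDict()

    def store_attempt(self, attempt_id, state):
        """Record an attempt, then evict the oldest ones beyond the limit.

        Returns the list of attempt IDs that were evicted by this call.
        """
        self._attempts[attempt_id] = state
        removed = []
        while len(self._attempts) > self.max_stored_attempts:
            oldest_id, _ = self._attempts.popitem(last=False)
            removed.append(oldest_id)
        return removed

    def stored_attempt_ids(self):
        """Return the IDs still stored, oldest first."""
        return list(self._attempts)


if __name__ == "__main__":
    store = AttemptStore(max_stored_attempts=3)
    for i in range(1, 6):
        evicted = store.store_attempt(f"attempt_{i}", "FAILED")
        print(f"stored attempt_{i}, evicted: {evicted}")
    print("remaining:", store.stored_attempt_ids())
```

With a limit of 3 and five stored attempts, only the three most recent records remain. A real implementation would also have to keep the in-memory application state consistent with what was removed from the store, which this sketch does not attempt.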
Attachments
Issue Links
- is blocked by
  - YARN-1039 Add parameter for YARN resource requests to indicate "long lived" (Open)
- is related to
  - YARN-4497 RM might fail to restart when recovering apps whose attempts are missing (Resolved)
  - YARN-4584 RM startup failure when AM attempts greater than max-attempts (Resolved)
  - YARN-3668 Long run service shouldn't be killed even if Yarn crashed (Open)
- relates to
  - YARN-4929 Explore a better way than sleeping for a while in some test cases (Open)