Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Description
Found this while reviewing YARN-3480.
When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to a small value, the number of retried attempts can become very large, so we need to delete some of the attempts stored in RMStateStore.
I think we need a lower limit on the failure-validity interval to avoid situations like this.
Having one will avoid pardoning too many failures in too short a duration.
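
To make the proposal concrete, here is a minimal sketch of what such a lower limit could look like. It is not taken from any patch: the class name, the method name, and the 10-second floor are all hypothetical, and it assumes a non-positive interval means the validity window is disabled, which is how I read the YARN-611 semantics.

```java
public final class ValidityIntervalFloor {

    // Hypothetical floor in milliseconds; this issue does not propose a specific value.
    static final long MIN_VALIDITY_INTERVAL_MS = 10_000L;

    // Returns the interval the RM would actually use for a requested value.
    // Assumption: a non-positive value means the validity window is disabled,
    // so it is passed through unchanged rather than raised to the floor.
    static long effectiveInterval(long requestedMs) {
        if (requestedMs <= 0) {
            return requestedMs;
        }
        return Math.max(requestedMs, MIN_VALIDITY_INTERVAL_MS);
    }

    public static void main(String[] args) {
        System.out.println(effectiveInterval(50L));      // below the floor, raised to it
        System.out.println(effectiveInterval(60_000L));  // above the floor, unchanged
        System.out.println(effectiveInterval(-1L));      // disabled, unchanged
    }
}
```

Whether a too-small value should be silently raised like this or rejected at submission time is an open design choice; raising it keeps existing applications running, while rejecting it makes the misconfiguration visible.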