Details
Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Fixed
Sprint: Slider August #1, Slider August #2, Slider September #1
Description
Use sliding windows and/or weighted moving averages to track container failures over time, and react only when many containers fail within a short period.
What we want here is to react fast to a sudden series of failures, while also looking at the average failure rate over time. I think separating startup failures from operational failures could help here. We don't want 5 failures in 5 minutes to be ignored just because everything worked well for the previous month.
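To make the sliding-window idea concrete, here is a minimal sketch in Java. It is not Slider's actual implementation: the class name `SlidingWindowFailureTracker`, its methods, and the threshold and window values are all hypothetical, chosen only to illustrate the "5 failures in 5 minutes" case. It assumes timestamps are supplied in non-decreasing order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch: counts failures inside a sliding time window.
 * Not part of Slider; names and behaviour are assumptions for illustration.
 */
final class SlidingWindowFailureTracker {
    private final int threshold;      // failures within the window that trigger a reaction
    private final long windowMillis;  // length of the sliding window
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    SlidingWindowFailureTracker(int threshold, long windowMillis) {
        if (threshold < 1 || windowMillis <= 0) {
            throw new IllegalArgumentException("threshold and window must be positive");
        }
        this.threshold = threshold;
        this.windowMillis = windowMillis;
    }

    /**
     * Records a failure at the given time.
     * Returns true if the number of failures still inside the window
     * has reached the threshold.
     */
    synchronized boolean recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
        evictExpired(nowMillis);
        return failureTimes.size() >= threshold;
    }

    /** Number of failures that are still inside the window at the given time. */
    synchronized int recentFailures(long nowMillis) {
        evictExpired(nowMillis);
        return failureTimes.size();
    }

    /** Drops failures that happened windowMillis or more before nowMillis. */
    private void evictExpired(long nowMillis) {
        long cutoff = nowMillis - windowMillis;
        while (!failureTimes.isEmpty() && failureTimes.peekFirst() <= cutoff) {
            failureTimes.removeFirst();
        }
    }
}
```

With a threshold of 5 and a 5-minute window, five failures one minute apart trip the tracker, while the same five failures spread over a month never do. Keeping one tracker instance for startup failures and another for operational failures would be one way to treat the two cases separately.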
Issue Links
- is related to: SLIDER-309 add functional tests of sliding-window failure handling (Open)
- is related to: SLIDER-310 failure thresholds to be settable per-role (Resolved)
- relates to: YARN-611 Add an AM retry count reset window to YARN RM (Closed)