Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Slider 0.80
None
Description
We lose failure history when an Application Master (AM) dies; this hurts reporting and prevents the collection of long-term statistics.
We can use the timeline server to hold this information: save events on failure, then query the server on AM restart to rebuild that history and reuse it in decision making.
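As a rough illustration of the save-then-rebuild flow, here is a minimal sketch. It does not use the real timeline server client: `FailureEventStore`, `InMemoryStore`, `FailureEvent` and `rebuildHistory` are hypothetical names invented for this example, and the in-memory store merely stands in for the timeline server so the sketch runs without a cluster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: persist failure events, then rebuild history on AM restart. */
class FailureHistorySketch {

    /** One failure event; the fields are illustrative assumptions. */
    static final class FailureEvent {
        final String appId;
        final String node;
        final long timestamp;
        final String cause;

        FailureEvent(String appId, String node, long timestamp, String cause) {
            this.appId = appId;
            this.node = node;
            this.timestamp = timestamp;
            this.cause = cause;
        }
    }

    /** Stand-in for the timeline server: append events, query them by application. */
    interface FailureEventStore {
        void save(FailureEvent event);
        List<FailureEvent> query(String appId);
    }

    /** In-memory implementation, used only so the example is runnable. */
    static final class InMemoryStore implements FailureEventStore {
        private final List<FailureEvent> events = new ArrayList<>();

        @Override
        public void save(FailureEvent event) {
            events.add(event);
        }

        @Override
        public List<FailureEvent> query(String appId) {
            List<FailureEvent> matches = new ArrayList<>();
            for (FailureEvent e : events) {
                if (e.appId.equals(appId)) {
                    matches.add(e);
                }
            }
            return matches;
        }
    }

    /** Called on AM restart: rebuild per-node failure counts from the stored events. */
    static Map<String, Integer> rebuildHistory(FailureEventStore store, String appId) {
        Map<String, Integer> failuresPerNode = new HashMap<>();
        for (FailureEvent e : store.query(appId)) {
            failuresPerNode.merge(e.node, 1, Integer::sum);
        }
        return failuresPerNode;
    }

    public static void main(String[] args) {
        FailureEventStore store = new InMemoryStore();
        // The first AM instance records failures as they happen.
        store.save(new FailureEvent("app-1", "node-a", 1L, "container exit"));
        store.save(new FailureEvent("app-1", "node-a", 2L, "container exit"));
        // A restarted AM has no local state, so it queries the store instead.
        System.out.println(rebuildHistory(store, "app-1"));
    }
}
```

Keeping the store behind an interface means the AM's decision logic depends only on `save` and `query`, so the backing service could be swapped without touching that logic.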
The saved failure events can also be presented to the user in (a) the web UI and (b) from the command line, even while a cluster is not running.
Finally, stats on node failures could be aggregated across applications, possibly even across users. This would identify hotspots for node unreliability.
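One possible shape for that aggregation is sketched below, assuming failure events from several applications have already been fetched into a single list. The class `NodeHotspotSketch`, its `NodeFailure` type and the threshold parameter are all hypothetical; the issue does not specify how a hotspot would be defined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: aggregate node failures across applications. */
class NodeHotspotSketch {

    /** A failure of some application's container on a node (illustrative fields). */
    static final class NodeFailure {
        final String appId;
        final String node;

        NodeFailure(String appId, String node) {
            this.appId = appId;
            this.node = node;
        }
    }

    /** Count failures per node, regardless of which application reported them. */
    static Map<String, Integer> failuresPerNode(List<NodeFailure> failures) {
        Map<String, Integer> counts = new TreeMap<>();
        for (NodeFailure f : failures) {
            counts.merge(f.node, 1, Integer::sum);
        }
        return counts;
    }

    /** Nodes whose total failure count reaches the (assumed) threshold, in name order. */
    static List<String> hotspots(List<NodeFailure> failures, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : failuresPerNode(failures).entrySet()) {
            if (e.getValue() >= threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<NodeFailure> failures = new ArrayList<>();
        failures.add(new NodeFailure("app-1", "node-a"));
        failures.add(new NodeFailure("app-2", "node-a"));
        failures.add(new NodeFailure("app-2", "node-b"));
        System.out.println(hotspots(failures, 2));
    }
}
```

A raw count is the simplest measure; counting distinct applications per node instead would reduce the weight of a single misbehaving application.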