[SLIDER-870] use timeline server as a historical source of failure information - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Slider 0.80
Fix Version/s: Slider 1.0.0
Component/s: appmaster, client
Labels: None

Description

We lose failure history when an AM dies; this hurts reporting and prevents the collection of long-term statistics.

We can use the timeline server to hold this information: save an event on each failure, then query the server on AM restart to rebuild the history and reuse it in decision making.
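A minimal sketch of that save-then-rebuild cycle, for illustration only. It does not use the real timeline server API: `InMemoryFailureStore`, `FailureEvent` and `AppMaster` are hypothetical names, and the store is an in-process stand-in for a service that outlives any single AM instance.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    """One recorded container failure (hypothetical schema)."""
    app_id: str     # application that observed the failure
    node: str       # host on which the container failed
    timestamp: int  # epoch milliseconds
    reason: str     # free-text diagnostic


class InMemoryFailureStore:
    """Hypothetical stand-in for the timeline server."""

    def __init__(self):
        self._events = []

    def put(self, event):
        # Persist one failure event.
        self._events.append(event)

    def query(self, app_id):
        # Return every stored event for the given application.
        return [e for e in self._events if e.app_id == app_id]


class AppMaster:
    """Toy AM that rebuilds its failure history on start."""

    def __init__(self, app_id, store):
        self.app_id = app_id
        self.store = store
        # On (re)start, rebuild per-node failure counts from the store.
        self.node_failures = Counter(e.node for e in store.query(app_id))

    def on_container_failed(self, node, timestamp, reason):
        # Save the event first so it survives an AM crash,
        # then update the in-memory view used for decisions.
        self.store.put(FailureEvent(self.app_id, node, timestamp, reason))
        self.node_failures[node] += 1
```

A restarted `AppMaster` built against the same store starts with the failure counts the previous instance had accumulated, which is the property the proposal is after.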

The failure events can also be presented to the user in (a) the web UI and (b) the command line, even while a cluster is not running.

Finally, stats on node failures could be aggregated across applications, possibly even across users. This would identify hotspots for node unreliability.
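One way such aggregation could look, as a rough sketch. The function name `find_hotspots`, the `(app_id, node)` event shape and both thresholds are assumptions made for illustration; the issue does not specify how a hotspot would be defined.

```python
from collections import defaultdict


def find_hotspots(events, min_failures=3, min_apps=2):
    """Return nodes that look unreliable across applications.

    events: iterable of (app_id, node) pairs, one per failure.
    A node is flagged when it has at least `min_failures` failures
    reported by at least `min_apps` distinct applications
    (both thresholds are illustrative defaults).
    """
    failures = defaultdict(int)  # node -> total failure count
    apps = defaultdict(set)      # node -> applications that saw a failure
    for app_id, node in events:
        failures[node] += 1
        apps[node].add(app_id)
    return sorted(
        node for node in failures
        if failures[node] >= min_failures and len(apps[node]) >= min_apps
    )
```

Requiring several distinct applications is a design choice in this sketch: it keeps a single misbehaving application from marking an otherwise healthy node as a hotspot.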

People

Assignee: Unassigned

Reporter: Steve Loughran

Votes: 0

Watchers: 3

Dates

Created: 12/May/15 08:17

Updated: 13/Oct/15 18:00