Description
SLIDER-77 proposed weighted moving averages for failures. This has some flaws:
- it's hard to understand and configure
- different cluster sizes need different default values
- if you flex a cluster, the threshold may become inappropriate
I propose something more tangible and closer to how physical nodes are tracked: the percentage of instances failing over a time period.
For example, we could define a functional hbase cluster as:
- 200% of masters failing per day (for two masters == 4 failures)
- 80% of region servers failing per day (for 20 region servers, that's 16 failures)
Every day the counter could be reset.
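To make the proposal concrete, here is a minimal sketch of a per-role failure tracker with a percentage threshold and a daily reset window. The class and names (`RoleFailureTracker`, `threshold_percent`, `window_secs`) are illustrative only, not actual Slider APIs:

```python
import time

class RoleFailureTracker:
    """Illustrative sketch: a role is unhealthy when failures within the
    current window exceed a percentage of the role's desired instance count."""

    def __init__(self, desired, threshold_percent, window_secs=24 * 3600):
        self.desired = desired                    # desired instance count
        self.threshold_percent = threshold_percent
        self.window_secs = window_secs            # e.g. one day
        self.failures = 0
        self.window_start = time.time()

    def record_failure(self, now=None):
        now = time.time() if now is None else now
        # reset the counter when the window rolls over
        if now - self.window_start >= self.window_secs:
            self.failures = 0
            self.window_start = now
        self.failures += 1

    def over_threshold(self):
        # e.g. 200% of 2 masters == 4 failures; 80% of 20 region servers == 16
        limit = self.desired * self.threshold_percent / 100
        return self.failures > limit

# two masters with a 200%/day threshold: the limit is 4 failures
masters = RoleFailureTracker(desired=2, threshold_percent=200)
for _ in range(4):
    masters.record_failure()
print(masters.over_threshold())  # False: 4 failures is at the limit, not over
masters.record_failure()
print(masters.over_threshold())  # True: the 5th failure crosses the limit
```

Whether the limit should be "greater than" or "greater than or equal" is a detail to settle; the sketch treats the threshold as the last tolerated count.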
Flexing complicates the equation: it may be simplest just to reset the counters, at least when scaling down. Otherwise, if a 20-worker cluster had a failure count of 5 and a 40% threshold, all would be well, but scaling it down to 10 nodes would immediately put the failure count over the limit.
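The scale-down hazard can be shown with the raw arithmetic (the `over_threshold` helper here is purely illustrative):

```python
def over_threshold(failures, desired, threshold_percent):
    # a role is unhealthy when failures exceed threshold_percent of the
    # desired instance count
    return failures > desired * threshold_percent / 100

# 20 workers, 40% threshold, 5 failures: the limit is 8, so all is well
print(over_threshold(5, 20, 40))   # False

# scale down to 10 workers without resetting: the limit drops to 4,
# and the same 5 failures are suddenly over it
print(over_threshold(5, 10, 40))   # True

# resetting the counter on scale-down avoids the spurious trip
print(over_threshold(0, 10, 40))   # False
```

This is why resetting on scale-down is the simplest safe policy: the old failure count was accumulated against a larger denominator and is meaningless against the new one.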