Description
SLIDER-77 proposed weighted moving averages for failures. This has some flaws:
- it's hard to understand and configure
- different cluster sizes need different default values
- if you flex a cluster, the threshold may become inappropriate
I propose something more tangible and closer to how physical nodes are tracked: the percentage of instances failing over a time period.
For example, we could define a functional hbase cluster as:
- 200% of masters failing per day (for two masters == 4 failures)
- 80% of region servers failing per day (for 20 region servers, that's 16 failures)
Every day the counter could be reset.
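To make the proposal concrete, here is a minimal sketch of a per-role failure tracker with a percentage threshold and a daily reset window. The class and names (`RoleFailureTracker`, `threshold_percent`, `window_secs`) are illustrative only, not actual Slider APIs:

```python
import time

class RoleFailureTracker:
    """Illustrative sketch: a role is unhealthy when failures within the
    current window exceed a percentage of the role's desired instance count."""

    def __init__(self, desired, threshold_percent, window_secs=24 * 3600):
        self.desired = desired                    # desired instance count
        self.threshold_percent = threshold_percent
        self.window_secs = window_secs            # e.g. one day
        self.failures = 0
        self.window_start = time.time()

    def record_failure(self, now=None):
        now = time.time() if now is None else now
        # reset the counter when the window rolls over
        if now - self.window_start >= self.window_secs:
            self.failures = 0
            self.window_start = now
        self.failures += 1

    def over_threshold(self):
        # e.g. 200% of 2 masters == 4 failures; 80% of 20 region servers == 16
        limit = self.desired * self.threshold_percent / 100
        return self.failures > limit

# two masters with a 200%/day threshold: the limit is 4 failures
masters = RoleFailureTracker(desired=2, threshold_percent=200)
for _ in range(4):
    masters.record_failure()
print(masters.over_threshold())  # False: 4 failures is at the limit, not over
masters.record_failure()
print(masters.over_threshold())  # True: the 5th failure crosses the limit
```

Whether the limit should be "greater than" or "greater than or equal" is a detail to settle; the sketch treats the threshold as the last tolerated count.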
Flexing complicates the equation: it may be simplest just to reset the counters, at least when scaling down. Otherwise, if a 20-worker cluster had a failure count of 5 and a 40% threshold, all would be well, but scaling it down to 10 nodes would immediately put the failure count over the limit.
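The scale-down hazard can be shown with the raw arithmetic (the `over_threshold` helper here is purely illustrative):

```python
def over_threshold(failures, desired, threshold_percent):
    # a role is unhealthy when failures exceed threshold_percent of the
    # desired instance count
    return failures > desired * threshold_percent / 100

# 20 workers, 40% threshold, 5 failures: the limit is 8, so all is well
print(over_threshold(5, 20, 40))   # False

# scale down to 10 workers without resetting: the limit drops to 4,
# and the same 5 failures are suddenly over it
print(over_threshold(5, 10, 40))   # True

# resetting the counter on scale-down avoids the spurious trip
print(over_threshold(0, 10, 40))   # False
```

This is why resetting on scale-down is the simplest safe policy: the old failure count was accumulated against a larger denominator and is meaningless against the new one.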