  1. Slider
  2. SLIDER-69 Uber JIRA: Slider apps to withstand failures
  3. SLIDER-203

Implement scalable failure threshold based on percentage of instances failing over a time period

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: Slider 0.40
    • Fix Version/s: None
    • Component/s: appmaster, test
    • Labels: None
    • Sprint: Slider August #1, Slider August #2, Slider September #1

    Description

      SLIDER-77 proposed weighted moving averages for failures. This has some flaws:

      1. it's hard to understand and configure
      2. different cluster sizes need different default values
      3. if you flex a cluster, the threshold may become inappropriate

      I propose something more tangible, and closer to how physical nodes are tracked: the percentage of instances failing over a time period.

      For example, we could define an HBase cluster as functional while failures stay within these limits:
      200% of masters failing per day (for two masters, that's 4 failures)
      80% of region servers failing per day (for 20 region servers, that's 16 failures)

      Every day the counter could be reset.
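
      The arithmetic above could be sketched roughly as follows. This is only an illustration, not Slider code: the names `failure_limit` and `RoleFailureWindow` are hypothetical, the percentage is assumed to apply to the role's desired instance count, and a role is assumed to be "over the limit" only once its failure count exceeds the computed limit.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


def failure_limit(instance_count: int, percent: int) -> int:
    """Failures tolerated per window: a percentage of the role's instance count."""
    return instance_count * percent // 100


@dataclass
class RoleFailureWindow:
    """Hypothetical per-role failure counter that resets once per window."""
    instance_count: int
    percent: int
    window_seconds: int = SECONDS_PER_DAY
    window_start: float = 0.0
    failures: int = 0

    def record_failure(self, now: float) -> bool:
        """Count one failure; return True if the role is now over its limit."""
        # Start a fresh window (and reset the counter) once the old one has elapsed.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.failures = 0
        self.failures += 1
        # Assumption: the limit itself is still tolerated; only exceeding it trips.
        return self.failures > failure_limit(self.instance_count, self.percent)


masters = RoleFailureWindow(instance_count=2, percent=200)
region_servers = RoleFailureWindow(instance_count=20, percent=80)
```

      With these numbers, `failure_limit(2, 200)` gives 4 and `failure_limit(20, 80)` gives 16, matching the example figures.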

      Flexing complicates the equation: it may be simplest just to reset the counters, at least when scaling down. Otherwise, a 20-worker cluster with a 40% threshold (a limit of 8 failures) and a failure count of 5 would be fine, but scale it down to 10 nodes (a limit of 4) and the failure count is immediately over the limit.
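
      A minimal sketch of the flexing problem and the proposed reset, under the same assumptions as the description's example (a 40% threshold and 5 recorded failures). The helper names `failure_limit` and `failures_after_flex` are hypothetical, and resetting only on scale-down is one possible policy rather than a settled design.

```python
def failure_limit(instance_count: int, percent: int) -> int:
    """Failures tolerated per window: a percentage of the role's instance count."""
    return instance_count * percent // 100


def failures_after_flex(failures: int, old_count: int, new_count: int) -> int:
    """Failure count to carry over when a role is flexed from old_count to new_count."""
    # Assumed policy: reset when scaling down, keep the count otherwise.
    return 0 if new_count < old_count else failures


failures = 5
percent = 40

# Before flexing: 5 failures against a limit of 8 for 20 workers.
ok_before = failures <= failure_limit(20, percent)

# Scaling down to 10 workers without a reset: 5 failures against a limit of 4.
over_without_reset = failures > failure_limit(10, percent)

# Scaling down with the reset: the counter starts again from zero.
carried = failures_after_flex(failures, old_count=20, new_count=10)
ok_with_reset = carried <= failure_limit(10, percent)
```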

          People

            Assignee: Unassigned
            Reporter: Steve Loughran (stevel@apache.org)

              Agile

                Completed Sprints:
                Slider August #1 ended 21/Aug/14
                Slider August #2 ended 04/Sep/14
                Slider September #1 ended 02/Oct/14