Bookkeeper / BOOKKEEPER-889

BookKeeper client should try not to use bookies with errors/timeouts when forming a new ensemble

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3.2
    • Fix Version/s: 4.4.0
    • Component/s: bookkeeper-client
    • Labels: None

      Description

      Due to various issues (slow disks, network problems, bugs, etc.), a bookie can be slow or unresponsive for extended periods of time. During this time, read/write operations against it fail or time out, and each affected ledger starts a new segment with a new ensemble that replaces this bookie. New ledgers, however, might still pick this bookie, and if several bookies are faulty, the replacement may itself be another faulty bookie.
      The BK client should keep failure-rate stats for all bookies and "quarantine" failing bookies for a certain amount of time. Once a bookie is quarantined, it will not be picked when forming a new ensemble unless no other "healthy" bookies are available.

      Solution:
      Keep a per-bookie error counter in the bookie client pool, periodically check the number of errors within a given time span, and mark bookies with too many errors as "quarantined" in the BookieWatcher.
      In the BookieWatcher, first try to form the ensemble with the quarantined bookies excluded; if that fails, fall back to an empty exclusion list.
      Bookies are also removed from the quarantined list after a configurable period of time.
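
      The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the class QuarantineSketch, its method names, the string bookie identifiers, and the threshold and duration parameters are all hypothetical. It also assumes the caller supplies the current time and that an ensemble is simply the first N eligible bookies, whereas a real client would delegate that choice to its placement policy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of error counting, quarantine, and ensemble fallback. */
class QuarantineSketch {
    private final int errorThreshold;     // assumed: errors per check interval that trigger quarantine
    private final long quarantineMillis;  // assumed: configurable quarantine duration
    private final Map<String, Integer> errorCounts = new HashMap<>();
    private final Map<String, Long> quarantinedUntil = new HashMap<>();

    QuarantineSketch(int errorThreshold, long quarantineMillis) {
        this.errorThreshold = errorThreshold;
        this.quarantineMillis = quarantineMillis;
    }

    /** Record one failed or timed-out operation against a bookie. */
    synchronized void recordError(String bookie) {
        errorCounts.merge(bookie, 1, Integer::sum);
    }

    /** Periodic check: quarantine bookies at or above the threshold, then reset the counters. */
    synchronized void checkErrors(long nowMillis) {
        for (Map.Entry<String, Integer> e : errorCounts.entrySet()) {
            if (e.getValue() >= errorThreshold) {
                quarantinedUntil.put(e.getKey(), nowMillis + quarantineMillis);
            }
        }
        errorCounts.clear();
    }

    /** Drop expired entries and return the bookies that are still quarantined. */
    synchronized Set<String> quarantined(long nowMillis) {
        quarantinedUntil.values().removeIf(until -> until <= nowMillis);
        return new HashSet<>(quarantinedUntil.keySet());
    }

    /** Prefer healthy bookies; fall back to an empty exclusion list if too few remain. */
    synchronized List<String> newEnsemble(List<String> available, int ensembleSize, long nowMillis) {
        Set<String> excluded = quarantined(nowMillis);
        List<String> healthy = new ArrayList<>();
        for (String bookie : available) {
            if (!excluded.contains(bookie)) {
                healthy.add(bookie);
            }
        }
        List<String> pool = healthy.size() >= ensembleSize ? healthy : available;
        if (pool.size() < ensembleSize) {
            throw new IllegalStateException("not enough bookies for ensemble");
        }
        return new ArrayList<>(pool.subList(0, ensembleSize));
    }
}
```

      In this sketch the fallback retries with every available bookie instead of failing, which matches the "empty exclusion list" behaviour described above. Resetting the counters on every check is one simple way to bound the time span over which errors are counted.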

            People

            • Assignee: Siddharth Sunil Boobna (sboobna)
            • Reporter: Siddharth Sunil Boobna (sboobna)
            • Votes: 0
            • Watchers: 5

                Time Tracking

                • Original Estimate: 48h
                • Remaining Estimate: 48h
                • Time Spent: Not Specified