Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Fixed
-
4.3.2
-
None
Description
Due to various issues (slow disks, network issues, bugs, etc), the bookkeeper can be slow or unresponsive for extended period of times. During this time, r/w operations will fail/timeout and ledgers will create a new segment and form a new ensemble replacing this bookie. For new ledgers, it might still pick up this bookie or we can replace this bookie with another faulty bookie if we have multiple faulty bookies.
The BK client should keep stats about these failure rates for all the bookies and it should "quarantine" failing bookies for a certain amount of time. Once a bookie is quarantined, it will not be picked up in forming a new ensemble, unless no other "healthy" bookies are available.
Solution:
Keep a counter of errors in the bookie client pool and periodically check for number of errors in a given time span and mark these bookies as "quarantined" in the BookieWatcher.
In the BookieWatcher, try to create an ensemble list excluding the quarantined bookies and if that fails, fall back to an empty exclusion list.
We will also remove the bookies from the quarantined list after a configurable period of time.