Description
Failure Detector Module
Possible Mentor
Henry Robinson (henry at apache dot org)
Requirements
Java, some distributed systems knowledge, comfort implementing distributed systems protocols
Description
ZooKeeper servers detect the failure of other servers and clients by counting the number of 'ticks' for which they receive no heartbeat from the other machine. This is the 'timeout' method of failure detection and works very well; however, it may be too aggressive and is not easily tuned for more unusual ZooKeeper installations (such as a wide-area network, or even a mobile ad-hoc network).
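To make the 'timeout' method concrete, here is a minimal Java sketch of a tick-counting detector. It is not ZooKeeper's actual code: the class name `TickTimeoutDetector`, the `tickLimit` parameter, and the use of string peer ids are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of tick-based timeout detection; not ZooKeeper's real implementation. */
class TickTimeoutDetector {
    // Number of consecutive silent ticks after which a peer is suspected (assumed parameter).
    private final int tickLimit;
    // Ticks elapsed since the last heartbeat, per known peer.
    private final Map<String, Integer> silentTicks = new HashMap<>();

    TickTimeoutDetector(int tickLimit) {
        if (tickLimit <= 0) {
            throw new IllegalArgumentException("tickLimit must be positive");
        }
        this.tickLimit = tickLimit;
    }

    /** Called when a heartbeat arrives from a peer: registers it and resets its counter. */
    void heartbeat(String peer) {
        silentTicks.put(peer, 0);
    }

    /** Called once per tick: every known peer has now been silent for one more tick. */
    void tick() {
        silentTicks.replaceAll((peer, ticks) -> ticks + 1);
    }

    /** A known peer is suspected once it has been silent for at least tickLimit ticks. */
    boolean isSuspected(String peer) {
        Integer ticks = silentTicks.get(peer);
        return ticks != null && ticks >= tickLimit;
    }
}
```

The only tuning knob in this sketch is `tickLimit`, which illustrates why the approach can be hard to adapt to networks with highly variable latency: a single fixed threshold has to cover every peer.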
This project would abstract the notion of failure detection into a dedicated Java module, and implement several failure detectors in order to compare their suitability for ZooKeeper. For example, Apache Cassandra uses a phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf), which is much more tunable and has some very interesting properties. This is a great project if you are interested in distributed algorithms, or want to help refactor some of ZooKeeper's internal code.
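As a rough idea of what such a module might look like, the sketch below defines a hypothetical `FailureDetector` interface and one phi-accrual style implementation. Everything here is an assumption for illustration: the interface, the class and method names, and the parameters are not taken from ZooKeeper or Cassandra. For brevity it models heartbeat inter-arrival times with an exponential distribution instead of the normal distribution used in the linked paper, so phi = elapsed / (mean interval × ln 10).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical abstraction; no such interface exists in ZooKeeper today. */
interface FailureDetector {
    void heartbeat(long nowMillis);
    boolean isSuspected(long nowMillis);
}

/** Simplified phi-accrual detector using an exponential inter-arrival model (an assumption). */
class PhiAccrualDetector implements FailureDetector {
    private final double threshold;   // suspicion level above which the peer is suspected
    private final int windowSize;     // number of recent intervals used to estimate the mean
    private final Deque<Long> intervals = new ArrayDeque<>();
    private long intervalSum = 0;
    private long lastHeartbeat = -1;  // -1 means no heartbeat seen yet

    PhiAccrualDetector(double threshold, int windowSize) {
        if (threshold <= 0 || windowSize <= 0) {
            throw new IllegalArgumentException("threshold and windowSize must be positive");
        }
        this.threshold = threshold;
        this.windowSize = windowSize;
    }

    /** Records a heartbeat and updates the sliding window of inter-arrival intervals. */
    @Override
    public void heartbeat(long nowMillis) {
        if (lastHeartbeat >= 0) {
            long interval = nowMillis - lastHeartbeat;
            intervals.addLast(interval);
            intervalSum += interval;
            if (intervals.size() > windowSize) {
                intervalSum -= intervals.removeFirst();
            }
        }
        lastHeartbeat = nowMillis;
    }

    /** phi = -log10(P(next heartbeat arrives later than now)) under the exponential model. */
    double phi(long nowMillis) {
        if (intervals.isEmpty()) {
            return 0.0; // not enough history to suspect anything
        }
        // Clamp the mean to 1 ms to avoid dividing by zero on back-to-back heartbeats.
        double mean = Math.max((double) intervalSum / intervals.size(), 1.0);
        double elapsed = nowMillis - lastHeartbeat;
        return elapsed / (mean * Math.log(10.0));
    }

    @Override
    public boolean isSuspected(long nowMillis) {
        return phi(nowMillis) > threshold;
    }
}
```

Unlike a fixed tick count, the suspicion level here grows continuously with silence and scales with the observed heartbeat rate, so the same `threshold` can behave sensibly on both a fast LAN and a slow wide-area link.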
Attachments
Issue Links
- blocks
  - ZOOKEEPER-823 update ZooKeeper java client to optionally use Netty for connections (Closed)
- relates to
  - HDFS-779 Automatic move to safe-mode when cluster size drops (Open)
  - HBASE-5843 Improve HBase MTTR - Mean Time To Recover (Closed)