>Sorry for the late comments:
>For a master/slave HA solution, two main problems are:
>1. Mechanism that determines a master in a cluster during startup and failover.
The JGroups library (manual: http://www.jgroups.org/javagroupsnew/docs/manual/pdf/manual.pdf ) automatically handles the election of a group coordinator. The node elected as group coordinator is also the master of the cluster. If it fails, a new group coordinator (and, consequently, a new cluster master) is elected.
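To illustrate the election rule, here is a minimal sketch in plain Java. It is not the patch code and uses no JGroups classes. It assumes the usual JGroups convention that the coordinator is the first member of the current view; the class name, method names and node names (jt-1, jt-2, jt-3) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only: a "view" is an ordered list of member names.
class CoordinatorSketch {

    // Assumption: the first member of the view is the coordinator (and cluster master).
    static String coordinator(List<String> view) {
        if (view.isEmpty()) {
            throw new IllegalStateException("empty view: no coordinator");
        }
        return view.get(0);
    }

    // Builds the next view after a member fails; survivors keep their relative order.
    static List<String> viewAfterFailure(List<String> view, String failed) {
        List<String> next = new ArrayList<>(view);
        next.remove(failed);
        return next;
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("jt-1", "jt-2", "jt-3");
        System.out.println("master: " + coordinator(view));

        // The current master fails; the next member in the view takes over.
        List<String> next = viewAfterFailure(view, "jt-1");
        System.out.println("master after failover: " + coordinator(next));
    }
}
```

In the real code the view would be delivered by the JGroups channel rather than built by hand, so every node derives the same master from the same view without an extra election protocol.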
>Handling loss of quorum,
The shared state resides entirely on HDFS (see issues HADOOP-1876 and HADOOP-3245), so for now there is no shared soft-state between nodes. However, the facilities for managing shared state are present and can be used in a future update.
>split-brain and fencing in case of split-brain.
The JGroups library tries to automatically handle network partitions and merging, but given that:
- There is no shared soft-state
- There is only one access point in the whole Hadoop cluster to the HDFS (the NameNode)
the network partition problem should not be an issue (only one partition at a time can access the HDFS). In future versions a more elegant way of dealing with network partitions should be added.
>It also requires comprehensive management tools for configuring, managing and monitoring cluster.
I am now adding JMX support. After the initial testing phase I will post the results and an updated version.
>2. Sharing state information between master and slave, so that a slave node can take over as master.
>Currently the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While the sharing of information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add more information on how the design handles these issues and some notes on how administrator uses this functionality to manage the cluster.
I hope this answers your questions. If you need more detail, feel free to contact me.
>Also analysis of the impact of job tracker performance due to the introduction of this feature needs to be done.
I am about to begin the testing phase; results will follow.