LarsH and I were talking about another approach to region server replication hlog queue failover yesterday, and I wanted to get some feedback on it.
Currently when handling a nodeDeleted event, the live region servers only attempt to failover the node corresponding to the event. The nodeDeleted event is only fired once, so to protect ourselves from orphaning the znode state of the failed region server in a cascading failure scenario, we move the state to the znode of the region server that is performing the failover. Since we don't have an atomic way to move this state, it gets a little tricky.
Instead of this approach, we could have the region server attempt to failover all failed region servers every time it receives a nodeDeleted event. For example, the nodeDeleted method could go something like this: refresh the region server list, get the list of region servers in the replication znode structure, attempt to lock and failover any region server listed in the replication znode structure that is not currently alive.
The same race to lock the region server znode will occur. Only one region server will get the lock and handle the failover. Each NodeFailoverWorker that gets started could simply operate on the original dead region server znode structure. If the region server fails while preforming the failover, then both the region servers will get picked up by another region server when the nodeDeleted event for the second failure is fired. Locks would have to be ephemeral nodes to prevent permanent locking of a region server when the failover region server dies. Once the replication hlog queues are successfully replicated, the znode for the dead region server can be deleted.
On the cons side, this approach makes the handling of a nodeDeleted event a heavier weight operation.
On the pros side, it makes the failover code much simpler because we no longer have to worry about moving the region server znode state around in zookeeper.
Thoughts always appreciated.