Thanks again Ivan for detailed review. Could you please give few more info on the following:
In Auditor, don't call getChildren from process(). Instead all it just after the take() in main loop. In general, you shouldn't call any blocking methods from the zk event handler thread
Oh yeah, Thanks for bringing this good point. Correct, this will delay other watch notifications also.
Logic change in ZkLedgerUnderreplicationManager is wrong [previous was wrong also]. the break should be a return.
If I understand correctly, on NodeExistsException we are trying to append the missingReplica to the underreplicated ledger so that will notifies about one more down bookie which contains the ledger copy.
If we return simply, then the setData() method will not be called and there is a chance of missing the info about second replica.
For Ex: L00001 ensemble BK1, BK2, BK3.
Say BK1 fails initially, then will markUnderreplicated ledger as L000001(BK1 as the data).
Now again BK2 has failed, then while creating will get NEE, so we will append BK2 also like: L000001(BK1 BK2).
I think "break; statement" is making sense and after that the duplicate entry addition should be removed as per my latest patch.
Am I missing anything?
There should be a main method in AutoRecoveryManager.
Yeah I'll add main method also. But what about retaining start() and stop() method as public. In future this will allow others(any external entity) to manage the recovery process easily ?
I think AutoRecoveryManager should be responsible for running recoveryworker as well.
Ofcourse, I'll integrate RW also be initialized as part of ARM.
auditorElector should never be null, unless initialization fails. If initialization fails, start and stop should never be run. Perhaps we should us guava service  here as I also suggested to Uma for
I just added null check, since start() and stop() methods are public.
I'll rework on other points.