Thanks Flavio for the comments. I hope the following will give more idea.
Just to confirm, elements in the myId list have to be deleted manually, yes? If a node is decommissioned, then I suppose we will want to delete from the list.
Yes, since these are persistent nodes need to delete manually or will be able to think automatic deletion after the full re-replication of that failed/decommissioned bookie. I feel, here again chances of race conditions to be avoided, like garbage collector(GC: for deleting inactive id) decides to delete the node, meanwhile the failed bookie rejoins and acquire his 'MyId'. Now, if the GC is going ahead with 'MyId' deletion, will cause inconsistencies.
In step 2 of the monitor (managing the chain), it says that the auditor notifies some other bookie that it needs to handle re-replication. How exactly does this notification happen? Bookies currently don't talk to each other directly. We would need to do this communication through zookeeper if we want to keep bookies decoupled.
Oh, seems not explained clearly in the docs. Yes, I'm trying to use ZK based communication.
As mentioned in the doc, Bookie will first create/acquire MyId and then he will be adding child watchers to his MyId. So, when the Auditor identifies failed bookie and adding its Id into the observer, in turn will notifies the observer bookie.
Please see the following example:
Monitor chain is 01 <- 02 <- 03 <- 04 <- 05 <- 06 <- 01. (<- symbol implies re-replica observer)
Say, 3, 5, 6 gone down and 1, 2, 4 are alive.
Auditor will be adding the failed bookie's Id to the observer bookie
4/3, 6/5, 1/6.
On child notification, 4 will start replication of 3.
On child notification 1 will start re-replication of 6 and 1 will also need to check is there any children present under 6 and if exists he will take care that node also. Here the chances of duplicates like, when 1 starting the action of 5, now immediately 6 has started and he will also start acting. Since there is no central co-ordination exists.
I'm thinking that, a bookie will start parsing and re-replication on:
- every child notification,
- also, on bookie startup, he will check any node exists under MyId for re-replication.
The description says L00001_ip:port, but it is not clear if ip:port corresponds to the lock holder, in which case the lock znode wouldn't be unique
How re-replication works?
Re-replication cycle is shown below:
- All bookies will be watching children of '/ledgers/underreplicas'
- On child watch notification, read the children; Here the child znode name format is LedgerName_ip:port.
Here the ip:port is the failed bookie which has the 'LedgerName' entries.
- All the live bookies will try creating ephemeral znode 'lock'(zk distributed locking) under the znode 'LedgerName_ip:port'
- Whoever succeeds will start re-replication, say BK_X
- All the others will start watching on '/ledgers/underreplicas/LedgerName_ip:port/lock'
- When the BK_X finished re-replication of LedgerName_ip:port, he will update the ledger metadata. Then, delete the LedgerName_ip:port if re-replication is fully over. Otw (assume not able to fully re-replicate, please refer the example in the doc), he will remove the 'lock' under '/ledgers/underreplicas/LedgerName_ip:port' and others will get the notification.
- On lock release/delete notification, again others will compete each other and this cycle continues till the complete re-replication.
Assume 3's IP:PORT is 10.18.40.13:2167
Take the above example, consider 3 has failed and 4 identifies the ledger 'L00001' has entries in 3.
4 will create a znode like : '/ledgers/underreplicas/L00001_10.18.40.13:2167'
Since all the live bookies are watching children of '/ledgers/underreplicas', every one will get the notification and acquiring lock for doing the re-replication. Only one bookie (say BK5) will be able to create ephemeral znode 'lock' under '/ledgers/underreplicas/L00001_10.18.40.13:2167', and tries re-replication. And all other bookies will add '/ledgers/underreplicas/L00001_10.18.40.13:2167/lock' watching to see the status of re-replication. Say after first round, if some more entires are still to be re-replicated, then the first replica(BK5) will update the ledger metadata(if any) and release the lock. Again this cycle continues till re-replication is fully over.