Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2.0
    • Labels: None

      Description

      The idea is to build an automatic mechanism to detect bookie failures and to set up bookie failure notifications that start the re-replication process.

      There are multiple approaches to detecting bookie failures. Please refer to the documents attached to BOOKKEEPER-237.

      1. BOOKKEEPER-272.Auditor.patch
        29 kB
        Rakesh R
      2. BOOKKEEPER-272.Auditor.1.patch
        43 kB
        Rakesh R
      3. BOOKKEEPER-272.6.patch
        45 kB
        Rakesh R
      4. BOOKKEEPER-272.5.patch
        47 kB
        Rakesh R
      5. BOOKKEEPER-272.4.patch
        50 kB
        Rakesh R
      6. BOOKKEEPER-272.3.patch
        51 kB
        Rakesh R
      7. BOOKKEEPER-272.2.patch
        78 kB
        Rakesh R
      8. BOOKKEEPER-272.1.patch
        36 kB
        Rakesh R

          Activity

          Flavio Junqueira added a comment -

          Hi Rakesh, One clarification. In the design document of BOOKKEEPER-237, we say: Auditor would be forming the recovery chain based on the myIds added to the ZooKeeper under the zNode /ledgers/bookie/myId. If the chain is logical and formed through zookeeper, then I don't understand why the auditor needs to be involved in the chain formation. In my understanding of the proposal, the role of the auditor is essentially to assign work (replication) to bookies. What am I missing?

          Rakesh R added a comment -

          Hi Flavio,

          Thanks for your interest and comments; I was actually prototyping the 'CircularChain' algorithm without a central auditor. Sorry for the late reply.

          But I faced a problem handling the following situation.

          1 <- 2 <- 3 <- 4 <- 5 <- 6 <- 1

          Consider the scenario where 2, 3 and 4 go down. 5 gets the notification, marks 4 as failed and moves on to 4's neighbour 3. If, while 5 is checking 3's status, 3 rejoins and is alive, 5 stops searching. If 3 then immediately goes down again, the chance of missing 2's failure is high.

          One solution that comes to mind is that when 5 finds 3 alive, it adds a watcher on 3. But that raises another problem: suppose 4 has rejoined and also tries to add a watcher on 3. If we analyse the chain, the previous watcher added by 5 will still be on 3 (ZK has no way to unregister a watcher), and the amount of watcher re-formation gradually increases.

          I'm a bit worried about the chances of missing watchers and about isolation/race conditions with this approach.

          If there is an auditor, it alone watches everyone and reports failed bookies. I feel there is no chance of missing watchers or of isolation/race conditions in this approach; the only overhead is the auditor election.

          The central node publishes the failed bookie (through a znode); after receiving the notification, any bookie can acquire the lock and start re-replication, and the cycle continues until re-replication is complete.

          I'd like to know your opinion on handling bookie failures through a central entity.

          Thanks,
          Rakesh

          Rakesh R added a comment -

          Attached is a rough draft for reviewing the approach. Please have a look; your comments/thoughts would be appreciated.

          Rakesh R added a comment -

          Modified the ServerConfiguration file and fixed compilation issues.

          Rakesh R added a comment -

          I have uploaded the latest patch, with a few test cases that demonstrate the algorithm.

          How the bookie failure detection algorithm works
          This scheme is based on a distributed approach: apart from the single auditor bookie that detects bookie failures, all other logic is completely distributed. I think the auditor makes detection simpler, since we already have the 'available' znode in ZooKeeper for learning about bookie failures.

          The logical steps are as follows:

          1. Generate a bookieId:
            Each bookie creates a unique id under the 'bookieids' path; this is a persistent node in ZooKeeper.
            Each bookie keeps its IP:PORT info as the data of its respective 'bookieid' znode.
            For example: 0001 is the bookieId and 10.18.40.13:2181 is the bookie's address, so the 0001 znode contains 10.18.40.13:2181 as data.
          2. Build per-bookie ledger mappings:
            These help discover a bookie's content very quickly and avoid parsing all the ledgers again and again to find a failed bookie's ledgers.
          3. How to build the per-bookie ledger mappings:
            When creating/re-forming the ensemble, the 'ledgerid' is put under the respective 'bookieid'. During ensemble formation, the metadata gives us the bookie info; we use this to add the 'ledgerid' only under the respective 'bookieid'.
            For example:
            0001 is one bookieId; say it contains children znodes for its ledgers, like 0001/L_001, L_005 etc.
            0002 is another; say it contains children znodes for its ledgers, like 0002/L_005 etc.
          4. Elect an auditor bookie for the entire bookie cluster: this keeps things simple and minimizes duplicated effort.
            Only one auditor monitors the available bookies; its responsibility is to watch the bookies' 'available' znode. When it detects a bookie failure through the 'NodeChildrenChanged' watcher, it publishes it under the 'failedbookies' path in ZooKeeper. Publishing is done by creating the failed bookie's unique 'bookieid' persistent znode under the 'failedbookies' path. All the other bookies watch the 'failedbookies' znode.
          5. Start re-replication (see the sketch below):
            Upon receiving the 'failedbookies' notification, all the bookies compete with each other to acquire the lock (a ZK distributed lock, i.e. an ephemeral znode). Whoever acquires it starts re-replication; all the others watch this lock to learn the re-replication status. When creating the lock, the replicating bookie adds its IP:PORT info to it. After finishing the re-replication it deletes the lock, so that others can take over, and this cycle continues to the end.
            For example: 0002/L_005/lock
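
          As a rough sketch of steps 4 and 5, the publish and lock operations could look like the following. This is illustrative only; the znode paths and helper names here are assumptions, not the actual patch code:

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooDefs.Ids;
          import org.apache.zookeeper.ZooKeeper;

          public class ReplicationSketch {
              // Step 4, auditor side: publish a failed bookie by creating a
              // persistent znode named after its bookieId.
              static void publishFailedBookie(ZooKeeper zk, String bookieId)
                      throws KeeperException, InterruptedException {
                  zk.create("/ledgers/failedbookies/" + bookieId, new byte[0],
                            Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
              }

              // Step 5, worker side: compete for the per-ledger lock with an
              // ephemeral znode; returns true if we won and should re-replicate.
              static boolean tryAcquireLock(ZooKeeper zk, String ledgerPath, String myAddress)
                      throws KeeperException, InterruptedException {
                  try {
                      zk.create(ledgerPath + "/lock", myAddress.getBytes(),
                                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                      return true;  // we hold the lock; start re-replication
                  } catch (KeeperException.NodeExistsException nee) {
                      return false; // someone else is already re-replicating this ledger
                  }
              }
          }

          Because the lock znode is ephemeral, it disappears automatically if the replicating bookie dies midway, so another bookie can take over.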

          I have described how re-replication works in a comment on BOOKKEEPER-237. Please go through the link:
          https://issues.apache.org/jira/browse/BOOKKEEPER-237?focusedCommentId=13281470&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13281470

          Responsible classes:

          • ServerConfiguration : ZK path configurations
          • LedgerCreateOp, LedgerHandle : forming the bookie-ledger mappings
          • Bookie : starting the bookie chain
          • BKLedgerMapper : updating the bookie-ledger mapping
          • BookieIdGenerator : generating a unique bookieId
          • BookieChain : forming the detection chain
          • AuditorBookieChain : auditor-based by default; later there could be a CircularBookieChain etc.
          • AuditorElector : performs the auditor election
          • Auditor : watches the bookies
          • BookieObserver : listens for failed-bookie notifications

          Tests:

          • BookieLedgerMetadataTest
          • AuditorBookieTest
          • BookieDetectionTest
          • BookieIdGenTest

          -Rakesh

          Rakesh R added a comment -

          Why BKLedgerMapping: parsing the ledgers to identify a failed bookie's ledgers may take a long time; consider the case where hundreds of thousands of ledgers are created by NN-HA edit logs. So I thought of having a reverse mapping (bookie vs. its ledgers). Since this mapping is constant until a re-replication occurs, I made it a persistent node under 'bookieid'.

          Why Auditor: a single instance for the entire bk cluster. After analyzing multiple approaches, I found this avoids many problems, like the cases I mentioned earlier in this JIRA. Please see the link:
          https://issues.apache.org/jira/browse/BOOKKEEPER-272?focusedCommentId=13284884&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13284884

          Rakesh R added a comment -

          I have edited the JIRA subject line in order to reduce the scope of this task.

          Rakesh R added a comment -

          Thanks Ivan for the comments. I have uploaded the latest patch.

          Auditor bookie (a single entity in the bk cluster) - watches the available bookies and sets up the bookie failure notifications.

          Please have a look.

          Ivan Kelly added a comment -

          The new patch is a lot clearer. Comments follow:

          • The changes to ServerConfiguration are unnecessary. The paths aren't actually configurable, so they shouldn't be in the configuration; I don't think they should be configurable at all. I also think that all auditor data should be under the auditor znode. So eventually we'll have these paths:
            /ledgers/auditor/election
            /ledgers/auditor/failedbookies
            /ledgers/auditor/underreplicated
            /ledgers/auditor/
          • The election watcher does an election on the expired event. If you get this event, the bookie will shut down (see Bookie.java).
          • For the moment, and maybe for ever, whether the auditor runs should be configurable. During the course of 4.2 we're going to be building up this stuff incrementally. We should only enable it for tests which specifically use it (to start with).
          • getMyInfo() should be called getIdentifier(). It would be great if you could modify registerBookie() to take a String rather than a port, so that the identifier for the bookie is only ever generated at one point.
          • Collections.sort() on the children isn't fully correct: V_1 and V_10 will sort before V_2, since it sorts alphabetically. Better to define a comparator which splits on the '_', parses the number and compares that (see the comparator sketch after these comments).
          • Checking if you've won the election should then just be a matter of children.get(0).equals(myVote), once children is sorted correctly.
          • Check for specific exceptions. findbugs warns on catch (Exception e) because it also catches RuntimeExceptions. IOException is the wrong exception to use for this stuff too; use custom exceptions, defined under BookieException.
          • Instead of watching a specific node, watch for the ChildrenChanged event on the zkAuditorElectPath. Even the current winner of the election can watch this. When triggered, doElection should be fine, though you should shut down the Auditor thread, if it is running, when you lose the election.
          • I don't understand the point of writing auditor data to the auditor znode. It doesn't seem to do anything.
          • The auditor code seems to check for failed bookies and publish them. We should skip a step here though: instead of publishing the bookies, publish the ledgers which are on that bookie. In fact, it's much more efficient if it runs like this. If we only publish the failed bookies, each worker picking up a bookie to recover has to read all the ledgers to find which ones to recover; if the auditor maintains a bookie -> ledger index, only one node needs to read all the ledgers.
          • The run loop method of Auditor isn't a loop. I think the run loop should look like this:
          public void run() {
             Set<String> bookies = getAvailableBookies();
             while (true) {
                 waitForNotification();
                 Set<String> newBookies = getAvailableBookies();
                 // copy the old snapshot so removeAll doesn't mutate it
                 Set<String> lostBookies = new HashSet<String>(bookies);
                 lostBookies.removeAll(newBookies);
                 bookies = newBookies;

                 if (lostBookies.isEmpty()) {
                     continue; // nothing was lost; wait for the next notification
                 }

                 Map<String, List<Long>> bookie2ledgersMap = generateBookie2LedgersIndex();
                 Set<Long> suspectedLedgers = new HashSet<Long>();
                 for (String b : lostBookies) {
                     suspectedLedgers.addAll(bookie2ledgersMap.get(b));
                 }
                 publishSuspectedLedgers(suspectedLedgers);
             }
          }
          

          getAvailableBookies() should put a watch on the available znode, and fire a notification when triggered.
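
          A comparator along the lines suggested above could look like this. It is a sketch under the assumption that votes are named V_<sequencenumber>, not the committed code:

          import java.util.Comparator;

          // Orders votes such as V_1, V_2, ..., V_10 by their numeric suffix
          // instead of alphabetically (which would put V_10 before V_2).
          public class VoteComparator implements Comparator<String> {
              @Override
              public int compare(String v1, String v2) {
                  long s1 = Long.parseLong(v1.substring(v1.lastIndexOf('_') + 1));
                  long s2 = Long.parseLong(v2.substring(v2.lastIndexOf('_') + 1));
                  return s1 < s2 ? -1 : (s1 == s2 ? 0 : 1);
              }
          }

          // Usage: Collections.sort(children, new VoteComparator());
          // then children.get(0).equals(myVote) tells you whether you won.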

          Rakesh R added a comment -

          Thanks again, Ivan, for the detailed comments. I'd like to know more about the following:

          Instead of watching a specific node, watch for the ChildrenChanged event on the zkAuditorElectPath. Even the current winner of the election can watch this. When triggered, doElection should be fine, though you should shut down the Auditor thread, if it is running, when you lose the election.

          I have used the predecessor-watching approach, which is used to avoid the herd effect in ZooKeeper leader election. The rule is that each bookie watches its predecessor bookie based on the ephemeral sequence id; at any point in time, only the bookie with the least ephemeral znode gets the chance to become the auditor. So I thought it would be more efficient to watch my predecessor bookie. How does that sound?
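
          To illustrate, a minimal sketch of the predecessor-watching election, reusing the VoteComparator from the comparator sketch above. The path names and becomeAuditor() are assumptions for illustration, not the patch code:

          import java.util.Collections;
          import java.util.List;
          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          public class PredecessorWatchSketch {
              // Each candidate watches only the vote just before its own, so a
              // failure wakes exactly one candidate instead of the whole herd.
              void checkElection(final ZooKeeper zk, final String electionPath,
                                 String myVote) throws Exception {
                  List<String> children = zk.getChildren(electionPath, false);
                  Collections.sort(children, new VoteComparator()); // numeric sort
                  int myIndex = children.indexOf(myVote);
                  if (myIndex == 0) {
                      becomeAuditor(); // lowest sequence number wins
                      return;
                  }
                  String predecessor = children.get(myIndex - 1);
                  // exists() sets a one-shot watch on the predecessor's znode
                  zk.exists(electionPath + "/" + predecessor, new Watcher() {
                      public void process(WatchedEvent event) {
                          // predecessor changed; re-run the election check
                          // (error handling omitted in this sketch)
                      }
                  });
              }

              void becomeAuditor() { /* start auditing (hypothetical) */ }
          }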

          I don't understand the point of writing auditor data to the auditor znode. It doesn't seem to do anything.

          I hope you mean that, after the auditor election, I'm writing the auditor's myInfo to the auditor election path (/ledgers/auditor/election). At present there is no logic behind it; I've kept it for debugging purposes only (using ZK znodes).
          This is the code in question:

          auditorElector.zkclient.setData(auditorElector.conf.getZkAuditorElectionPath(),
                  auditorElector.myInfo.getBytes(), -1);


          The run loop method of Auditor isn't a loop. I think the run loop should look like this:

          Yup, I'll remove the volatile flag and make the other modifications.
          Here I'd like to generate and publish the suspected ledgers before entering waitForNotification(). Consider the scenario where the auditor bookie (say BK1) has failed and a new auditor (BK2) comes up: it will not receive any notification of BK1's failure (from the available bookies path).

          Ivan Kelly added a comment -

          I have used the predecessor-watching approach, which is used to avoid the herd effect in ZooKeeper leader election. The rule is that each bookie watches its predecessor bookie based on the ephemeral sequence id; at any point in time, only the bookie with the least ephemeral znode gets the chance to become the auditor. So I thought it would be more efficient to watch my predecessor bookie. How does that sound?

          This is fine. I just thought that having everyone watch for children-changed would make for simpler code. I don't have a strong opinion on this though.

          I hope you mean that, after the auditor election, I'm writing the auditor's myInfo to the auditor election path (/ledgers/auditor/election). At present there is no logic behind it; I've kept it for debugging purposes only (using ZK znodes).

          Ok, that's fine. Before releasing I'd like to convert the data stored in the znode to a protobuf text serialization. I've already done a conversion of the ledger metadata but haven't generated a patch/JIRA yet. I've made these changes to make moving between bk versions easier. This is fine as it is for now though.

          Here I'd like to generate and publish the suspected ledgers before entering waitForNotification(). Consider the scenario where the auditor bookie (say BK1) has failed and a new auditor (BK2) comes up: it will not receive any notification of BK1's failure (from the available bookies path).

          Yup, that sounds good.

          Rakesh R added a comment -

          Hi,

          As part of this auditor logic, I'm planning to publish the suspected/underreplicated ledger metadata as Ledger_BookieIP:PORT.

          Assume we have three bookies BK1, BK2, BK3, which hold the ledgers L0001, L0002, L0003.

          Example:
          Say BK1 and BK2 have failed and two new bookies have started; then the auditor will publish:
          /ledgers/auditor/underreplicated/L0001_BK1
          /ledgers/auditor/underreplicated/L0002_BK1
          /ledgers/auditor/underreplicated/L0003_BK1
          /ledgers/auditor/underreplicated/L0001_BK2
          /ledgers/auditor/underreplicated/L0002_BK2
          /ledgers/auditor/underreplicated/L0003_BK2

          With this, I'm thinking the target re-replicators would be able to acquire a distributed lock on L0001_BK1 etc. and start re-replicating. Is this OK with you?

          Thanks,
          Rakesh

          Ivan Kelly added a comment -

          I don't think we need the bookie. As we need to run a check on the ledger to find which parts are underreplicated (since some segments may not include the failed bookie), we may as well just record the ledger id. Also, it'd be better to only have one worker fixing a single ledger to avoid conflicting writes when updating the ledger metadata.

          Rakesh R added a comment -

          I don't think we need the bookie

          Here I can see one race condition. Say the auditor is about to publish the failure of BK2 in L0001. Meanwhile, BK4 has finished re-replicating BK3's copy of L0001 and is about to delete the entry from /underreplicated. In this case the auditor silently continues on seeing that L0001 already exists, and the worker then deletes the L0001 entry, thinking there are no more failures.

          The solution I'm thinking of is to check the data version before doing the zk operation (similar to the logic we built in BKJM CurrentInProgress). I'm planning to keep the failed-bookie information as the data.
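
          A sketch of that version check, using ZooKeeper's conditional setData; the class, method and path names here are illustrative assumptions:

          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          public class VersionedUpdateSketch {
              // Read the znode, remember its version, and write back only if nothing
              // changed in between; BadVersionException means someone raced us.
              static void appendFailedBookie(ZooKeeper zk, String urLedgerPath, String failedBookie)
                      throws KeeperException, InterruptedException {
                  while (true) {
                      Stat stat = new Stat();
                      byte[] data = zk.getData(urLedgerPath, false, stat);
                      byte[] newData = (new String(data) + "," + failedBookie).getBytes();
                      try {
                          zk.setData(urLedgerPath, newData, stat.getVersion());
                          return; // wrote against the exact version we read
                      } catch (KeeperException.BadVersionException bve) {
                          // concurrent update (e.g. a worker deleting the entry); retry
                      }
                  }
              }
          }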

          As we need to run a check on the ledger to find which parts are underreplicated (since some segments may not include the failed bookie), we may as well just record the ledger id. Also, it'd be better to only have one worker fixing a single ledger to avoid conflicting writes when updating the ledger metadata.

          Yeah, I understand. I have one suggestion: the auditor knows the failed bookies and their ledgers anyway when publishing the underreplicated ledgers. Why don't we keep the failed bookie as data inside the underreplicated ledger znode? Then the worker (segment checker) only looks at this bookie and gets the corresponding index directly from the ZK ledger metadata.

          Ivan Kelly added a comment -

          I don't think we need the bookie

          Here I can see one race condition. Say the auditor is about to publish the failure of BK2 in L0001. Meanwhile, BK4 has finished re-replicating BK3's copy of L0001 and is about to delete the entry from /underreplicated. In this case the auditor silently continues on seeing that L0001 already exists, and the worker then deletes the L0001 entry, thinking there are no more failures.

          The solution I'm thinking of is to check the data version before doing the zk operation (similar to the logic we built in BKJM CurrentInProgress). I'm planning to keep the failed-bookie information as the data.

          Yes, I think in this case, when we see BK3's failure and L0001 already exists, we should bump the version number. We shouldn't really be changing any vital data in zookeeper without checking the version number anyhow. Hopefully this will be a very rare situation; in a 3e2q (ensemble of 3, quorum of 2) ledger, two machines dropping like this would probably mean data loss.


          Yeah, I understand. I have one suggestion: the auditor knows the failed bookies and their ledgers anyway when publishing the underreplicated ledgers. Why don't we keep the failed bookie as data inside the underreplicated ledger znode? Then the worker (segment checker) only looks at this bookie and gets the corresponding index directly from the ZK ledger metadata.

          I'm not sure what you mean here. Having the failed bookie stored in the data is useful for debugging purposes, but we should run a check on the ledger beforehand anyhow to determine what to recover. Are you trying to avoid another read of the ledger znode?

          Rakesh R added a comment -

          Oh, it seems I didn't explain myself clearly in my comment.

          I'm not sure what you mean here. Having the failed bookie stored in the data is useful for debugging purposes, but we should run a check on the ledger beforehand anyhow to determine what to recover. Are you trying to avoid another read of the ledger znode?

          My point is: rather than reading with the bookie client and finding the bad replicas, why don't we find the fragments using the ledger metadata present in zookeeper, since we already know the failed bookie?

          Ivan Kelly added a comment -

          I think we should probe in any case though. For example, if a bookie has restarted, its znode will disappear momentarily. All its ledgers will be marked for checking by the auditor, but they are still available once the bookie comes back up.

          Rakesh R added a comment -

          It was just a suggestion. I'll see whether there is any advantage to using the failed-bookies information in the detection logic of the BOOKKEEPER-247 JIRA.

          On the auditor side, the data-version logic is needed anyway to address the race condition mentioned in our comments above. So the auditor publishes the underreplicated ledgerIds, and the metadata looks like this:

          >> LedgerId, with the failed-bookie details as comma-separated data.

          Example:
          /ledgers/auditor/underreplicated/L0001 (data as -> Metadataversion, BK1, BK2)
          /ledgers/auditor/underreplicated/L0002 (data as -> Metadataversion, BK1, BK2)
          /ledgers/auditor/underreplicated/L0003 (data as -> Metadataversion, BK1, BK2)

          Rakesh R added a comment -

          @Ivan
          I've refactored as per the comments; please go through it.

          Also, I think there is a bit of crossover with the BOOKKEEPER-246 patch. I'll go through it.

          I just created an AutoRecoveryService layer which has #start, #stop and #isRunning APIs, so it can be plugged in as a service.

          For the moment, and maybe for ever, whether the auditor runs should be configurable

          I haven't included this in the patch, since there is still confusion over whether to treat it as a plugin service or not.

          Flavio Junqueira added a comment - edited

          For the moment, and maybe for ever, whether the auditor runs should be configurable

          I haven't included this in the patch, since there is still confusion over whether to treat it as a plugin service or not.

          I think that whether the auditor runs or not should be configurable. Is that a problem?

          Rakesh R added a comment -

          I think that whether the auditor runs or not should be configurable. Is that a problem?

          Hi Flavio, the following are the options for making this feature configurable:

          1. Initially I was thinking of providing a new configuration item, 'enableAutoRecovery=true', in the bk_server.conf file. By default the feature would be disabled.
          2. But after seeing the comments in BOOKKEEPER-304: should we treat it as a plugin process?
            https://issues.apache.org/jira/browse/BOOKKEEPER-304?focusedCommentId=13398408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13398408
            For this, we would need to expose a ServicePlugin interface, to be configured in bk_server.conf to enable the feature.

          I'm just confused about which one is more feasible.

          Ivan Kelly added a comment -

          For the moment, I think we should keep it in the bookie process and enabled by default. We shouldn't worry about making a plugin interface, or keeping it in a separate maven module. There's no strong requirement for it, and it just confuses matters.

          However, it is good to have a clean interface for starting and stopping the service.

          Regarding BOOKKEEPER-246, this patch should call the interface defined to record the failed ledgers. I'll take a look at this in a while. I'll also work out the order in which the patches will need to go in.

          Uma Maheswara Rao G added a comment -

          +1, I am also more inclined to keep in bookie process for now.

          Flavio Junqueira added a comment -

          I think we should keep it in the bookie process and enabled by default

          I've expressed this concern before that I don't understand completely the use cases for the auditor. Short-lived ledgers will not need it because the window of vulnerability is narrow. In my mind the auditor is mainly for use cases that require long-lived ledgers. What's the rationale for having it enabled by default?

          There's no strong requirement for it, and it just confuses matters.

          It confuses matters to mix both. BookKeeper provides a set of guarantees, and the auditor complements that set of properties by healing the replica set of under-replicated ledgers. The auditor also does not provide any functionality in the critical path of BookKeeper client calls. Given these two observations, it makes more sense to me to keep it separated. It makes it simpler to reason about what the system is doing and may avoid undesirable interferences.

          Flavio Junqueira added a comment -

          To summarize an offline discussion I had with Ivan, these are the three points that I raised during the discussion:

          1. If the bookie and the auditor run on the same jvm, is it possible that runtime exceptions thrown by the auditor will kill the whole jvm and consequently the bookie as well?
          2. If the bookie and the auditor run on the same jvm, it might be more difficult to debug. In particular, I was wondering about how to separate the log messages for postmortem analysis and debugging in general.
          3. The performance of a bookie is more critical than the performance of the auditor. Running them on separate jvms might improve performance isolation.
          Rakesh R added a comment -

          Thanks Ivan, Uma and Flavio for your thoughts. It would be great if others could review my changes.

          Following are the changes made in the latest patch:

          • Considering Ivan's point, I provided AutoRecoveryManager#start() and #stop() methods and removed the dependency on the bookie server. Anyone can plug in the service by invoking the start() and stop() operational methods.
          • I've raised another JIRA, BOOKKEEPER-319, to finalize the way the recovery service is managed.
          • Also corrected ZkLedgerUnderreplicationManager.java, which was adding a duplicate 'missingreplica' when an underreplicated ledger had multiple bookie failures.

          Note: This patch depends on BOOKKEEPER-304 and should be applied after the BOOKKEEPER-304 patch.

          Rakesh R added a comment -

          Hi all, could you have a look and review the latest patch?

          Thanks,
          Rakesh

          Ivan Kelly added a comment -

          I've had another look over this patch. I have some comments.

          The logic change in ZkLedgerUnderreplicationManager is wrong (the previous version was wrong also): the break should be a return.

          The Auditor will be a top-level daemon, so it should create and own its own LedgerManagers and ZooKeeper client. The only thing passed in to it should be the configuration.

          There should be a main method in AutoRecoveryManager. I think AutoRecoveryManager should be responsible for running the recovery worker as well. I'm not sure whether the node which runs the auditor should also run a recovery worker at the same time; this is an open question.

          auditorElector should never be null unless initialization fails, and if initialization fails, start and stop should never be run. Perhaps we should use a guava Service [1] here, as I also suggested to Uma for BOOKKEEPER-248.

          The pattern "ELECTION_PATH + PATH_SEPARATOR + vote" is used in a lot of places. It would be nice to move this into a method "String getVotePath(String vote);".

          In Auditor, don't call getChildren from process(). Instead, call it just after the take() in the main loop. In general, you shouldn't call any blocking methods from the zk event handler thread.
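
          That pattern could be sketched like this (illustrative only, not the patch itself): the watcher only enqueues, and every blocking ZooKeeper call happens in the main loop's own thread:

          import java.util.List;
          import java.util.concurrent.LinkedBlockingQueue;
          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          public class AuditorLoopSketch implements Watcher {
              private final LinkedBlockingQueue<WatchedEvent> events =
                      new LinkedBlockingQueue<WatchedEvent>();

              public void process(WatchedEvent event) {
                  events.add(event); // cheap and non-blocking: just hand off
              }

              void mainLoop(ZooKeeper zk, String availablePath)
                      throws KeeperException, InterruptedException {
                  while (true) {
                      events.take(); // block here, in our own thread
                      // re-arm the watch and do the blocking read outside process()
                      List<String> children = zk.getChildren(availablePath, this);
                      // ... diff children against the previous snapshot, publish failures ...
                  }
              }
          }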

          Uma Maheswara Rao G added a comment -

          There should be a main method in AutoRecoveryManager. I think AutoRecoveryManager should be responsible for running the recovery worker as well. I'm not sure whether the node which runs the auditor should also run a recovery worker at the same time; this is an open question.

          Oh, I was assuming that every recovery node would have both services (an auditor participant + a RW) running, but only one would actually win the auditor work. The RW would do its job normally on all nodes, and all the other recovery nodes' auditor participants would watch for the next election, to win the real auditing job.
          I am not sure whether my assumption matches everyone else's. Any other thoughts?

          Ivan Kelly added a comment -

          I was thinking along the same lines.

          Rakesh R added a comment -

          Thanks again, Ivan, for the detailed review. Could you please give a bit more info on the following:

          In Auditor, don't call getChildren from process(). Instead, call it just after the take() in the main loop. In general, you shouldn't call any blocking methods from the zk event handler thread

          Oh yeah, thanks for bringing up this good point. Correct, this would delay other watch notifications as well.

          The logic change in ZkLedgerUnderreplicationManager is wrong (the previous version was wrong also): the break should be a return.

          If I understand correctly, on NodeExistsException we are trying to append the missingReplica to the underreplicated ledger, so that it also notifies about one more down bookie which contains a copy of the ledger.
          If we simply return, then the setData() method will not be called and there is a chance of missing the info about the second replica.

          For example: L00001 has the ensemble BK1, BK2, BK3.
          Say BK1 fails first; then we mark the ledger underreplicated as L000001 (with BK1 as the data).
          Now BK2 also fails; while creating the znode we get NEE, so we append BK2 as well: L000001 (BK1, BK2).

          I think the break statement makes sense, and the duplicate entry addition should be removed, as in my latest patch.
          Am I missing anything?

          There should be a main method in AutoRecoveryManager.

Yeah, I'll add a main method as well. But what about keeping the start() and stop() methods public? In future that would allow others (any external entity) to manage the recovery process easily.

          I think AutoRecoveryManager should be responsible for running recoveryworker as well.

Of course. I'll integrate RW so that it is also initialized as part of ARM.

          auditorElector should never be null, unless initialization fails. If initialization fails, start and stop should never be run. Perhaps we should us guava service [1] here as I also suggested to Uma for BOOKKEEPER-248.

I just added the null check since the start() and stop() methods are public.

I'll rework the other points.

          Rakesh R added a comment -

I also agree with you guys. Since the Auditor is a stateless, lightweight daemon thread with only a single instance within the bk cluster, it can start the RW service alongside as well.

Oh, I was assuming that every recovery node would run both services (auditor participant + RW), but only one would actually win the election to do the auditor work. The RW would do its job normally on all nodes, and all the other recovery nodes' auditor participants would wait for the next election to win the real auditing job.
I am not sure my assumption is the same as others'. Any other thoughts?

          Rakesh R added a comment -

Hi Ivan, I hope I'm not confusing you.

As per the discussion/comments, AutoRecoveryManager will now depend on RW (which is still in the development stage) and will also make use of the guava Service for a cleaner lifecycle. Shall I move the ARM initialization phase to BOOKKEEPER-319 and work on it there?

That would make this JIRA completely independent, discussing only the auditing job. Do you agree?

It would also give me some more time to familiarize myself with the guava Service.

          Ivan Kelly added a comment -

If I understand correctly, on NodeExistsException we are trying to append the missingReplica to the underreplicated ledger, so that it notifies about one more down bookie which contains a copy of the ledger.
If we simply return, then setData() will not be called and we risk losing the information about the second replica.

For example: L00001 has the ensemble BK1, BK2, BK3.
Say BK1 fails first; we then mark the ledger underreplicated as L00001 (with BK1 as the data).
Now BK2 also fails; while creating the znode we get NEE, so we append BK2 as well: L00001 (BK1 BK2).

          I think "break; statement" is making sense and after that the duplicate entry addition should be removed as per my latest patch.
          Am I missing anything?

The point of this code is that if the node already exists, then there is already a missing replica. We loop through the missingReplicas to see whether the new missingReplica is already there. If so, we can assume someone else has reported this replica missing, so we return.

Yeah, I'll add a main method as well. But what about keeping the start() and stop() methods public? In future that would allow others (any external entity) to manage the recovery process easily.

start() and stop() as public is fine. But if initialization fails, the ctor should throw an exception. This way, the null checks are unneeded.
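
A minimal sketch of the fail-fast pattern I mean; the class shape and constructor parameters here are assumed for the example, not taken from the patch:

import java.io.IOException;
import org.apache.zookeeper.ZooKeeper;

public class AuditorElector {
    private final ZooKeeper zkc; // never null once construction succeeds

    public AuditorElector(String zkServers, int sessionTimeout) throws IOException {
        // Throw from the ctor on failure, so a half-initialized elector is
        // never handed out and start()/stop() need no null checks.
        zkc = new ZooKeeper(zkServers, sessionTimeout, null);
    }

    public void start() { /* join the election */ }

    public void stop() throws InterruptedException {
        zkc.close();
    }
}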

          Regarding Service, we can look at that again after this is in. See my comment on BOOKKEEPER-247.

          Rakesh R added a comment -

Thanks for the clarifications. I've sketched the fix I have in mind below; could you have a look?

The point of this code is that if the node already exists, then there is already a missing replica. We loop through the missingReplicas to see whether the new missingReplica is already there. If so, we can assume someone else has reported this replica missing, so we return.

So we have two scenarios:

1. L0001 contains only BK1. While marking missingReplica of BK2, we get NEE. Assume there is already a missing replica; silently return, as you said.
2. L0001 contains only BK2. While marking missingReplica of BK2, we get NEE. Assume only a single auditor and nobody else is marking, so again we need to merge into the zkMetadata and update it in zk.

On NEE:
check whether the missing replica is already present in the zk urLedger metadata. If yes, return; otherwise merge the missing replica into the urLedger's missing replicas and call setData().

try {
    byte[] bytes = zkc.getData(znode, false, s);
    String existingMissingReplicas = new String(bytes, UTF8);
    // Already reported by someone else; nothing to add.
    if (existingMissingReplicas.contains(missingReplica)) {
        return;
    }
    // Merge the existing znode data into the builder and write it back.
    TextFormat.merge(existingMissingReplicas, builder);
    zkc.setData(znode,
                TextFormat.printToString(builder.build()).getBytes(UTF8),
                s.getVersion());
    return;
} catch (KeeperException.NoNodeException nne) {
          
          Ivan Kelly added a comment -

L0001 contains only BK1. While marking missingReplica of BK2, we get NEE. Assume there is already a missing replica; silently return, as you said.

No, in this case we need to add BK2 as a missingReplica as well.

L0001 contains only BK2. While marking missingReplica of BK2, we get NEE. Assume only a single auditor and nobody else is marking, so again we need to merge into the zkMetadata and update it in zk.

          In this case we want to return. BK2 is already marked as a missingReplica, so it will be picked up for replication anyhow.

          The necessary change is:

                          } catch (KeeperException.NodeExistsException nee) {
                              Stat s = zkc.exists(znode, false);
                              if (s == null) {
                                  continue;
                              }
                              try {
                                  byte[] bytes = zkc.getData(znode, false, s);
                                  TextFormat.merge(new String(bytes, UTF8), builder);
                                  UnderreplicatedLedgerFormat data = builder.build();
                                  for (String r : data.getReplicaList()) {
                                      if (r.equals(missingReplica)) {
          -                                break; // nothing to add
          +                                return; // nothing to add
                                      }
                                  }
                                  builder.addReplica(missingReplica);
                                  zkc.setData(znode,
                                              TextFormat.printToString(builder.build()).getBytes(UTF8),
                                              s.getVersion());
                              } catch (KeeperException.NoNodeException nne) {
                                  continue;
                              } catch (KeeperException.BadVersionException bve) {
                                  continue;
                              } catch (TextFormat.ParseException pe) {
                                  throw new ReplicationException.UnavailableException(
                                          "Invalid data found", pe);
                              }
                          }
                          break;
                      }
          
          Rakesh R added a comment -

Oh yeah, there was a small mistake in my previous comment with BK1. Please read the scenario as:

L0001 contains only BK1. While marking missingReplica of BK2, we get NEE. Assume only a single auditor and nobody else is marking, so again we need to merge into the zkMetadata and update it in zk.

          Ivan Kelly added a comment -

Yes, the code I posted does exactly that. "Merging" is simply adding if it doesn't already exist.

          Rakesh R added a comment -

Thanks Ivan. I'll make the change to simply return when it matches, as you suggested. I'll investigate it further later.

Auditor will be a top-level daemon, so it should create its own LedgerManagers and ZooKeeper client. The only thing passed in to it should be the configuration.

Anyway, we are planning to have an initialization class which starts the Auditor and RW threads. Would you agree to create the ledger managers there and pass them to the Auditor and RW daemons?

In that case, shall I keep the ctor as it is (taking the managers)?

          public Auditor(String bookieIdentifier, AbstractConfiguration conf,
                      ZooKeeper zkc, LedgerManager ledgerManager,
                      LedgerUnderreplicationManager ledgerUnderreplicationManager) {
          
          Rakesh R added a comment -

Attached the latest patch, addressing Ivan's comments and also refactoring the tests.

Hi Ivan, it would be great to have your feedback.

          Thanks,
          Rakesh

          Rakesh R added a comment -

          Updates:

I've raised another issue, BOOKKEEPER-380, to discuss the scenario of updating duplicate missing replicas in ZkLedgerUnderreplicationManager.

Also, I rebased the patch for this issue to make it independent.

Ivan Kelly added a comment - edited

Anyway, we are planning to have an initialization class which starts the Auditor and RW threads. Would you agree to create the ledger managers there and pass them to the Auditor and RW daemons?

I think both the auditor and the RW daemon should have their own instances of the managers. We're only putting them in the same process for convenience. Otherwise they should be completely isolated, except for communication through zookeeper.
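
Roughly something like this, where newLedgerManager/newLedgerUnderreplicationManager are hypothetical factory helpers and the ReplicationWorker ctor is assumed (RW is still under development); only the Auditor ctor matches the one quoted earlier:

// Each daemon builds its own ZooKeeper client and managers from the shared
// configuration; the two interact only through zookeeper itself.
ZooKeeper auditorZk = new ZooKeeper(zkServers, sessionTimeout, null);
ZooKeeper workerZk = new ZooKeeper(zkServers, sessionTimeout, null);

Auditor auditor = new Auditor(bookieId, conf, auditorZk,
        newLedgerManager(conf, auditorZk),                  // auditor's own instance
        newLedgerUnderreplicationManager(conf, auditorZk));
ReplicationWorker worker = new ReplicationWorker(conf, workerZk,
        newLedgerManager(conf, workerZk),                   // worker's own instance
        newLedgerUnderreplicationManager(conf, workerZk));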

ZK_MAX_RETRY_TIMES - how can you still be auditor if you've lost your zk connection? Have you guys actually seen intermittency with zookeeper? I don't think it's necessary to try to handle it. If the zk connection goes away, you lose your lock and someone else becomes auditor. The auditor does a scan when it comes online in any case, so it's not like we would miss an event.
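
Handling it can then stay trivial; a sketch, with shutdownAuditor() as a hypothetical step-down hook:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

Watcher sessionWatcher = new Watcher() {
    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
            // The ephemeral vote znode vanishes with the session, so the lock
            // is already lost; step down and let the new auditor's startup
            // scan pick up anything missed.
            shutdownAuditor(); // hypothetical step-down hook
        }
    }
};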

          Rakesh R added a comment -

Hi Ivan, I've attached a patch addressing the above comments.

          I think both auditor and RW daemon

Now, similar to the RW daemon, the Auditor also initializes the managers itself.

          ZK_MAX_RETRY_TIMES - how can you still be auditor if you've lost your zk connection?

Yeah, it's true. I've removed the retry logic and made the publishing logic simpler.

Could you have a look at the latest patch?

          Ivan Kelly added a comment -

          Committed as 1377716. Good work Rakesh.

          Hudson added a comment -

          Integrated in bookkeeper-trunk #674 (See https://builds.apache.org/job/bookkeeper-trunk/674/)
          BOOKKEEPER-272: Provide automatic mechanism to know bookie failures (rakeshr via ivank) (Revision 1377716)

          Result = FAILURE
          ivank :
          Files :

          • /zookeeper/bookkeeper/trunk/CHANGES.txt
          • /zookeeper/bookkeeper/trunk/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java
          • /zookeeper/bookkeeper/trunk/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/AuditorElector.java
          • /zookeeper/bookkeeper/trunk/bookkeeper-server/src/test/java/org/apache/bookkeeper/replication/AuditorBookieTest.java
          • /zookeeper/bookkeeper/trunk/bookkeeper-server/src/test/java/org/apache/bookkeeper/replication/AuditorLedgerCheckerTest.java

            People

            • Assignee:
              Rakesh R
              Reporter:
              Rakesh R
            • Votes:
              0
              Watchers:
              7
