Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3.3, 3.4.0
    • Component/s: contrib-bookkeeper
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      It is currently easy to know how many entries a ledger has, but there is no easy way to know the total number of bytes in a ledger. The idea of this JIRA is to add a method that gives the number of bytes in a closed ledger. My current idea is to simply have the writer count the number of bytes written and store the total in ZooKeeper.
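
      For illustration, here is a minimal sketch of how a client might query the byte count once such a method exists. The getLength() accessor and the rest of this snippet are assumptions based on the existing BookKeeper client API, not a confirmed design.

        import org.apache.bookkeeper.client.BookKeeper;
        import org.apache.bookkeeper.client.LedgerHandle;

        public class LedgerLengthExample {
            public static void main(String[] args) throws Exception {
                long ledgerId = Long.parseLong(args[0]);          // id of an already-closed ledger
                BookKeeper bk = new BookKeeper("localhost:2181"); // ZooKeeper ensemble used by the bookies
                LedgerHandle lh = bk.openLedger(ledgerId, BookKeeper.DigestType.CRC32, "passwd".getBytes());
                System.out.println("ledger bytes: " + lh.getLength()); // proposed byte-count accessor
                lh.close();
                bk.halt();
            }
        }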

      1. ZOOKEEPER-465-3.3.patch
        18 kB
        Flavio Junqueira
      2. ZOOKEEPER-465.patch
        17 kB
        Flavio Junqueira
      3. ZOOKEEPER-465.patch
        17 kB
        Flavio Junqueira
      4. ZOOKEEPER-465.patch
        18 kB
        Flavio Junqueira
      5. ZOOKEEPER-465.patch
        18 kB
        Flavio Junqueira

        Activity

        dhruba borthakur added a comment -

        This will be very helpful for many applications. I am writing an application that needs this support. If somebody already has a patch to add this API to BookKeeper, I would appreciate that a lot. Otherwise, I will do it myself in the near future and submit a patch.

        dhruba borthakur added a comment -

        Please pardon my ignorance, but your approach works because nobody can append new entries to a ledger after it is closed, right?

        Also, what will be the semantics for a concurrent reader? For example, if process A is writing to a ledger while another process B wants to find the length of the ledger, is this possible?

        dhruba borthakur added a comment -

        > My current idea is to simply have the writer counting the number of bytes written and store it to ZooKeeper

        Does this mean that every addEntry call to the ledger will result in an additional zk call to update the ledger length?

        Flavio Junqueira added a comment -

        Hi Dhruba. When I wrote the description, I think I was referring to writing to ZooKeeper only once, when we close a ledger, so we wouldn't have to pay the price of a ZooKeeper update upon each addEntry. However, thinking again about the problem, this approach is not fault tolerant: if the client writer crashes before closing and the byte count is volatile, then we will lose it.

        One way I see to overcome this problem is having each bookie keep the byte count for its ledger fragment. Given the byte count B for a ledger fragment, we can obtain an estimate of the total number of bytes by computing B * n/r, where n is the number of bookies storing the ledger and r is the replication factor of each entry. This formula comes from the observation that each bookie stores a fraction r/n of the entries of a ledger.
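        To make the arithmetic concrete (illustrative numbers only): with n = 4 bookies, a replication factor of r = 2, and B = 100 MB stored on one bookie, the estimate would be 100 * 4/2 = 200 MB for the whole ledger.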

        This approach, however, does not provide a good estimate if the length of entries varies significantly. A less efficient approach that doesn't have the imbalance problem is reading the byte counts from all bookies, adding them up, and dividing by the replication factor. This operation will only complete if no bookie is faulty. In the case of a faulty bookie, we have a procedure to recover its ledger fragments.

        Assuming that there are bookies that have crashed and their fragments haven't been replicated to new bookies, the best I can think of at this point is taking the average over the bookies that are up and performing the same computation above.

        Any other option I'm missing?

        dhruba borthakur added a comment -

        Maybe a brain-dead idea, but can we write the current size of the ledger along with each entry in the ledger? So, if an application wants to find the size of a ledger, the system can internally read the last entry in the ledger and extract the length of the ledger from it.

        Flavio Junqueira added a comment -

        That's a good approach. I was trying to avoid the overhead of adding a long to every entry, but it surely works.

        Mahadev konar added a comment -

        +1, that's a nice idea. Is this a big change, Flavio?

        Mahadev konar added a comment -

        Marking it for 3.3.3 to see if we can have this in the next release to let folks use it.

        Flavio Junqueira added a comment -

        For Dhruba's proposal, the changes would be concentrated in the DigestManager class.

        1. To add the length to each entry, we would have to do it here: DigestManager.computeDigestAndPackageForSending().
        2. To remove the length, we would need to modify DigestManager.verifyDigestAndReturnData().

        One thing I noticed is that we are using this constant 24 all over the place in the DigestManager class, and it is 24 because we currently add 3 longs to an entry. It might be a good idea in this patch to name that constant and use the name instead of the value directly.
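
        A minimal sketch of what naming that constant might look like (the names of the three longs here are assumptions, not taken from this comment):

        // Illustrative sketch only: replace the magic number 24 with a named constant.
        // The current per-entry metadata header is three longs (assumed to be ledgerId,
        // entryId, and lastAddConfirmed), 8 bytes each.
        static final int METADATA_LENGTH = 3 * (Long.SIZE / Byte.SIZE); // 24 bytes
        // If the ledger length were added as a fourth long, this would become 4 * 8 = 32 bytes.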

        dhruba borthakur added a comment -

        > can we write the current size of the ledger along with each entry in the ledger?

        This might not work very well if there are multiple clients writing to the same ledger concurrently, especially because the bookies do not keep state for the clients. Am I understanding it right?

        Flavio Junqueira added a comment -

        We assume that there is a single client writing to a ledger. Do you have a use case for multiple clients writing to the same ledger? You may consider instead having multiple ledgers being written concurrently, using the same set of bookies. We have tested with up to 20,000 concurrent ledgers and we didn't notice a performance penalty on the aggregate performance. I'm not sure if it suits your case, though.

        dhruba borthakur added a comment -

        In my case, a ledger will have only one writer. Are we then still good with the approach where the client maintains the current ledger size?

        Just a hypothetical question: does BookKeeper assume that there is only one writer to a ledger? If so, what happens if a second client starts writing to the same ledger? Will it get an error?

        Flavio Junqueira added a comment -

        To write to a ledger, a client must first create the ledger, and only one client can create a ledger with a given identifier. We enforce this behavior by using the sequence flag in ZooKeeper when creating the znode that will represent the newly created ledger. Bookies, however, do not enforce that a single client writes to a ledger, so we rely upon the behavior of the client.
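
        For reference, a small sketch of the kind of sequential-znode creation being described (the path, data, and ACL here are placeholders, not the actual values BookKeeper uses):

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.ZooDefs.Ids;
        import org.apache.zookeeper.ZooKeeper;

        public class SequentialLedgerNode {
            public static void main(String[] args) throws Exception {
                // The SEQUENTIAL flag makes ZooKeeper append a unique, monotonically
                // increasing counter to the path, so no two clients can end up creating
                // a ledger with the same identifier.
                ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);
                String path = zk.create("/ledgers/L",             // placeholder path prefix
                                        new byte[0],              // placeholder ledger metadata
                                        Ids.OPEN_ACL_UNSAFE,
                                        CreateMode.PERSISTENT_SEQUENTIAL);
                System.out.println("created " + path);            // e.g. /ledgers/L0000000042
                zk.close();
            }
        }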

        Flavio Junqueira added a comment -

        Attaching a functional patch, including tests. The only part still missing is documentation, but otherwise it should be fine.

        dhruba borthakur added a comment -

        Thanks, Flavio, for doing this. If it is not too much work for you, can you please upload this patch to Review Board? It just makes it easier to review: http://reviews.apache.org/.

        Flavio Junqueira added a comment -

        I'm not able to create a new review request, and it seems to be due to the recent repository move. I have reported it to infra, so let's see what they say.

        Benjamin Reed added a comment -

        Sorry it took so long to review. I've added some notes at https://reviews.apache.org/r/234/

        Flavio Junqueira added a comment -

        This patch addresses Ben's comments.

        Flavio Junqueira added a comment -

        Removed some tabs.

        Benjamin Reed added a comment -

        Sorry to be a pain, Flavio, but can I get a couple of very minor changes?

        1) In PendingReadOp, can you add a comment explaining why you subtract 8 on the line:

        +        entry.length = buffer.getLong(DigestManager.METADATA_LENGTH - 8);
        

        2) In LedgerRecoveryOp, can you add a comment on the line:

        +            lh.length = entry.getLength() - (long) data.length;
        

        I think it is because you are going to call addEntry on that data, so it will be added back, right?

        3) We have been focusing a bit on the log levels of messages, and I think the following message should be at debug level:

        +            LOG.info("After closing length is: " + lh.getLength());
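
        For illustration, here are commented versions of the two lines flagged above, based on the explanations in this thread (an editorial sketch, not the final patch text):

        // 1) The ledger length is packed as the last long of the per-entry metadata,
        //    so it starts 8 bytes (one long) before the end of the header.
        entry.length = buffer.getLong(DigestManager.METADATA_LENGTH - 8);

        // 2) Subtract this entry's payload size because recovery will call addEntry on
        //    the same data, which adds its length back to lh.length.
        lh.length = entry.getLength() - (long) data.length;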
        
        Benjamin Reed added a comment -

        BTW, Dhruba, is this what you were looking for?

        dhruba borthakur added a comment -

        This patch looks good, +1. The client-writer maintains the length of the ledger until the ledger is closed.

        On a related note, there is something that is confusing me. Suppose a client-writer is writing data to 5 bookies. If one bookie dies, the client continues to write data to the remaining 4 bookies. Then the ledger is closed. Now, bookie 5 has partial data because many writes did not make it to that node. How does the system know to ignore the data from bookie 5?

        Flavio Junqueira added a comment -

        Addressing Ben's comments.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12470630/ZOOKEEPER-465.patch
        against trunk revision 1068067.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/125//testReport/
        Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/125//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/125//console

        This message is automatically generated.

        Benjamin Reed added a comment -

        +1, looks good! Thanks for making the changes.

        Mahadev konar added a comment -

        Flavio, I committed this code to trunk, but the patch doesn't apply to the 3.3 branch. Do you have a patch for that?

        Flavio Junqueira added a comment -

        I wonder if it is worth having a 3.3 version of this patch. There are new features missing from 3.3 already, like ZOOKEEPER-712. I wanted to have a new release of BookKeeper with these new features, but 3.4.0 does not seem to be around the corner.

        Mahadev konar added a comment -

        Flavio, I think it would be great to have it in 3.3.3 so that folks who want to get the size of the ledger (like Dhruba) can start using this. I can work with you to come up with a patch for the 3.3.3 branch. ZOOKEEPER-712, though important, doesn't block development on top of BookKeeper, but this feature might. What do you think?

        dhruba borthakur added a comment -

        I might have to start using trunk because ZOOKEEPER-712 is essential for storing HDFS logs in BookKeeper, isn't it? Especially for recovering from a scenario where the namenode died or crashed before closing the ledger?

        Flavio Junqueira added a comment -

        The bookie recovery feature (ZOOKEEPER-712) is to recover ledger fragments upon a bookie crash. Let's say your ledger is written to 4 bookies, and later bookie number 4 crashes. To avoid having the ledger go away because of multiple bookie failures, we have this facility that enables one to recreate the ledger fragments of a faulty bookie.

        The feature you're referring to is ledger recovery (as opposed to bookie recovery), and it has been there for a while. Both trunk and 3.3 have it.

        dhruba borthakur added a comment -

        OK, got it, thanks Flavio.

        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1093 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/1093/)
        ZOOKEEPER-465. Ledger size in bytes. (flavio via mahadev)

        Flavio Junqueira added a comment -

        Let me add that I'm fine with having this feature without bookie recovery if this is useful for anyone. I'll produce a patch for 3.3.

        Benjamin Reed added a comment -

        Has anyone tried applying ZOOKEEPER-712 to 3.3?

        Flavio Junqueira added a comment -

        That's a good observation. I have just reopened ZOOKEEPER-712 and uploaded a patch for 3.3. Could anyone please review and commit? I'll produce a patch for this JIRA for 3.3 once ZOOKEEPER-712 gets in.

        Mahadev konar added a comment -

        Thanks Flavio, I'll review ZOOKEEPER-712 and commit.

        Mahadev konar added a comment -

        Flavio, can you please upload a patch for the 3.3 branch?

        Flavio Junqueira added a comment -

        Patch for 3.3.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12471340/ZOOKEEPER-465-3.3.patch
        against trunk revision 1069169.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/137//console

        This message is automatically generated.

        Mahadev konar added a comment -

        I committed this to 3.3 as well. Thanks, Flavio.


          People

          • Assignee:
            Flavio Junqueira
            Reporter:
            Flavio Junqueira
          • Votes:
            0
            Watchers:
            4
