ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-882

Startup loads last transaction from snapshot

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3.3, 3.4.0
    • Component/s: server
    • Labels:
      None

      Description

      On startup, the server first loads the latest snapshot, and then loads from the log starting at the last transaction in the snapshot. It should begin from one past that last transaction in the log. I will attach a possible patch.

      1. ZOOKEEPER-882.patch
        2 kB
        Jared Cantwell
      2. ZOOKEEPER-882.patch
        6 kB
        Jared Cantwell
      3. restore
        0.8 kB
        Jared Cantwell
      4. FailureTest-882.patch
        5 kB
        Jared Cantwell
      5. 882.diff
        1 kB
        Jared Cantwell

        Activity

        Hide
        Jared Cantwell added a comment -

        A simple patch for consideration.

        Show
        Jared Cantwell added a comment - A simple patch for consideration.
        Hide
        Flavio Junqueira added a comment -

        Hi Jared, Thanks for bringing this up. It doesn't look like that extra call to next() is necessary. If there is another file to process, then the call to next will return true and we will keep processing transactions, no?

        Show
        Flavio Junqueira added a comment - Hi Jared, Thanks for bringing this up. It doesn't look like that extra call to next() is necessary. If there is another file to process, then the call to next will return true and we will keep processing transactions, no?
        Hide
        Flavio Junqueira added a comment -

        I'm also not clear on your second point. If you check FileTxnIterator.init(), then it seems to me that the zxid passed as a parameter should be included, so not dt.lastProcessedZxid+1. What am I missing?

        Show
        Flavio Junqueira added a comment - I'm also not clear on your second point. If you check FileTxnIterator.init(), then it seems to me that the zxid passed as a parameter should be included, so not dt.lastProcessedZxid+1. What am I missing?
        Hide
        Jared Cantwell added a comment -

        Maybe I am misunderstanding the FileTxnLog.next() call, but I interpret the following:

        • Based on its usage, the next() call should prepare the next hdr and record, and return true if it did this successfully, and false otherwise.
        • if the catch() executes, it means that we couldn't prepare from the current file, so we need to move to the next
        • goToNextLog() simply swaps the log file, but does not prepare hdr and record, so the current ones will remain current and get processed a second time

        If any of this is wrong, than my patch probably doesn't make sense, so please let me know.

        As far as the second point, I agree that init() is inclusive. So you want to pass in as a parameter something that has not yet been processed. So, if we pass in lastProcessedZxid (which is already in the snapshot), then that will be read from the log also since init() starts at that transaction, not one past it. Is that right?

        Show
        Jared Cantwell added a comment - Maybe I am misunderstanding the FileTxnLog.next() call, but I interpret the following: Based on its usage, the next() call should prepare the next hdr and record, and return true if it did this successfully, and false otherwise. if the catch() executes, it means that we couldn't prepare from the current file, so we need to move to the next goToNextLog() simply swaps the log file, but does not prepare hdr and record, so the current ones will remain current and get processed a second time If any of this is wrong, than my patch probably doesn't make sense, so please let me know. As far as the second point, I agree that init() is inclusive. So you want to pass in as a parameter something that has not yet been processed. So, if we pass in lastProcessedZxid (which is already in the snapshot), then that will be read from the log also since init() starts at that transaction, not one past it. Is that right?
        Hide
        Flavio Junqueira added a comment -

        I agree with your description of the behavior of next, and sounds right to me that we should be setting hdr and calling "return next();" at the end of the catch block.

        Regarding init(), we first use the value of zxid to determine which log files to read: all log files tagged with a value higher than zxid and the last log file that is less than zxid. Next we iterate over the log files until hdr.getZxid() is greater or equal to zxid (should be zxid really). This guarantees that the next call to next(), after init() returns, will return zxid+1. Does it sound right to you?

        Show
        Flavio Junqueira added a comment - I agree with your description of the behavior of next, and sounds right to me that we should be setting hdr and calling "return next();" at the end of the catch block. Regarding init(), we first use the value of zxid to determine which log files to read: all log files tagged with a value higher than zxid and the last log file that is less than zxid. Next we iterate over the log files until hdr.getZxid() is greater or equal to zxid (should be zxid really). This guarantees that the next call to next(), after init() returns, will return zxid+1. Does it sound right to you?
        Hide
        Jared Cantwell added a comment -

        On second look, I agree with the way that works. I think the bug in fact is later in FileSnapLog.restore() because next() is never called before the loop starts executing, so the first transaction processed is the one that TxnLog was initialized with. Do you see that also?

        Show
        Jared Cantwell added a comment - On second look, I agree with the way that works. I think the bug in fact is later in FileSnapLog.restore() because next() is never called before the loop starts executing, so the first transaction processed is the one that TxnLog was initialized with. Do you see that also?
        Hide
        Jared Cantwell added a comment -

        Maybe restore() should look like the attached (not a diff) instead.

        Show
        Jared Cantwell added a comment - Maybe restore() should look like the attached (not a diff) instead.
        Hide
        Flavio Junqueira added a comment -

        (I meant to post a comment yesterday, but jira decided to re-index right at the time)

        I like the way you structured the restore loop, it is simpler and easier to read, and I can't find any problem with it. About the severity of the bug, my interpretation is that it is harmless to re-execute the transaction, but still worth proposing a patch.

        Show
        Flavio Junqueira added a comment - (I meant to post a comment yesterday, but jira decided to re-index right at the time) I like the way you structured the restore loop, it is simpler and easier to read, and I can't find any problem with it. About the severity of the bug, my interpretation is that it is harmless to re-execute the transaction, but still worth proposing a patch.
        Hide
        Jared Cantwell added a comment -

        Patch to incorporate recommended changes.

        Show
        Jared Cantwell added a comment - Patch to incorporate recommended changes.
        Hide
        Flavio Junqueira added a comment -

        Hi Jared, I was wondering if you can add a test case to your patch.

        Show
        Flavio Junqueira added a comment - Hi Jared, I was wondering if you can add a test case to your patch.
        Hide
        Jared Cantwell added a comment -

        Sure Flavio. I have a test ready, but I have a question for you. In order to make the test illustrate the bug in the current version, it needs to be changed slightly (~3 lines). Otherwise, it will still fail with the current version, but for the wrong reason. It does, however, verify that transactions are loaded correctly, and once, with the patch. How should I handle this? I was thinking about posting a version of the test for people to verify that this truly is a bug, and then submitting the real test with the patch. How does that sound?

        Show
        Jared Cantwell added a comment - Sure Flavio. I have a test ready, but I have a question for you. In order to make the test illustrate the bug in the current version, it needs to be changed slightly (~3 lines). Otherwise, it will still fail with the current version, but for the wrong reason. It does, however, verify that transactions are loaded correctly, and once, with the patch. How should I handle this? I was thinking about posting a version of the test for people to verify that this truly is a bug, and then submitting the real test with the patch. How does that sound?
        Hide
        Flavio Junqueira added a comment -

        If I understand correctly what you're proposing, I think it won't be necessary to submit two separate patches. To verify that the test fails without the patch, I can simply add the test without applying any other modification in the patch file, and then run the test. After applying the modifications to the code base, I'd be able to verify that the test does not fail any longer. Does it sound right to you?

        Show
        Flavio Junqueira added a comment - If I understand correctly what you're proposing, I think it won't be necessary to submit two separate patches. To verify that the test fails without the patch, I can simply add the test without applying any other modification in the patch file, and then run the test. After applying the modifications to the code base, I'd be able to verify that the test does not fail any longer. Does it sound right to you?
        Hide
        Jared Cantwell added a comment -

        Sorry, I don't think I was clear. That makes sense, but I was saying that if you only apply the test then it will fail (as expected), but for the wrong reason. Due to the loop restructuring, and the semantics of the change, a test to illustrate the current bug would need to be slightly different (~3 lines) than the test I would submit to illustrate correct functionality after the patch.

        Show
        Jared Cantwell added a comment - Sorry, I don't think I was clear. That makes sense, but I was saying that if you only apply the test then it will fail (as expected), but for the wrong reason. Due to the loop restructuring, and the semantics of the change, a test to illustrate the current bug would need to be slightly different (~3 lines) than the test I would submit to illustrate correct functionality after the patch.
        Hide
        Flavio Junqueira added a comment -

        Ok, got it. I agree that it makes sense to submit both. Just name the patch file including the verification test differently and explain the difference in the jira.

        Show
        Flavio Junqueira added a comment - Ok, got it. I agree that it makes sense to submit both. Just name the patch file including the verification test differently and explain the difference in the jira.
        Hide
        Jared Cantwell added a comment -

        Submitting a unit test that illustrates the bug and confirms that transactions are loaded multiple times when moving to the next log file. This test ignores FileTxnSnapLog and focuses on the behavior of FileTxnLog for the moment.

        Show
        Jared Cantwell added a comment - Submitting a unit test that illustrates the bug and confirms that transactions are loaded multiple times when moving to the next log file. This test ignores FileTxnSnapLog and focuses on the behavior of FileTxnLog for the moment.
        Hide
        Flavio Junqueira added a comment -

        Hi Jared, What's the status of this patch? Do you think you can produce a patch for review and submit it?

        Show
        Flavio Junqueira added a comment - Hi Jared, What's the status of this patch? Do you think you can produce a patch for review and submit it?
        Hide
        Jared Cantwell added a comment -

        Flavio, I hit a wall with this a few weeks ago and haven't had a chance to look at it again. I have almost everything in place, except I remember there being an issue on the very first log. I'll look at it again and report back soon.

        Show
        Jared Cantwell added a comment - Flavio, I hit a wall with this a few weeks ago and haven't had a chance to look at it again. I have almost everything in place, except I remember there being an issue on the very first log. I'll look at it again and report back soon.
        Hide
        Jared Cantwell added a comment -

        Flavio, so I figured out what was holding me up before. In an earlier comment you said the following, "Regarding init(), we first use the value of zxid to determine which log files to read: all log files tagged with a value higher than zxid and the last log file that is less than zxid. Next we iterate over the log files until hdr.getZxid() is greater or equal to zxid (should be zxid really). This guarantees that the next call to next(), after init() returns, will return zxid+1." This is true almost all the time. It is true when we have the requested zxid in our log, but NOT if the very first log entry is zxid+1, which can easily be the case in almost any startup scenario. I am betting that the loop in restore() was originally planned to be coded as you stated (by initializing with last lastProcessedTxnID and then calling next()), but then the issue I mentioned came up and it was decided it was ok to just use the first header, and call next afterward. Without rewriting parts of the TxnIterator, I don't think restructuring that loop in restore() is simple. Therefore, I propose that we simply request lastProcessedZxid+1 from the log initially (which is guaranteed to either be present, or have no entries at all in the log). I am about to submit a patch, with a unittest for this.

        Show
        Jared Cantwell added a comment - Flavio, so I figured out what was holding me up before. In an earlier comment you said the following, "Regarding init(), we first use the value of zxid to determine which log files to read: all log files tagged with a value higher than zxid and the last log file that is less than zxid. Next we iterate over the log files until hdr.getZxid() is greater or equal to zxid (should be zxid really). This guarantees that the next call to next(), after init() returns, will return zxid+1." This is true almost all the time. It is true when we have the requested zxid in our log, but NOT if the very first log entry is zxid+1, which can easily be the case in almost any startup scenario. I am betting that the loop in restore() was originally planned to be coded as you stated (by initializing with last lastProcessedTxnID and then calling next()), but then the issue I mentioned came up and it was decided it was ok to just use the first header, and call next afterward. Without rewriting parts of the TxnIterator, I don't think restructuring that loop in restore() is simple. Therefore, I propose that we simply request lastProcessedZxid+1 from the log initially (which is guaranteed to either be present, or have no entries at all in the log). I am about to submit a patch, with a unittest for this.
        Hide
        Jared Cantwell added a comment -

        Attaching patch described in previous comment.

        Show
        Jared Cantwell added a comment - Attaching patch described in previous comment.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12466505/ZOOKEEPER-882.patch
        against trunk revision 1049401.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/70//testReport/
        Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/70//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/70//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12466505/ZOOKEEPER-882.patch against trunk revision 1049401. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/70//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/70//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/70//console This message is automatically generated.
        Hide
        Jared Cantwell added a comment -

        Also, somewhat related to this patch, I think the TxnIterator should implement java.util.Iterator, not its own strange version of it. A call to next() should return a TxnFileEntry which wraps the TxnHeader and Record. That's probably for another patch and a later version though, but just wanted to mention it.

        Show
        Jared Cantwell added a comment - Also, somewhat related to this patch, I think the TxnIterator should implement java.util.Iterator, not its own strange version of it. A call to next() should return a TxnFileEntry which wraps the TxnHeader and Record. That's probably for another patch and a later version though, but just wanted to mention it.
        Hide
        Flavio Junqueira added a comment -

        +1, looks good, Jared. On the Iterator issue you mention, would you mind creating a jira for it?

        Show
        Flavio Junqueira added a comment - +1, looks good, Jared. On the Iterator issue you mention, would you mind creating a jira for it?
        Hide
        Flavio Junqueira added a comment -

        Committed revision 1052244.

        Thanks again, Jared.

        Show
        Flavio Junqueira added a comment - Committed revision 1052244. Thanks again, Jared.
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1046 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/1046/)
        ZOOKEEPER-882. Startup loads last transaction from snapshot (jared via fpj)

        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1046 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/1046/ ) ZOOKEEPER-882 . Startup loads last transaction from snapshot (jared via fpj)

          People

          • Assignee:
            Jared Cantwell
            Reporter:
            Jared Cantwell
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development