ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1090

Race condition while taking snapshot can lead to not restoring data tree correctly

    Details

    • Hadoop Flags:
      Reviewed

      Description

      I think I have found a bug in the snapshot mechanism.

      The problem occurs because dt.lastProcessedZxid is not synchronized (or rather set before the data tree is modified):

      FileTxnSnapLog:

          public void save(DataTree dataTree,
                  ConcurrentHashMap<Long, Integer> sessionsWithTimeouts)
              throws IOException {
              long lastZxid = dataTree.lastProcessedZxid;
              LOG.info("Snapshotting: " + Long.toHexString(lastZxid));
              File snapshot=new File(
                      snapDir, Util.makeSnapshotName(lastZxid));
              snapLog.serialize(dataTree, sessionsWithTimeouts, snapshot);   <=== the Datatree may not have the modification for lastProcessedZxid
          }
      

      DataTree:

          public ProcessTxnResult processTxn(TxnHeader header, Record txn) {
              ProcessTxnResult rc = new ProcessTxnResult();
      
              String debug = "";
              try {
                  rc.clientId = header.getClientId();
                  rc.cxid = header.getCxid();
                  rc.zxid = header.getZxid();
                  rc.type = header.getType();
                  rc.err = 0;
                  if (rc.zxid > lastProcessedZxid) {
                      lastProcessedZxid = rc.zxid;
                  }
                  [...modify data tree...]           
       }
      

      The lastProcessedZxid must be set after the modification is done.

      As a result, if server crashes after taking the snapshot (and the snapshot does not contain change corresponding to lastProcessedZxid) restore will not restore the data tree correctly:

      public long restore(DataTree dt, Map<Long, Integer> sessions,
                  PlayBackListener listener) throws IOException {
              snapLog.deserialize(dt, sessions);
              FileTxnLog txnLog = new FileTxnLog(dataDir);
              TxnIterator itr = txnLog.read(dt.lastProcessedZxid+1); <=== Assumes lastProcessedZxid is deserialized
       }
      

      I have had offline discussion with Ben and Camille on this. I will be posting the discussion shortly.

      1. ZOOKEEPER-1090
        7 kB
        Vishal Kher

        Activity

        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1258 (See https://builds.apache.org/job/ZooKeeper-trunk/1258/)
        ZOOKEEPER-1090. Race condition while taking snapshot can lead to not restoring data tree correctly.

        breed : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1151738
        Files :

        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java
        • /zookeeper/trunk/CHANGES.txt
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1258 (See https://builds.apache.org/job/ZooKeeper-trunk/1258/ ) ZOOKEEPER-1090 . Race condition while taking snapshot can lead to not restoring data tree correctly. breed : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1151738 Files : /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java /zookeeper/trunk/CHANGES.txt
        Hide
        Vishal Kher added a comment -

        Ben, Camille, Thanks for the review. I will keep the comments in mind for next time.

        What do you think about the problem that I mentioned. It is cerainly very rare. In general, do you think we should modify lastProcssedZxid only if the modification to data tree succeeds? (there will be other problems will need to solve along with this, e.g., if x doesn't succeed should we apply x + 1, etc.)

        Show
        Vishal Kher added a comment - Ben, Camille, Thanks for the review. I will keep the comments in mind for next time. What do you think about the problem that I mentioned. It is cerainly very rare. In general, do you think we should modify lastProcssedZxid only if the modification to data tree succeeds? (there will be other problems will need to solve along with this, e.g., if x doesn't succeed should we apply x + 1, etc.)
        Hide
        Benjamin Reed added a comment -

        Committed revision 1151738.

        Show
        Benjamin Reed added a comment - Committed revision 1151738.
        Hide
        Camille Fournier added a comment -

        Ben, can you check this one in? Thanks!

        Show
        Camille Fournier added a comment - Ben, can you check this one in? Thanks!
        Hide
        Benjamin Reed added a comment -

        oh, i also wanted to second camille's comment about Assert. Also, instead of Assert.assertTrue("message", false) you can use Assert.fail("message")

        Show
        Benjamin Reed added a comment - oh, i also wanted to second camille's comment about Assert. Also, instead of Assert.assertTrue("message", false) you can use Assert.fail("message")
        Hide
        Benjamin Reed added a comment -

        +1 great find and fix vishal

        Show
        Benjamin Reed added a comment - +1 great find and fix vishal
        Hide
        Vishal Kher added a comment -

        We could run into this if the JVM is running low on memory (runtime execption) while modifying the tree. It is a very rare case. We don't need to fix it right away. It sounds like if modification to the data tree for x fails due to runtime exception (and not due to exceptions like NoNode), then before applying x+1 to the tree we should attempt to apply x first. We should change lastProcessedZxid only if modifiction to the tree succeeds.

        Show
        Vishal Kher added a comment - We could run into this if the JVM is running low on memory (runtime execption) while modifying the tree. It is a very rare case. We don't need to fix it right away. It sounds like if modification to the data tree for x fails due to runtime exception (and not due to exceptions like NoNode), then before applying x+1 to the tree we should attempt to apply x first. We should change lastProcessedZxid only if modifiction to the tree succeeds.
        Hide
        Camille Fournier added a comment -

        I don't have a test that that will fail without the patch.

        Ok, I see. So the test verifies that a snapshot with a later zxid than the named zxid processes correctly, and it does. That's good.

        I think this is all fine. The only comment I would make is that for future reference, you should use Assert.assertEquals instead of Assert.assertTrue when asserting equality. The Assert will take care of the work of printing out "expected foo but was bar" for you in that case, so you don't need it in the message of the Assert statement. Just a bit easier for you to write and others to read.

        After going through the code again, I think the following hypothetical

        failure scenario could be a problem: For some reason, modification to
        data tree fails for transactions x to x-n on one of the peer. The
        lastProcessedZxid will still be set to x. If the peer restores after
        snapshot, then it will not have transactions x-n to x in its data
        tree. Now, if the modification had failed due to transient errors (say
        out of memory), then we will not apply transactions x-n to x even if
        we should have. The only other reason we could run into this scenario
        (apart from runtime failures) is if we had a bug in DataTree
        modification code path. This looks like a rare corner case.

        Hmm. So if we had a bug in DataTree, we could lose transactions, yes? Well, I think it's fair to say that in general, if we have bugs we might lose data. Probably worth keeping in mind but not losing sleep over at this time.

        Show
        Camille Fournier added a comment - I don't have a test that that will fail without the patch. Ok, I see. So the test verifies that a snapshot with a later zxid than the named zxid processes correctly, and it does. That's good. I think this is all fine. The only comment I would make is that for future reference, you should use Assert.assertEquals instead of Assert.assertTrue when asserting equality. The Assert will take care of the work of printing out "expected foo but was bar" for you in that case, so you don't need it in the message of the Assert statement. Just a bit easier for you to write and others to read. After going through the code again, I think the following hypothetical failure scenario could be a problem: For some reason, modification to data tree fails for transactions x to x-n on one of the peer. The lastProcessedZxid will still be set to x. If the peer restores after snapshot, then it will not have transactions x-n to x in its data tree. Now, if the modification had failed due to transient errors (say out of memory), then we will not apply transactions x-n to x even if we should have. The only other reason we could run into this scenario (apart from runtime failures) is if we had a bug in DataTree modification code path. This looks like a rare corner case. Hmm. So if we had a bug in DataTree, we could lose transactions, yes? Well, I think it's fair to say that in general, if we have bugs we might lose data. Probably worth keeping in mind but not losing sleep over at this time.
        Hide
        Vishal Kher added a comment -

        I don't have a test that that will fail without the patch.
        Earlier, we had a problem because snapshot.x did not contain x. Now,
        we cannot run into that scenario because we first apply x and then
        increment lastProcessedZxid.

        with the patch we can only have a scenario where we have applied x in
        snapshot.x-1. Note - a snapshot can always contain data > than
        lastProcessedZxid so the patch is not introducing a new scenario. So I
        wrote a test to verify correctness of this scenario. Any suggestions?

        Can you elaborate on our comment:

        rather set before the data tree is modified):
        Datatree may not have the modification for lastProcessedZxid
        snapshot does not contain change corresponding to lastProcessedZxid) restore
        will not restore the data tree correctly:
        lastProcessedZxid is deserialized
        posting the discussion shortly.

        After going through the code again, I think the following hypothetical
        failure scenario could be a problem: For some reason, modification to
        data tree fails for transactions x to x-n on one of the peer. The
        lastProcessedZxid will still be set to x. If the peer restores after
        snapshot, then it will not have transactions x-n to x in its data
        tree. Now, if the modification had failed due to transient errors (say
        out of memory), then we will not apply transactions x-n to x even if
        we should have. The only other reason we could run into this scenario
        (apart from runtime failures) is if we had a bug in DataTree
        modification code path. This looks like a rare corner case.

        Note - the above problem is not an effect of the patch. it is present
        in current implementation as well.

        Show
        Vishal Kher added a comment - I don't have a test that that will fail without the patch. Earlier, we had a problem because snapshot.x did not contain x. Now, we cannot run into that scenario because we first apply x and then increment lastProcessedZxid. with the patch we can only have a scenario where we have applied x in snapshot.x-1. Note - a snapshot can always contain data > than lastProcessedZxid so the patch is not introducing a new scenario. So I wrote a test to verify correctness of this scenario. Any suggestions? Can you elaborate on our comment: rather set before the data tree is modified): Datatree may not have the modification for lastProcessedZxid snapshot does not contain change corresponding to lastProcessedZxid) restore will not restore the data tree correctly: lastProcessedZxid is deserialized posting the discussion shortly. After going through the code again, I think the following hypothetical failure scenario could be a problem: For some reason, modification to data tree fails for transactions x to x-n on one of the peer. The lastProcessedZxid will still be set to x. If the peer restores after snapshot, then it will not have transactions x-n to x in its data tree. Now, if the modification had failed due to transient errors (say out of memory), then we will not apply transactions x-n to x even if we should have. The only other reason we could run into this scenario (apart from runtime failures) is if we had a bug in DataTree modification code path. This looks like a rare corner case. Note - the above problem is not an effect of the patch. it is present in current implementation as well.
        Hide
        Camille Fournier added a comment -

        Vishal, when I run the new test with or without the patch on DataTree, it succeeds. Do we expect this to fail reliably without the patch?

        Show
        Camille Fournier added a comment - Vishal, when I run the new test with or without the patch on DataTree, it succeeds. Do we expect this to fail reliably without the patch?
        Hide
        Camille Fournier added a comment -

        On vacation will try to check this pm.
        https://issues.apache.org/jira/browse/ZOOKEEPER-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070788#comment-13070788]
        correctly
        ----------------------------------------------------------------------------------
        rather set before the data tree is modified):
        Datatree may not have the modification for lastProcessedZxid
        snapshot does not contain change corresponding to lastProcessedZxid) restore
        will not restore the data tree correctly:
        lastProcessedZxid is deserialized
        posting the discussion shortly.

        Show
        Camille Fournier added a comment - On vacation will try to check this pm. https://issues.apache.org/jira/browse/ZOOKEEPER-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070788#comment-13070788 ] correctly ---------------------------------------------------------------------------------- rather set before the data tree is modified): Datatree may not have the modification for lastProcessedZxid snapshot does not contain change corresponding to lastProcessedZxid) restore will not restore the data tree correctly: lastProcessedZxid is deserialized posting the discussion shortly.
        Hide
        Mahadev konar added a comment -

        Camille/Ben,
        Any one of you mind reviewing this? Would be great if you both can!

        Show
        Mahadev konar added a comment - Camille/Ben, Any one of you mind reviewing this? Would be great if you both can!
        Hide
        Vishal Kher added a comment -

        Ben/Camille can you please review the patch? Thanks.

        Show
        Vishal Kher added a comment - Ben/Camille can you please review the patch? Thanks.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12486671/ZOOKEEPER-1090
        against trunk revision 1146961.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/397//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/397//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/397//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12486671/ZOOKEEPER-1090 against trunk revision 1146961. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/397//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/397//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/397//console This message is automatically generated.
        Hide
        Vishal Kher added a comment -

        patch available for trunk

        Show
        Vishal Kher added a comment - patch available for trunk
        Hide
        Vishal Kher added a comment -

        I am attaching a patch that I think fixes this bug.

        The fix is really simple - change lastProcessedZxid after modifying data tree.

        An alternative approach would be to set zxid while taking snapshot to lastProcessedZxid -1. However, this is error prone since the caller has to remember to decrement the zxid before taking snapshot. I prefer the fix in the patch.

        Show
        Vishal Kher added a comment - I am attaching a patch that I think fixes this bug. The fix is really simple - change lastProcessedZxid after modifying data tree. An alternative approach would be to set zxid while taking snapshot to lastProcessedZxid -1. However, this is error prone since the caller has to remember to decrement the zxid before taking snapshot. I prefer the fix in the patch.
        Hide
        Vishal Kher added a comment -

        I am working on a fix.

        Show
        Vishal Kher added a comment - I am working on a fix.
        Hide
        Mahadev konar added a comment -

        Vishal,
        Since you already fixed ZOOKEEPER-1046, do you want to take a crack at this?

        Show
        Mahadev konar added a comment - Vishal, Since you already fixed ZOOKEEPER-1046 , do you want to take a crack at this?
        Hide
        Vishal Kher added a comment -

        Sorry for the delay. I have a bit of a catchup to do w.r.t what went in for ZOOKEEPER-1046. In general, I think this should be targeted for 3.4. From what I remember, the fix is quite straight forward. I can take this up after I fully understand the fix for ZOOKEEPER-1046.

        Show
        Vishal Kher added a comment - Sorry for the delay. I have a bit of a catchup to do w.r.t what went in for ZOOKEEPER-1046 . In general, I think this should be targeted for 3.4. From what I remember, the fix is quite straight forward. I can take this up after I fully understand the fix for ZOOKEEPER-1046 .
        Hide
        Mahadev konar added a comment -

        Vishal/Camille,
        Should this be a target for 3.4 release?

        Show
        Mahadev konar added a comment - Vishal/Camille, Should this be a target for 3.4 release?
        Hide
        Vishal Kher added a comment -

        Vishal Kher to Benjamin, Camille 9:38 AM (54 minutes ago)

        Hi Ben,

        I was going to open a jira after checking with you guys.

        With your solution won't we run in into the problem that Camille pointed out by marking the snapshot for lastZxid - 1? - Peer fails after taking the snapshot. The snapshot has applied lastZxid, which created a child znode. During recovery it will apply lastZxid, which will faill with NodeExists and we will increment the Cversion of the parent (after fix for ZOOKEEPER-1046). So creation of sequential znodes won't be be really sequential. This is probably acceptable, but should be clearly documented. If not, then from this bug, it looks to me that patch ZOOKEEPER-1046 was also not complete. We may need to conditionally update cversion.

        I think we need a clear document of how snapshots are supposed to work and what to assume while taking snapshots and while recovering from snapshots. It will be very helpful for to verify the correctness of implementation and for future bug fixes.I will open a jira for this bug shortly and post our discussion as well.

        Thanks.

        -Vishal

        Show
        Vishal Kher added a comment - Vishal Kher to Benjamin, Camille 9:38 AM (54 minutes ago) Hi Ben, I was going to open a jira after checking with you guys. With your solution won't we run in into the problem that Camille pointed out by marking the snapshot for lastZxid - 1? - Peer fails after taking the snapshot. The snapshot has applied lastZxid, which created a child znode. During recovery it will apply lastZxid, which will faill with NodeExists and we will increment the Cversion of the parent (after fix for ZOOKEEPER-1046 ). So creation of sequential znodes won't be be really sequential. This is probably acceptable, but should be clearly documented. If not, then from this bug, it looks to me that patch ZOOKEEPER-1046 was also not complete. We may need to conditionally update cversion. I think we need a clear document of how snapshots are supposed to work and what to assume while taking snapshots and while recovering from snapshots. It will be very helpful for to verify the correctness of implementation and for future bug fixes.I will open a jira for this bug shortly and post our discussion as well. Thanks. -Vishal
        Hide
        Vishal Kher added a comment -

        On Thu, Jun 9, 2011 at 1:25 AM, Benjamin Reed wrote:

        you should open a jira on this. (we should also be have this conversation in the context of that jira i'm trying to think through the possible problems. while it is true that lastZxid hasn't been applied before it is set, it may not be in the snapshot. it will be in the txnlog, so it will be recovered if a failure happens. perhaps the simplest thing would be to mark the snapshot as lastZxid-1.

        the test case would be to suspend the datatree.processTxn right after lastZxid is set, and then take a snapshot followed by a crash. right?

        we should make sure that we are handling it correctly, and we should also make sure to add some comments to the code to explain why we are handling it correctly

        ben

        Show
        Vishal Kher added a comment - On Thu, Jun 9, 2011 at 1:25 AM, Benjamin Reed wrote: you should open a jira on this. (we should also be have this conversation in the context of that jira i'm trying to think through the possible problems. while it is true that lastZxid hasn't been applied before it is set, it may not be in the snapshot. it will be in the txnlog, so it will be recovered if a failure happens. perhaps the simplest thing would be to mark the snapshot as lastZxid-1. the test case would be to suspend the datatree.processTxn right after lastZxid is set, and then take a snapshot followed by a crash. right? we should make sure that we are handling it correctly, and we should also make sure to add some comments to the code to explain why we are handling it correctly ben
        Hide
        Vishal Kher added a comment -

        yeah, you are right. I think we should make update of datatree and lastProcessedzxid look atomic.

        On Wed, Jun 8, 2011 at 2:14 PM, Fournier, Camille F. [Tech] wrote:

        So we might over increment a parent cvid... What does that mean for a client? Maybe missing a seq id node number? Seems ok but should probably document.

        C

        From: Vishal Kher
        To: Fournier, Camille F. [Tech]
        Cc: Benjamin Reed
        Sent: Wed Jun 08 13:53:59 2011
        Subject: Re: Race condition while taking snapshot

        This will result in datatree having latest data but lastProcessedZxid not up-to-date. So when you take snapshot.x, the file may contain data for x + 1. But I think this is ok (and this is how snpashot currently works) because during restore we will start applying transactions from x + 1. Some of them will fail (e.,g., create/delete) and we will increment only the zxid on failure.

        I have not analyzed this completely yet. But I think it will be ok.

        Thanks.
        -Vishal

        On Wed, Jun 8, 2011 at 1:33 PM, Fournier, Camille F. [Tech] wrote:

        Yes I think you are right. What do we think the implications of moving the setting of lastProcessedZxid to below the processing of the txn? Looks like not a big problem but I haven’t the mental bandwidth to think it through at the moment.

        C

        Show
        Vishal Kher added a comment - yeah, you are right. I think we should make update of datatree and lastProcessedzxid look atomic. On Wed, Jun 8, 2011 at 2:14 PM, Fournier, Camille F. [Tech] wrote: So we might over increment a parent cvid... What does that mean for a client? Maybe missing a seq id node number? Seems ok but should probably document. C From: Vishal Kher To: Fournier, Camille F. [Tech] Cc: Benjamin Reed Sent: Wed Jun 08 13:53:59 2011 Subject: Re: Race condition while taking snapshot This will result in datatree having latest data but lastProcessedZxid not up-to-date. So when you take snapshot.x, the file may contain data for x + 1. But I think this is ok (and this is how snpashot currently works) because during restore we will start applying transactions from x + 1. Some of them will fail (e.,g., create/delete) and we will increment only the zxid on failure. I have not analyzed this completely yet. But I think it will be ok. Thanks. -Vishal On Wed, Jun 8, 2011 at 1:33 PM, Fournier, Camille F. [Tech] wrote: Yes I think you are right. What do we think the implications of moving the setting of lastProcessedZxid to below the processing of the txn? Looks like not a big problem but I haven’t the mental bandwidth to think it through at the moment. C

          People

          • Assignee:
            Vishal Kher
            Reporter:
            Vishal Kher
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development