ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1046

Creating a new sequential node results in a ZNODEEXISTS error

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.3.2, 3.3.3
    • Fix Version/s: 3.3.4, 3.4.0
    • Component/s: server
    • Labels:
    • Environment:

      A 3 node-cluster running Debian squeeze.

    • Tags:
      sequential znodeexists

      Description

      On several occasions, I've seen a create() with the sequential flag set fail with a ZNODEEXISTS error, and I don't think that should ever be possible. In past runs, I've been able to closely inspect the state of the system with the command line client, and saw that the parent znode's cversion is smaller than the sequential number of existing children znode under that parent. In one example:

      [zk:<ip:port>(CONNECTED) 3] stat /zkrsm
      cZxid = 0x5
      ctime = Mon Jan 17 18:28:19 PST 2011
      mZxid = 0x5
      mtime = Mon Jan 17 18:28:19 PST 2011
      pZxid = 0x1d819
      cversion = 120710
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 0
      numChildren = 2955
      

      However, the znode /zkrsm/000000000000002d_record0000120804 existed on disk.

      In a recent run, I was able to capture the Zookeeper logs, and I will attach them to this JIRA. The logs are named as nodeX.<zxid_prefixes>.log, and each new log represents an application process restart.

      Here's the scenario:

      1. There's a cluster with nodes 1,2,3 using zxid 0x3.
      2. All three nodes restart, forming a cluster of zxid 0x4.
      3. Node 3 restarts, leading to a cluster of 0x5.

      At this point, it seems like node 1 is the leader of the 0x5 epoch. In its log (node1.0x4-0x5.log) you can see the first (of many) instances of the following message:

      2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO org.apache.zookeeper.server.PrepRequestProcessor  - Got user-level KeeperException when processing sessionid:0x512f466bd44e0002 type:create cxid:0x4da376ab zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/zkrsm/00000000000000b2_record0001761440 Error:KeeperErrorCode = NodeExists for /zkrsm/00000000000000b2_record0001761440
      

      This then repeats forever as my application isn't expecting to ever get this error message on a sequential node create, and just continually retries. The message even transfers over to node3.0x5-0x6.log once the 0x6 epoch comes into play.

      I don't see anything terribly fishy in the transition between the epochs; the correct snapshots seem to be getting transferred, etc. Unfortunately I don't have a ZK snapshot/log that exhibits the problem when starting with a fresh system.

      Some oddities you might notice in these logs:

      • Between epochs 0x3 and 0x4, the zookeeper IDs of the nodes changed due to a bug in our application code. (They are assigned randomly, but are supposed to be consistent across restarts.)
      • We manage node membership dynamically, and our application restarts the ZooKeeperServer classes whenever a new node wants to join (without restarting the entire application process). This is why you'll see messages like the following in node1.0x4-0x5.log before a new election begins:
        2011-04-11 21:16:00,762 4804 [QuorumPeer:/0.0.0.0:2888] INFO org.apache.zookeeper.server.quorum.Learner  - shutdown called
        
      • There is in fact one of these dynamic membership changes in node1.0x4-0x5.log, just before the 0x4 epoch is formed. I'm not sure how this would be related though, as no transactions are done during this period.
      1. ZOOKEEPER-1046_2.patch
        26 kB
        Benjamin Reed
      2. ZOOKEEPER-1046.patch
        8 kB
        Vishal Kher
      3. ZOOKEEPER-1046.patch
        1 kB
        Jeremy Stribling
      4. ZOOKEEPER-1046.patch1
        13 kB
        Vishal Kher
      5. ZOOKEEPER-1046.tgz
        4.05 MB
        Jeremy Stribling
      6. zookeeper-1046-3
        25 kB
        Camille Fournier
      7. zookeeper-1046-4.patch
        26 kB
        Camille Fournier
      8. zookeeper-1046-5.patch
        28 kB
        Camille Fournier
      9. ZOOKEEPER-1046-6.patch
        8 kB
        Vishal Kher
      10. ZOOKEEPER-1046-for333
        13 kB
        Camille Fournier

        Activity

        Hide
        Jeremy Stribling added a comment -

        The logs are too big to attach, but you can download them from http://pdos.csail.mit.edu/~strib/zk_node_exists.tgz [24 MB, 670 MB unzipped].

        Show
        Jeremy Stribling added a comment - The logs are too big to attach, but you can download them from http://pdos.csail.mit.edu/~strib/zk_node_exists.tgz [24 MB, 670 MB unzipped] .
        Hide
        Camille Fournier added a comment -

        You saw this problem in the past in a single node setup, correct?

        Show
        Camille Fournier added a comment - You saw this problem in the past in a single node setup, correct?
        Hide
        Jeremy Stribling added a comment -

        Correct. Sorry, forgot to mention that.

        Show
        Jeremy Stribling added a comment - Correct. Sorry, forgot to mention that.
        Hide
        Camille Fournier added a comment -

        So, from a first pass through the code, I'm not sure how this is happening. It looks like the parent cversion is always updated when you add a child node, and that is reflected in both the data tree that would be serialized and the transaction log replay (which itself would update the parent node cversion). It would be nice if we could get the log. and snapshot. files from the zk data directories the next time you see this. Can you force it to happen more often by updating your cluster more frequently or bouncing nodes?

        Show
        Camille Fournier added a comment - So, from a first pass through the code, I'm not sure how this is happening. It looks like the parent cversion is always updated when you add a child node, and that is reflected in both the data tree that would be serialized and the transaction log replay (which itself would update the parent node cversion). It would be nice if we could get the log. and snapshot. files from the zk data directories the next time you see this. Can you force it to happen more often by updating your cluster more frequently or bouncing nodes?
        Hide
        Jeremy Stribling added a comment -

        Thanks for looking into. I will grab all the data I can the next time I see this happen, but it's difficult since I've never had it happen on my own setup, only when others report it to me, and by then the data may have been rotated away.

        I haven't found a test to reproduce this yet, but I will try to come up with one when I get some time.

        Show
        Jeremy Stribling added a comment - Thanks for looking into. I will grab all the data I can the next time I see this happen, but it's difficult since I've never had it happen on my own setup, only when others report it to me, and by then the data may have been rotated away. I haven't found a test to reproduce this yet, but I will try to come up with one when I get some time.
        Hide
        Jeremy Stribling added a comment -

        The log and snapshot files for an occurrence of this with a single node using ZK 3.3.2.

        Show
        Jeremy Stribling added a comment - The log and snapshot files for an occurrence of this with a single node using ZK 3.3.2.
        Hide
        Jeremy Stribling added a comment -

        I was wrong – I was able to find the log/snapshot files from a run with Zookeeper 3.3.2 (single node). If you start up Zookeeper 3.3.2 with this as your data dir, you can reproduce the problem with the following command in zkCli:

        # create -s /zkrsm/000000000000002d_record test
        Node already exists: /zkrsm/000000000000002d_record
        

        Unfortunately, I do not have logs for the run that led to this state, I only have logs from after the restart, which shouldn't give you any more info than just running with these snapshots yourself.

        Show
        Jeremy Stribling added a comment - I was wrong – I was able to find the log/snapshot files from a run with Zookeeper 3.3.2 (single node). If you start up Zookeeper 3.3.2 with this as your data dir, you can reproduce the problem with the following command in zkCli: # create -s /zkrsm/000000000000002d_record test Node already exists: /zkrsm/000000000000002d_record Unfortunately, I do not have logs for the run that led to this state, I only have logs from after the restart, which shouldn't give you any more info than just running with these snapshots yourself.
        Hide
        Jeremy Stribling added a comment -

        For the record, I have another example of this happening, this time without a lot complicated cluster dynamics. The scenario is:

        1. There is a 3-node cluster
        2. A sequential node is created with seq. num. X
        3. The leader dies very shortly after that (within a couple of transactions)
        4. A new leader is elected (who has seen all of the transactions committed by the previous leader)
        5. The new leader immediately hits the problem where it returns ZNODEEXISTS when a seq. node is returned

        I have logs from before and after the problem, as well as Zookeeper snapshots that contain the problem, but unfortunately the transaction log containing the actual problematic transaction was rotated away. If anyone wants to see what I do have, let me know and I'll upload them somewhere.

        I will also try to write a simple test that creates lots of sequential nodes, kills the leader, and sees if the problem occurs.

        Show
        Jeremy Stribling added a comment - For the record, I have another example of this happening, this time without a lot complicated cluster dynamics. The scenario is: There is a 3-node cluster A sequential node is created with seq. num. X The leader dies very shortly after that (within a couple of transactions) A new leader is elected (who has seen all of the transactions committed by the previous leader) The new leader immediately hits the problem where it returns ZNODEEXISTS when a seq. node is returned I have logs from before and after the problem, as well as Zookeeper snapshots that contain the problem, but unfortunately the transaction log containing the actual problematic transaction was rotated away. If anyone wants to see what I do have, let me know and I'll upload them somewhere. I will also try to write a simple test that creates lots of sequential nodes, kills the leader, and sees if the problem occurs.
        Hide
        Camille Fournier added a comment -

        Thanks Jeremy.
        Today I ran the single node test with your log above and got even stranger results; instead of giving me NODEEXISTS errors I was simply overwriting nodes. This was through a java ZooKeeper instead of zkCli. Going to look into it further now.

        Show
        Camille Fournier added a comment - Thanks Jeremy. Today I ran the single node test with your log above and got even stranger results; instead of giving me NODEEXISTS errors I was simply overwriting nodes. This was through a java ZooKeeper instead of zkCli. Going to look into it further now.
        Hide
        Andrei Savu added a comment -

        I've seen a similar behavior while writing a strictly increasing sequence of numbers as sequential nodes. You can find the code at the following link [1]. The repository also contains some basic tools for fault injection.

        [1] https://github.com/andreisavu/zookeeper-mq

        Show
        Andrei Savu added a comment - I've seen a similar behavior while writing a strictly increasing sequence of numbers as sequential nodes. You can find the code at the following link [1] . The repository also contains some basic tools for fault injection. [1] https://github.com/andreisavu/zookeeper-mq
        Hide
        Camille Fournier added a comment -

        This one is a bear. I believe what is happening is the following:

        Your code is creating and deleting large numbers of sequential nodes. At time T, it is in the process of deleting a bunch of nodes when ZK decides to take a snapshot of the state.

        When we take a snapshot, we spawn a separate thread and serialize the nodes of the tree in that thread. We get into your /zkrsm node in DataTree.serializeNode, get that node from the tree, synchronize on it, and write out the record of that node including its current cversion (used to generate sequential node information) and the list of children. However, we then release the sync on that node, and attempt to iterate through the children to serialize them out. In the meantime, the other thread is merrily deleting children of this node, increasing the cversion of /zkrsm all the while. So the list of children that we got while serializing the parent is defunct. When we try to serialize these now-deleted children, we see that they are null and continue on.

        Now, you finish this snapshot, delete some more nodes under /zkrsm, create some more sequential nodes under /zkrsm, and crash. When you start back up again, you read that snapshot and start playing through the log transactions after the snapshot zxid. Unfotunately, the first N transactions in your log after the snapshot zxid are deletions of nodes that didn't make it into the snapshot because you deleted them before they could be serialized to the snapshot. We will try to process the delete transaction and get a NoNodeException, but ignore it because we know that can happen due to what I wrote above. But what we don't do is increment the cversion of the parent node after this failed deletion. So our parent's cversion is less than the version it would be if you played just the transaction log through, or of the system before the crash. Now you want to continue creating sequential nodes where you left off, but your cversion is wrong so you try to create a node that already exists. Whoops.

        So, now we just need to fix it. Should we be incrementing the cversion of the parent even on a NoNode exception during txn log replay? I suspect that is the right thing to do. Thoughts?

        Show
        Camille Fournier added a comment - This one is a bear. I believe what is happening is the following: Your code is creating and deleting large numbers of sequential nodes. At time T, it is in the process of deleting a bunch of nodes when ZK decides to take a snapshot of the state. When we take a snapshot, we spawn a separate thread and serialize the nodes of the tree in that thread. We get into your /zkrsm node in DataTree.serializeNode, get that node from the tree, synchronize on it, and write out the record of that node including its current cversion (used to generate sequential node information) and the list of children. However, we then release the sync on that node, and attempt to iterate through the children to serialize them out. In the meantime, the other thread is merrily deleting children of this node, increasing the cversion of /zkrsm all the while. So the list of children that we got while serializing the parent is defunct. When we try to serialize these now-deleted children, we see that they are null and continue on. Now, you finish this snapshot, delete some more nodes under /zkrsm, create some more sequential nodes under /zkrsm, and crash. When you start back up again, you read that snapshot and start playing through the log transactions after the snapshot zxid. Unfotunately, the first N transactions in your log after the snapshot zxid are deletions of nodes that didn't make it into the snapshot because you deleted them before they could be serialized to the snapshot. We will try to process the delete transaction and get a NoNodeException, but ignore it because we know that can happen due to what I wrote above. But what we don't do is increment the cversion of the parent node after this failed deletion. So our parent's cversion is less than the version it would be if you played just the transaction log through, or of the system before the crash. Now you want to continue creating sequential nodes where you left off, but your cversion is wrong so you try to create a node that already exists. Whoops. So, now we just need to fix it. Should we be incrementing the cversion of the parent even on a NoNode exception during txn log replay? I suspect that is the right thing to do. Thoughts?
        Hide
        Jeremy Stribling added a comment -

        Wow, nice sleuthing. Sounds very plausible – we are doing a lot of deletions under that same node, and it could definitely be concurrent with the node creation. But not knowing the code very well, I'll defer to others for confirmation.

        I will try to implement this change locally for our app and see if we run into the problem again. I assume you're talking about the OpCode.delete case in PrepRequestProcessor::pRequest? And copying the cversion increment into a catch clause after the getRecordForPath(path) call?

        Show
        Jeremy Stribling added a comment - Wow, nice sleuthing. Sounds very plausible – we are doing a lot of deletions under that same node, and it could definitely be concurrent with the node creation. But not knowing the code very well, I'll defer to others for confirmation. I will try to implement this change locally for our app and see if we run into the problem again. I assume you're talking about the OpCode.delete case in PrepRequestProcessor::pRequest? And copying the cversion increment into a catch clause after the getRecordForPath(path) call?
        Hide
        Camille Fournier added a comment -

        No, it'll need to be fixed in the DataTree deleteNode method somehow. I'm not really sure the best way to do it yet. But an easy manual test to run is, with the log and snapshot files attached to this ticket, if you start up a zk with the version-2 directory containing just log.1, the cversion of /zkrsm is 84340. If you start it up with log.1 and snapshot.1460d, the cversion of /zkrsm is 84245. If you fix this bug, both should show cversion of 84340.

        Show
        Camille Fournier added a comment - No, it'll need to be fixed in the DataTree deleteNode method somehow. I'm not really sure the best way to do it yet. But an easy manual test to run is, with the log and snapshot files attached to this ticket, if you start up a zk with the version-2 directory containing just log.1, the cversion of /zkrsm is 84340. If you start it up with log.1 and snapshot.1460d, the cversion of /zkrsm is 84245. If you fix this bug, both should show cversion of 84340.
        Hide
        Jeremy Stribling added a comment -

        A quick and dirty patch that seems to fix the problem. Applies from src/ with `patch -p1`.

        Show
        Jeremy Stribling added a comment - A quick and dirty patch that seems to fix the problem. Applies from src/ with `patch -p1`.
        Hide
        Jeremy Stribling added a comment -

        Thanks Camille. The patch I just attached seems to solve the problem for the log/snap combo you point out above. I haven't tested it extensively yet, and it's not very elegant so you probably want to rewrite it and I'm not going to put this into "patch available" yet. But if you have a chance, please let me know if there's anything obviously wrong in this patch.

        Show
        Jeremy Stribling added a comment - Thanks Camille. The patch I just attached seems to solve the problem for the log/snap combo you point out above. I haven't tested it extensively yet, and it's not very elegant so you probably want to rewrite it and I'm not going to put this into "patch available" yet. But if you have a chance, please let me know if there's anything obviously wrong in this patch.
        Hide
        Jeremy Stribling added a comment -

        (By the way, the patch is against the 3.3.3 release, not the trunk.)

        Show
        Jeremy Stribling added a comment - (By the way, the patch is against the 3.3.3 release, not the trunk.)
        Hide
        Vishal Kher added a comment -

        Hi Jeremy, Camille,

        The patch that Jeremy attached will fix the problem, but it will result in incrementing cversion() for every delete operation that fails with NoNodeException. That is, an attempt to delete a node that does not exist will also increment the cversion. Am I correct?

        If yes, then do you think it would be better to change

        public void deleteNode(String path, long zxid)

        to

        public void deleteNode(String path, long zxid, boolean forceIncrementCversion)

        .

        If forceIncrementCversion is set, the code will work similar to Jeremy's patch. ZkDataBase.processTxn() will always set the flag to false. FileTxnSnapLog.processTransaction() will always set the flag to true. We will also need to add third argument to DataTree.processTxn().

        Questions for Camille:

        • Why does this have to be in DataTree.deleteNode() and cannot be part of exception handling at the higher layer? Isn't this a special case for Txn exception handling rather than deleting the node from DT?
        • You mentioned that your test resulted in overwriting the nodes instead of causing NODEXIST. Is this because of the same bug?

        Thanks.

        Show
        Vishal Kher added a comment - Hi Jeremy, Camille, The patch that Jeremy attached will fix the problem, but it will result in incrementing cversion() for every delete operation that fails with NoNodeException. That is, an attempt to delete a node that does not exist will also increment the cversion. Am I correct? If yes, then do you think it would be better to change public void deleteNode(String path, long zxid) to public void deleteNode(String path, long zxid, boolean forceIncrementCversion) . If forceIncrementCversion is set, the code will work similar to Jeremy's patch. ZkDataBase.processTxn() will always set the flag to false. FileTxnSnapLog.processTransaction() will always set the flag to true. We will also need to add third argument to DataTree.processTxn(). Questions for Camille: Why does this have to be in DataTree.deleteNode() and cannot be part of exception handling at the higher layer? Isn't this a special case for Txn exception handling rather than deleting the node from DT? You mentioned that your test resulted in overwriting the nodes instead of causing NODEXIST. Is this because of the same bug? Thanks.
        Hide
        Camille Fournier added a comment -

        I think it could and possibly should be in the exception handling at the higher layer. I haven't had much of a chance to look through it in details yet. I think there's definitely still some work to be done here, if you have a better idea it would be cool to see it.

        Which test are you referring to? I realized after posting that one of my tests was writing to a different node name entirely instead of overwriting a node name so that was a human operator problem.

        Show
        Camille Fournier added a comment - I think it could and possibly should be in the exception handling at the higher layer. I haven't had much of a chance to look through it in details yet. I think there's definitely still some work to be done here, if you have a better idea it would be cool to see it. Which test are you referring to? I realized after posting that one of my tests was writing to a different node name entirely instead of overwriting a node name so that was a human operator problem.
        Hide
        Vishal Kher added a comment -

        Hi Camille, Jeremy,

        I have attached a patch that I think should fix the problem that Camille
        pointed out.

        There are subtle effects of this patch that needs to be carefully
        reviewed:

        1. Previously, DataTree.processTxn() did not return error if
        create/delete failed with nodeexist/nonode exception (the if condition
        was a bit strange also). I think we should not do this. Per my
        understanding, such a failure can occur only when we are replaying the
        transactions. It should not occur when the server is doing live
        transactions (unless there is a bug). So processTxn should return
        error even in this case to the caller. The caller should decided
        whether to ignore these errors or not. In FileTxnSnapLog we handle the
        error by incrementing cversion of the parent. In FinalRequestProcessor
        we do not do anything special, but we do not ignore the error as well.

        2. We could increment cversion in DeleteNode itself (as I mentioned in
        my previous comment). But I think changes in the attached patch is a
        better approach. I would like to confirm whether not incrementing
        cversion in DeleteNode() is going to introduce any synchronization
        issue (since the lock on the parent will be released and then
        reacquired). I don't think there is any problem here, but it will be
        good if you can confirm this.

        I noticed that the same bug can occur while creating a node as
        well. Patch handles create and delete.

        I was able to test this patch using the snapshots and logs that Jermey
        had attached. I had to revert the logs to the state where this bug
        would be seen the first time. I ended up doing: rm *.1d8* *.1d9*

        Then I can see that "create -s /zkrsm/000000000000002d_record test"
        succeeds with this patch (and fails without).

        I have not thought about writing a test for this yet (sounds complex to me).

        Thanks.
        -Vishal

        Show
        Vishal Kher added a comment - Hi Camille, Jeremy, I have attached a patch that I think should fix the problem that Camille pointed out. There are subtle effects of this patch that needs to be carefully reviewed: 1. Previously, DataTree.processTxn() did not return error if create/delete failed with nodeexist/nonode exception (the if condition was a bit strange also). I think we should not do this. Per my understanding, such a failure can occur only when we are replaying the transactions. It should not occur when the server is doing live transactions (unless there is a bug). So processTxn should return error even in this case to the caller. The caller should decided whether to ignore these errors or not. In FileTxnSnapLog we handle the error by incrementing cversion of the parent. In FinalRequestProcessor we do not do anything special, but we do not ignore the error as well. 2. We could increment cversion in DeleteNode itself (as I mentioned in my previous comment). But I think changes in the attached patch is a better approach. I would like to confirm whether not incrementing cversion in DeleteNode() is going to introduce any synchronization issue (since the lock on the parent will be released and then reacquired). I don't think there is any problem here, but it will be good if you can confirm this. I noticed that the same bug can occur while creating a node as well. Patch handles create and delete. I was able to test this patch using the snapshots and logs that Jermey had attached. I had to revert the logs to the state where this bug would be seen the first time. I ended up doing: rm *.1d8* *.1d9* Then I can see that "create -s /zkrsm/000000000000002d_record test" succeeds with this patch (and fails without). I have not thought about writing a test for this yet (sounds complex to me). Thanks. -Vishal
        Hide
        Vishal Kher added a comment -

        Submitting path for trunk. Does not inlcude test.

        Show
        Vishal Kher added a comment - Submitting path for trunk. Does not inlcude test.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12478444/ZOOKEEPER-1046.patch
        against trunk revision 1099329.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/251//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/251//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/251//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12478444/ZOOKEEPER-1046.patch against trunk revision 1099329. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/251//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/251//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/251//console This message is automatically generated.
        Hide
        Camille Fournier added a comment -

        Hey Vishal,

        Looking at the patch at a glance it looks good to me. Would be nice to have just a flat unit test against the DataTree incrementCversion. Is there any way to take those logs that Jeremy provided and turn them into a test for this fix? It would be somewhat heavy I'll admit but it's a somewhat significant bug and I think it warrants a good test.

        Show
        Camille Fournier added a comment - Hey Vishal, Looking at the patch at a glance it looks good to me. Would be nice to have just a flat unit test against the DataTree incrementCversion. Is there any way to take those logs that Jeremy provided and turn them into a test for this fix? It would be somewhat heavy I'll admit but it's a somewhat significant bug and I think it warrants a good test.
        Hide
        Camille Fournier added a comment -

        Hey Vishal,

        Any chance you could get some sort of test for this soon? It looks like it will be a merge conflict with ZOOKEEPER-965 so it would be good to have this in first since it is a rather nasty bug and I'd like to make sure we don't make the same mistake in the multi-txn space.

        Thanks.

        Show
        Camille Fournier added a comment - Hey Vishal, Any chance you could get some sort of test for this soon? It looks like it will be a merge conflict with ZOOKEEPER-965 so it would be good to have this in first since it is a rather nasty bug and I'd like to make sure we don't make the same mistake in the multi-txn space. Thanks.
        Hide
        Vishal Kher added a comment -

        I will upload a test soon.

        Show
        Vishal Kher added a comment - I will upload a test soon.
        Hide
        Vishal Kher added a comment -

        + tests

        Show
        Vishal Kher added a comment - + tests
        Hide
        Vishal Kher added a comment -

        Added tests. Submitting patch for trunk.

        Show
        Vishal Kher added a comment - Added tests. Submitting patch for trunk.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12479099/ZOOKEEPER-1046.patch1
        against trunk revision 1099329.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 16 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/263//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/263//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/263//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12479099/ZOOKEEPER-1046.patch1 against trunk revision 1099329. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 16 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/263//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/263//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/263//console This message is automatically generated.
        Hide
        Camille Fournier added a comment -

        Looks great Vishal! +1 from me

        Show
        Camille Fournier added a comment - Looks great Vishal! +1 from me
        Hide
        Patrick Hunt added a comment -

        It's not clear to me at this point, is this being fixed on 3.3 and trunk? ("fix for" indicates trunk only) Which patch is which? (typically we indicate in the patch file name if it's for a branch, vs the "simple" name if for trunk)

        Show
        Patrick Hunt added a comment - It's not clear to me at this point, is this being fixed on 3.3 and trunk? ("fix for" indicates trunk only) Which patch is which? (typically we indicate in the patch file name if it's for a branch, vs the "simple" name if for trunk)
        Hide
        Benjamin Reed added a comment -

        +1 great job figuring out the problem and fixing it! looks good to me.

        Show
        Benjamin Reed added a comment - +1 great job figuring out the problem and fixing it! looks good to me.
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1189 (See https://builds.apache.org/hudson/job/ZooKeeper-trunk/1189/)
        ZOOKEEPER-1046: Creating a new sequential node results in a ZNODEEXISTS error

        camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1125581
        Files :

        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java
        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/DataTreeTest.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1189 (See https://builds.apache.org/hudson/job/ZooKeeper-trunk/1189/ ) ZOOKEEPER-1046 : Creating a new sequential node results in a ZNODEEXISTS error camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1125581 Files : /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/DataTreeTest.java
        Hide
        Camille Fournier added a comment -

        Patrick, I missed your comment in my rushing on Friday. Do we want to push this into the 3.3 branch as well?

        Show
        Camille Fournier added a comment - Patrick, I missed your comment in my rushing on Friday. Do we want to push this into the 3.3 branch as well?
        Hide
        Benjamin Reed added a comment -

        yes, we should get this into the 3.3 branch.

        Show
        Benjamin Reed added a comment - yes, we should get this into the 3.3 branch.
        Hide
        Camille Fournier added a comment -

        So I'm trying to get this patch working in 3.3, and I notice the following:

        One of the tests in the patch is a modification of LoadFromLogTest, which was introduced to show fixes for ZOOKEEPER-882. ZOOKEEPER-882 claims to have been fixed on 3.3.3 and 3.4, but the test is not there in the 3.3.3 branch. When we port fixes from trunk to branches, are we not also porting the tests?

        Show
        Camille Fournier added a comment - So I'm trying to get this patch working in 3.3, and I notice the following: One of the tests in the patch is a modification of LoadFromLogTest, which was introduced to show fixes for ZOOKEEPER-882 . ZOOKEEPER-882 claims to have been fixed on 3.3.3 and 3.4, but the test is not there in the 3.3.3 branch. When we port fixes from trunk to branches, are we not also porting the tests?
        Hide
        Camille Fournier added a comment -

        3.3 branch patch

        Show
        Camille Fournier added a comment - 3.3 branch patch
        Hide
        Camille Fournier added a comment -

        I think I got the code into a 3.3 patch. If someone could take a look and ok it I will push it back into the branch.

        Show
        Camille Fournier added a comment - I think I got the code into a 3.3 patch. If someone could take a look and ok it I will push it back into the branch.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12481458/ZOOKEEPER-1046-for333
        against trunk revision 1125581.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 14 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/305//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12481458/ZOOKEEPER-1046-for333 against trunk revision 1125581. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 14 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/305//console This message is automatically generated.
        Hide
        Benjamin Reed added a comment -

        i was reviewing this code again, and i think we still have a problem. the txns are supposed to be idempotent, but the way we maintain the cversion is not idempotent. i think it is possible for different replicas to have different cversions for the same znode.

        i think the ultimate fix is to encode the cversion of the parent into the create and delete txns.

        Show
        Benjamin Reed added a comment - i was reviewing this code again, and i think we still have a problem. the txns are supposed to be idempotent, but the way we maintain the cversion is not idempotent. i think it is possible for different replicas to have different cversions for the same znode. i think the ultimate fix is to encode the cversion of the parent into the create and delete txns.
        Hide
        Vishal Kher added a comment -

        Does this imply a transaction log version upgrade?

        Show
        Vishal Kher added a comment - Does this imply a transaction log version upgrade?
        Hide
        Benjamin Reed added a comment -

        yes, it would require a transaction log and protocol upgrade.

        Show
        Benjamin Reed added a comment - yes, it would require a transaction log and protocol upgrade.
        Hide
        Camille Fournier added a comment -

        Do we want to try to do this as part of 3.4? Are you guys addressing this in some of your other patches?

        Show
        Camille Fournier added a comment - Do we want to try to do this as part of 3.4? Are you guys addressing this in some of your other patches?
        Hide
        Benjamin Reed added a comment -

        i 3.4 would be a good time to do it since we are doing a small protocol change in the initial handshake.

        Show
        Benjamin Reed added a comment - i 3.4 would be a good time to do it since we are doing a small protocol change in the initial handshake.
        Hide
        Camille Fournier added a comment -

        OK. So who wants to do this? Should I pull this patch out of trunk and only apply it to 3.3? I'd like to get this cleared up and finalized so we know the bug and the related bugs are all fixed.

        Show
        Camille Fournier added a comment - OK. So who wants to do this? Should I pull this patch out of trunk and only apply it to 3.3? I'd like to get this cleared up and finalized so we know the bug and the related bugs are all fixed.
        Hide
        Benjamin Reed added a comment -

        yes, i think we should stick with this patch for 3.3. i think it might be better to do a patch based on what is already in trunk rather than reverting. we should definitely get the protocol change into 3.4.0. i'll try to get to it today.

        Show
        Benjamin Reed added a comment - yes, i think we should stick with this patch for 3.3. i think it might be better to do a patch based on what is already in trunk rather than reverting. we should definitely get the protocol change into 3.4.0. i'll try to get to it today.
        Hide
        Camille Fournier added a comment -

        Is this going to force me to write some kind of upgrade script for the ZK servers I have running 3.3.3?

        Show
        Camille Fournier added a comment - Is this going to force me to write some kind of upgrade script for the ZK servers I have running 3.3.3?
        Hide
        Benjamin Reed added a comment -

        no, i think we need to back it backwards compatible with 3.3.x, so we need to be able to process txns without the parent cversion. so, no upgrade needed.

        Show
        Benjamin Reed added a comment - no, i think we need to back it backwards compatible with 3.3.x, so we need to be able to process txns without the parent cversion. so, no upgrade needed.
        Hide
        Benjamin Reed added a comment -

        making a blocker to ensure that it gets into 3.4.0

        Show
        Benjamin Reed added a comment - making a blocker to ensure that it gets into 3.4.0
        Hide
        Benjamin Reed added a comment -

        it turns out that the fix is really easy and small and easy to make backwards compatible except for one killer problem: closeSession. The closeSession may do a massive amount of deletes, so potentially we would need to go through and figure out everything to delete and send a cversion for each.

        i've been looking and i think we can simplify things. cversion is only used to pick suffixes for sequential nodes, so we really only need to increment on creates. if we did this, the fix would really become trivial and would also make the numbering of creates sequential.

        of course this is the only way we use it, others may use it for something else. it would be an api semantic break.

        Show
        Benjamin Reed added a comment - it turns out that the fix is really easy and small and easy to make backwards compatible except for one killer problem: closeSession. The closeSession may do a massive amount of deletes, so potentially we would need to go through and figure out everything to delete and send a cversion for each. i've been looking and i think we can simplify things. cversion is only used to pick suffixes for sequential nodes, so we really only need to increment on creates. if we did this, the fix would really become trivial and would also make the numbering of creates sequential. of course this is the only way we use it, others may use it for something else. it would be an api semantic break.
        Hide
        Camille Fournier added a comment -

        It's ok with me to make the change and ignore deletes.

        Show
        Camille Fournier added a comment - It's ok with me to make the change and ignore deletes.
        Hide
        Flavio Junqueira added a comment -

        If cversion counts the number of created children, we can always learn the number of deleted children by subtracting the number of current children from cversion, no? I was also wondering if there is any use case you're aware of in which it needs to have both counted.

        So far the proposal of counting only creations seems good to me.

        Show
        Flavio Junqueira added a comment - If cversion counts the number of created children, we can always learn the number of deleted children by subtracting the number of current children from cversion, no? I was also wondering if there is any use case you're aware of in which it needs to have both counted. So far the proposal of counting only creations seems good to me.
        Hide
        Benjamin Reed added a comment -

        nice observation flavio! i haven't seen anyone using cversion outside of the sequence number on sequence znodes.

        Show
        Benjamin Reed added a comment - nice observation flavio! i haven't seen anyone using cversion outside of the sequence number on sequence znodes.
        Hide
        Patrick Hunt added a comment -

        I strongly oppose changing the semantics of cversion. Certainly not in 3.3, as part of a minor upgrade perhaps. While we (the core code) only use it for sequence numbering, there is no telling what users are using it for, and switching it out from under them is a very bad idea IMO. It may lead to hard to track down issues for users, users who appreciate our rock steady api.

        Can we use Flavio's suggestion to:
        1) do what Ben is suggesting, fix cversion in StatPersisted to only track creates
        2) update the cversion in Stat as Flavio is suggesting before passing to the user

        this would maintain the semantics on the client side. Downside is that it's a bit of a hack (not sure if it would even work, given we have to translate in both directions, to/from user). How bad I don't know - thoughts?

        Show
        Patrick Hunt added a comment - I strongly oppose changing the semantics of cversion. Certainly not in 3.3, as part of a minor upgrade perhaps. While we (the core code) only use it for sequence numbering, there is no telling what users are using it for, and switching it out from under them is a very bad idea IMO. It may lead to hard to track down issues for users, users who appreciate our rock steady api. Can we use Flavio's suggestion to: 1) do what Ben is suggesting, fix cversion in StatPersisted to only track creates 2) update the cversion in Stat as Flavio is suggesting before passing to the user this would maintain the semantics on the client side. Downside is that it's a bit of a hack (not sure if it would even work, given we have to translate in both directions, to/from user). How bad I don't know - thoughts?
        Hide
        Benjamin Reed added a comment -

        two clarifying points:

        • this is not for 3.3. this would be a 3.4 change. we will stick with camille's fix for 3.3
        • we never get the cversion from the user. you can't do conditional ops with it or pass it in any of the calls.
        Show
        Benjamin Reed added a comment - two clarifying points: this is not for 3.3. this would be a 3.4 change. we will stick with camille's fix for 3.3 we never get the cversion from the user. you can't do conditional ops with it or pass it in any of the calls.
        Hide
        Patrick Hunt added a comment -

        Ok, thanks for the clarification. In that case what do you think about this for 3.4+ ? Is it going to be possible to do this right, but also w/o too much overhead? (ie simply) vs the gains of changing the API? 3.4 I'm less worried about the semantic change, but I'd still like to avoid it if reasonably possible...

        Show
        Patrick Hunt added a comment - Ok, thanks for the clarification. In that case what do you think about this for 3.4+ ? Is it going to be possible to do this right, but also w/o too much overhead? (ie simply) vs the gains of changing the API? 3.4 I'm less worried about the semantic change, but I'd still like to avoid it if reasonably possible...
        Hide
        Benjamin Reed added a comment -

        actually, flavio's suggestion can go in quite well. we track the cversion in the persistentStat and there is one function that fills in Stat and we can do the fixup there: cversion*2 - numChildren. it will work and it isn't really that messy. the only thing i don't like is that it might confuse core developers to see a different cversion on disk than they see coming back from requests.

        Show
        Benjamin Reed added a comment - actually, flavio's suggestion can go in quite well. we track the cversion in the persistentStat and there is one function that fills in Stat and we can do the fixup there: cversion*2 - numChildren. it will work and it isn't really that messy. the only thing i don't like is that it might confuse core developers to see a different cversion on disk than they see coming back from requests.
        Hide
        Camille Fournier added a comment -

        I checked in the fix for the 3.3 branch
        svn commit: r1136440

        Show
        Camille Fournier added a comment - I checked in the fix for the 3.3 branch svn commit: r1136440
        Hide
        Patrick Hunt added a comment -

        @breed yet more debt for those looking for things to complain about. Still, I'm willing to take the heat if it's going to benefit users. My suggestion would be to document it in multiple places - in the jute definition file, and in the method you mention. Reference this JIRA in those comments.

        Show
        Patrick Hunt added a comment - @breed yet more debt for those looking for things to complain about. Still, I'm willing to take the heat if it's going to benefit users. My suggestion would be to document it in multiple places - in the jute definition file, and in the method you mention. Reference this JIRA in those comments.
        Hide
        Benjamin Reed added a comment -

        this patch makes the CreateTxn idempotent and uses camille's workaround with older txns.

        Show
        Benjamin Reed added a comment - this patch makes the CreateTxn idempotent and uses camille's workaround with older txns.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12483386/ZOOKEEPER-1046_2.patch
        against trunk revision 1138213.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 30 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The patch appears to cause tar ant target to fail.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/345//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/345//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/345//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483386/ZOOKEEPER-1046_2.patch against trunk revision 1138213. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 30 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The patch appears to cause tar ant target to fail. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/345//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/345//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/345//console This message is automatically generated.
        Hide
        Camille Fournier added a comment -

        To be fair, I only found the bug, Vishal K wrote the fix. Will take a look at this tomorrow am.

        Show
        Camille Fournier added a comment - To be fair, I only found the bug, Vishal K wrote the fix. Will take a look at this tomorrow am.
        Hide
        Camille Fournier added a comment -

        Looked at the patch. Besides a couple extraneous printlns it looks ok to me. Shall I remove these myself and commit this to trunk?

        Show
        Camille Fournier added a comment - Looked at the patch. Besides a couple extraneous printlns it looks ok to me. Shall I remove these myself and commit this to trunk?
        Hide
        Benjamin Reed added a comment -

        oops, i missed those. yeah if you could remove and commit that would be great. btw, does the 3.3 patch still need to go in?

        Show
        Benjamin Reed added a comment - oops, i missed those. yeah if you could remove and commit that would be great. btw, does the 3.3 patch still need to go in?
        Hide
        Camille Fournier added a comment -

        Checked in 3.3 last thursday (1136440)

        Should we be worried about the -1 javac from jenkins? I don't know what that error means since clearly this thing can compile if it can pass tests.

        Show
        Camille Fournier added a comment - Checked in 3.3 last thursday (1136440) Should we be worried about the -1 javac from jenkins? I don't know what that error means since clearly this thing can compile if it can pass tests.
        Hide
        Camille Fournier added a comment -

        removed printlns

        Show
        Camille Fournier added a comment - removed printlns
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12483478/zookeeper-1046-3
        against trunk revision 1138213.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 30 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The patch appears to cause tar ant target to fail.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/347//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/347//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/347//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483478/zookeeper-1046-3 against trunk revision 1138213. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 30 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The patch appears to cause tar ant target to fail. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/347//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/347//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/347//console This message is automatically generated.
        Hide
        Camille Fournier added a comment -

        Ah, here is the error:

        zookeeperbuildcontrib.compile:
        [echo] contrib: loggraph
        [javac] Compiling 34 source files to /grid/0/hudson/hudson-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/build/contrib/loggraph/classes
        [javac] /grid/0/hudson/hudson-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/src/contrib/loggraph/src/java/org/apache/zookeeper/graph/TxnLogSource.java:185: deserializeTxn(byte[],org.apache.zookeeper.txn.TxnHeader) in org.apache.zookeeper.server.util.SerializeUtils cannot be applied to (org.apache.jute.InputArchive,org.apache.zookeeper.txn.TxnHeader)
        [javac] Record r = SerializeUtils.deserializeTxn(iab, hdr);
        [javac] ^
        [javac] /grid/0/hudson/hudson-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/src/contrib/loggraph/src/java/org/apache/zookeeper/graph/TxnLogSource.java:333: deserializeTxn(byte[],org.apache.zookeeper.txn.TxnHeader) in org.apache.zookeeper.server.util.SerializeUtils cannot be applied to (org.apache.jute.InputArchive,org.apache.zookeeper.txn.TxnHeader)
        [javac] Record r = SerializeUtils.deserializeTxn(iab, hdr);
        [javac] ^
        [javac] Note: Some input files use unchecked or unsafe operations.
        [javac] Note: Recompile with -Xlint:unchecked for details.
        [javac] 2 errors

        Show
        Camille Fournier added a comment - Ah, here is the error: zookeeperbuildcontrib.compile: [echo] contrib: loggraph [javac] Compiling 34 source files to /grid/0/hudson/hudson-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/build/contrib/loggraph/classes [javac] /grid/0/hudson/hudson-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/src/contrib/loggraph/src/java/org/apache/zookeeper/graph/TxnLogSource.java:185: deserializeTxn(byte[],org.apache.zookeeper.txn.TxnHeader) in org.apache.zookeeper.server.util.SerializeUtils cannot be applied to (org.apache.jute.InputArchive,org.apache.zookeeper.txn.TxnHeader) [javac] Record r = SerializeUtils.deserializeTxn(iab, hdr); [javac] ^ [javac] /grid/0/hudson/hudson-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/src/contrib/loggraph/src/java/org/apache/zookeeper/graph/TxnLogSource.java:333: deserializeTxn(byte[],org.apache.zookeeper.txn.TxnHeader) in org.apache.zookeeper.server.util.SerializeUtils cannot be applied to (org.apache.jute.InputArchive,org.apache.zookeeper.txn.TxnHeader) [javac] Record r = SerializeUtils.deserializeTxn(iab, hdr); [javac] ^ [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 2 errors
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12483488/zookeeper-1046-4.patch
        against trunk revision 1138213.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 30 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The patch appears to cause tar ant target to fail.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/348//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/348//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/348//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483488/zookeeper-1046-4.patch against trunk revision 1138213. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 30 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The patch appears to cause tar ant target to fail. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/348//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/348//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/348//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12483497/zookeeper-1046-5.patch
        against trunk revision 1138595.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 35 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/349//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/349//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/349//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12483497/zookeeper-1046-5.patch against trunk revision 1138595. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 35 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/349//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/349//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/349//console This message is automatically generated.
        Hide
        Benjamin Reed added a comment -

        +1 good find. sorry i missed the contrib.

        Show
        Benjamin Reed added a comment - +1 good find. sorry i missed the contrib.
        Hide
        Camille Fournier added a comment -

        committed to trunk:
        New Revision: 1138957
        New Revision: 1138958 (jute file)

        Show
        Camille Fournier added a comment - committed to trunk: New Revision: 1138957 New Revision: 1138958 (jute file)
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1223 (See https://builds.apache.org/job/ZooKeeper-trunk/1223/)
        ZOOKEEPER-1046: Creating a new sequential node results in a ZNODEEXISTS error
        ZOOKEEPER-1046: Creating a new sequential node results in a ZNODEEXISTS error

        camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1138958
        Files :

        • /zookeeper/trunk/src/zookeeper.jute

        camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1138957
        Files :

        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/SerializationPerfTest.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/DeserializationPerfTest.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/util/SerializeUtils.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/DataTreeTest.java
        • /zookeeper/trunk/src/contrib/loggraph/src/java/org/apache/zookeeper/graph/TxnLogSource.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ReadOnlyModeTest.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataNode.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnLog.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/LogFormatter.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Follower.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/DataTreeUnitTest.java
        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Learner.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/upgrade/UpgradeSnapShotV1.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Observer.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1223 (See https://builds.apache.org/job/ZooKeeper-trunk/1223/ ) ZOOKEEPER-1046 : Creating a new sequential node results in a ZNODEEXISTS error ZOOKEEPER-1046 : Creating a new sequential node results in a ZNODEEXISTS error camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1138958 Files : /zookeeper/trunk/src/zookeeper.jute camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1138957 Files : /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/SerializationPerfTest.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/DeserializationPerfTest.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/util/SerializeUtils.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/DataTreeTest.java /zookeeper/trunk/src/contrib/loggraph/src/java/org/apache/zookeeper/graph/TxnLogSource.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ReadOnlyModeTest.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataNode.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnLog.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/LogFormatter.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Follower.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/DataTreeUnitTest.java /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Learner.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/upgrade/UpgradeSnapShotV1.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Observer.java
        Hide
        Mahadev konar added a comment -

        Camille/Ben, did you forget to resolve thei jira? Is this already committed to 3.4 release?

        Show
        Mahadev konar added a comment - Camille/Ben, did you forget to resolve thei jira? Is this already committed to 3.4 release?
        Hide
        Camille Fournier added a comment -

        trunk:
        New Revision: 1138957
        New Revision: 1138958 (jute file)
        3.3:
        svn commit: r1136440

        Show
        Camille Fournier added a comment - trunk: New Revision: 1138957 New Revision: 1138958 (jute file) 3.3: svn commit: r1136440
        Hide
        Vishal Kher added a comment -

        Hi Ben, Camille,

        Sorry for the delayed response. I don't think this patch fixes the bug entirely.

        — src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java (revision 1138571)
        +++ src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java (working copy)
        [...]
        + if ((hdr.getType() == OpCode.create &&
        + rc.err == Code.NODEEXISTS.intValue()) &&
        + ((CreateTxn)txn).getParentCVersion() == -1) {

        The if condition above is applied only when we are using older version of CreateTxn. Don't we need the part that sets the value for cversion when ((CreateTxn)txn).getParentCVersion() is > 0? Looks like we need an else here that sets parent cversion to the cversion in createTxn. Am I missing something?

        -Vishal

        Show
        Vishal Kher added a comment - Hi Ben, Camille, Sorry for the delayed response. I don't think this patch fixes the bug entirely. — src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java (revision 1138571) +++ src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java (working copy) [...] + if ((hdr.getType() == OpCode.create && + rc.err == Code.NODEEXISTS.intValue()) && + ((CreateTxn)txn).getParentCVersion() == -1) { The if condition above is applied only when we are using older version of CreateTxn. Don't we need the part that sets the value for cversion when ((CreateTxn)txn).getParentCVersion() is > 0? Looks like we need an else here that sets parent cversion to the cversion in createTxn. Am I missing something? -Vishal
        Hide
        Vishal Kher added a comment -

        Reopening this bug based on my previous comment.

        Show
        Vishal Kher added a comment - Reopening this bug based on my previous comment.
        Hide
        Camille Fournier added a comment -

        I think you are right, Vishal, good catch. The perils of making changes without reproducible tests! Do you want to submit a fix for it?

        Show
        Camille Fournier added a comment - I think you are right, Vishal, good catch. The perils of making changes without reproducible tests! Do you want to submit a fix for it?
        Hide
        Vishal Kher added a comment -

        I can work on the patch. I will see if we can have a test that fails without the patch.

        Show
        Vishal Kher added a comment - I can work on the patch. I will see if we can have a test that fails without the patch.
        Hide
        Vishal Kher added a comment -


        Attaching patch for trunk. For some reason some of the core tests are failing for me (and also on hudson). The test that is relevant to this patch is passing.

        Show
        Vishal Kher added a comment - Attaching patch for trunk. For some reason some of the core tests are failing for me (and also on hudson). The test that is relevant to this patch is passing.
        Hide
        Vishal Kher added a comment -

        for trunk.

        Show
        Vishal Kher added a comment - for trunk.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12486118/ZOOKEEPER-1046-6.patch
        against trunk revision 1144087.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 16 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/389//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/389//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/389//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12486118/ZOOKEEPER-1046-6.patch against trunk revision 1144087. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 16 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/389//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/389//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/389//console This message is automatically generated.
        Hide
        Camille Fournier added a comment -

        Looks great Vishal, thanks for seeing this through. Only question is whether we want to increase the log level to info from debug in the case that we see the NODEEXISTS failure in FileTxnSnapLog. I suspect we want to leave it at debug, if you agree I'll just correct that and check it in.

        Show
        Camille Fournier added a comment - Looks great Vishal, thanks for seeing this through. Only question is whether we want to increase the log level to info from debug in the case that we see the NODEEXISTS failure in FileTxnSnapLog. I suspect we want to leave it at debug, if you agree I'll just correct that and check it in.
        Hide
        Vishal Kher added a comment -

        Hi Camille,

        Thanks. I am ok with that change. I am hoping it won't be too much of an effort for you as last time to merge this in 3.3.

        Show
        Vishal Kher added a comment - Hi Camille, Thanks. I am ok with that change. I am hoping it won't be too much of an effort for you as last time to merge this in 3.3.
        Hide
        Camille Fournier added a comment -

        Nah, since it only applies to the 3.4 release (ben's change is just for 3.4), it is no effort at all. Thanks!

        Show
        Camille Fournier added a comment - Nah, since it only applies to the 3.4 release (ben's change is just for 3.4), it is no effort at all. Thanks!
        Hide
        Camille Fournier added a comment -

        Checked in Vishal's fix: r1146025 - in /zookeeper/trunk: ./ src/java/main/org/apache/zookeeper/server/ src/java/main/org/apache/zookeeper/server/persistence/ src/java/test/org/apache/zookeeper/test/

        Show
        Camille Fournier added a comment - Checked in Vishal's fix: r1146025 - in /zookeeper/trunk: ./ src/java/main/org/apache/zookeeper/server/ src/java/main/org/apache/zookeeper/server/persistence/ src/java/test/org/apache/zookeeper/test/
        Hide
        Mahadev konar added a comment -

        camille/vishal,
        Are we waiting for this to be ported to 3.3 branch before resolving this?

        Show
        Mahadev konar added a comment - camille/vishal, Are we waiting for this to be ported to 3.3 branch before resolving this?
        Hide
        Camille Fournier added a comment -

        I'm waiting on the build to run successfully before I resolve it.

        Show
        Camille Fournier added a comment - I'm waiting on the build to run successfully before I resolve it.
        Hide
        Mahadev konar added a comment -

        great, thanks!

        Show
        Mahadev konar added a comment - great, thanks!
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1243 (See https://builds.apache.org/job/ZooKeeper-trunk/1243/)
        ZOOKEEPER-1046: Creating a new sequential node results in a ZNODEEXISTS error

        camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146025
        Files :

        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java
        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/DataTreeTest.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1243 (See https://builds.apache.org/job/ZooKeeper-trunk/1243/ ) ZOOKEEPER-1046 : Creating a new sequential node results in a ZNODEEXISTS error camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146025 Files : /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/DataTreeTest.java
        Hide
        Ted Dunning added a comment -

        Camille,

        I think you are right. The code for multi needs to handle whatever it has to in order to guarantee idempotency.

        Show
        Ted Dunning added a comment - Camille, I think you are right. The code for multi needs to handle whatever it has to in order to guarantee idempotency.

          People

          • Assignee:
            Vishal Kher
            Reporter:
            Jeremy Stribling
          • Votes:
            2 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development