ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1412

java client watches inconsistently triggered on reconnect

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.3.3, 3.3.4, 3.4.0, 3.4.1, 3.4.2, 3.4.3
    • Fix Version/s: 3.3.5, 3.4.4, 3.5.0
    • Component/s: server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I've observed an inconsistent behavior in java client watches. The inconsistency relates to the behavior after the client reconnects to the zookeeper ensemble.

      After the client reconnects to the ensemble only those watches should trigger which should have been triggered also if the connections was not lost. This means if I watch for changes in node /foo and there is no change there than my watch should not be triggered on reconnecting to the ensemble.
      This is not always the case in the java client.

      I've debugged the issues and I could locate the case when the watch is always triggered on reconnect. This is consistently happening if I connect to a follower in the ensemble and I don't do any operation which goes through the leader.
      Looking at the code I see that the client stores the lastzxid and sends that with its request. This is 0 on startup and will be updated everytime from the server replies. This lastzxid is also sent to the server after reconnect together with watches. The server decides which watch to trigger based on this lastzxid probably because that should mean the last known state of the client. If this lastzxid is 0 than all the watches are triggered.
      I've checked why is this lastzxid 0. I thought it shouldn't be since there was already a request to the server to set the watch and in the reply the server could have sent back the zxid but it turns out that it sends just 0. Looking at the server code I see that for requests which doesn't go through the leader the follower server just sends back the same zxid that the client sent.

      1. ZOOKEEPER-1412_br33.patch
        6 kB
        Patrick Hunt
      2. ZOOKEEPER-1412_br33.patch
        6 kB
        Patrick Hunt
      3. ZOOKEEPER-1412_br34.patch
        8 kB
        Patrick Hunt
      4. ZOOKEEPER-1412_br34.patch
        8 kB
        Patrick Hunt
      5. ZOOKEEPER-1412_trunk.patch
        8 kB
        Patrick Hunt
      6. ZOOKEEPER-1412_trunk.patch
        8 kB
        Patrick Hunt

        Issue Links

          Activity

          Hide
          Patrick Hunt added a comment -

          Hi César, I just tried reproducing using the java cli (zk shell) client and was unable (iirc I was able to reproduce previously) using the latest on branch-3.4 (includes this fix).

          started client against a follower (3 server ensemble)

          WatchedEvent state:SyncConnected type:None path:null
          [zk: (CONNECTED) 0] ls /foo 1

          killed the server
          WatchedEvent state:Disconnected type:None path:null

          restarted the server
          WatchedEvent state:SyncConnected type:None path:null

          that was it

          I then started a second client and deleted /foo, which caused

          WatchedEvent state:SyncConnected type:NodeDeleted path:/foo

          so afaict it looks right to me.

          If you still see issues please open a separate jira. A test that reproduces this would also be useful.

          Show
          Patrick Hunt added a comment - Hi César, I just tried reproducing using the java cli (zk shell) client and was unable (iirc I was able to reproduce previously) using the latest on branch-3.4 (includes this fix). started client against a follower (3 server ensemble) WatchedEvent state:SyncConnected type:None path:null [zk: (CONNECTED) 0] ls /foo 1 killed the server WatchedEvent state:Disconnected type:None path:null restarted the server WatchedEvent state:SyncConnected type:None path:null that was it I then started a second client and deleted /foo, which caused WatchedEvent state:SyncConnected type:NodeDeleted path:/foo so afaict it looks right to me. If you still see issues please open a separate jira. A test that reproduces this would also be useful.
          Hide
          César Álvarez Núñez added a comment -

          Once applied this fix, a NodeDataChanged is received instead of a NodeCreated.
          More information in user mailing list thread "Unexpected NodeCreated with 3.4.3?"
          Affected versions: 3.3.5 and 3.4.3 + ZK-1412

          Show
          César Álvarez Núñez added a comment - Once applied this fix, a NodeDataChanged is received instead of a NodeCreated. More information in user mailing list thread "Unexpected NodeCreated with 3.4.3?" Affected versions: 3.3.5 and 3.4.3 + ZK-1412
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1493 (See https://builds.apache.org/job/ZooKeeper-trunk/1493/)
          ZOOKEEPER-1412. java client watches inconsistently triggered on reconnect (phunt) (Revision 1301089)

          Result = SUCCESS
          phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1301089
          Files :

          • /zookeeper/trunk/CHANGES.txt
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/TestableZooKeeper.java
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/FollowerResyncConcurrencyTest.java
          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #1493 (See https://builds.apache.org/job/ZooKeeper-trunk/1493/ ) ZOOKEEPER-1412 . java client watches inconsistently triggered on reconnect (phunt) (Revision 1301089) Result = SUCCESS phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1301089 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/TestableZooKeeper.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/FollowerResyncConcurrencyTest.java
          Hide
          Mahadev konar added a comment -

          I think its ok to go with the current patch for 3.3 line and update the patch to match the zxid's for 3.4 and trunk. I agree with Pat on keeping the changes minimal for 3.3 line.

          Show
          Mahadev konar added a comment - I think its ok to go with the current patch for 3.3 line and update the patch to match the zxid's for 3.4 and trunk. I agree with Pat on keeping the changes minimal for 3.3 line.
          Hide
          Patrick Hunt added a comment -

          Honestly I didn't have time to look into all of these scenarios, but I have a concern wrt introducing this change into 3.3 code line, I'd be more comfortable look at such a change for 3.4 or 3.5. For example, if we did introduce this further changes, is there any chance to get a ping response which updates the lastzxid, then disconnect from the server before getting a notification that would then be missed on reconnect?

          I'd like to cut a 3.3.5 with what I have so far, which clearly fixes the issue identified (and 3.3.5 fixes another critical issues to get in user's hands). We can then look at further improvements in a separate jira. Make sense?

          Show
          Patrick Hunt added a comment - Honestly I didn't have time to look into all of these scenarios, but I have a concern wrt introducing this change into 3.3 code line, I'd be more comfortable look at such a change for 3.4 or 3.5. For example, if we did introduce this further changes, is there any chance to get a ping response which updates the lastzxid, then disconnect from the server before getting a notification that would then be missed on reconnect? I'd like to cut a 3.3.5 with what I have so far, which clearly fixes the issue identified (and 3.3.5 fixes another critical issues to get in user's hands). We can then look at further improvements in a separate jira. Make sense?
          Hide
          Benjamin Reed added a comment -

          i agree with botond that we should also fix the java client to update lastzxid on every message. it might be worth doing in a separate jira.

          Show
          Benjamin Reed added a comment - i agree with botond that we should also fix the java client to update lastzxid on every message. it might be worth doing in a separate jira.
          Hide
          Botond Hejj added a comment -

          Thanks for the patch Patrick.

          As I mentioned in my second mail I see a difference between java and c client in regards how they update lastzxid. The c client updates it for all operations and java not. This bug was not surfacing when c client was used because c client updates the lastzxid on every ping.

          Do you see any problem with this inconsistency between the c and java client? Is it worth changing one to behave like the other?

          Show
          Botond Hejj added a comment - Thanks for the patch Patrick. As I mentioned in my second mail I see a difference between java and c client in regards how they update lastzxid. The c client updates it for all operations and java not. This bug was not surfacing when c client was used because c client updates the lastzxid on every ping. Do you see any problem with this inconsistency between the c and java client? Is it worth changing one to behave like the other?
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12518111/ZOOKEEPER-1412_trunk.patch
          against trunk revision 1297740.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/991//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/991//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/991//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12518111/ZOOKEEPER-1412_trunk.patch against trunk revision 1297740. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/991//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/991//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/991//console This message is automatically generated.
          Hide
          Patrick Hunt added a comment -

          Ok, updated with comments. Good to see I was right (at least initially). thanks.

          Show
          Patrick Hunt added a comment - Ok, updated with comments. Good to see I was right (at least initially). thanks.
          Hide
          Mahadev konar added a comment -

          Looks good. +1 for the patch (pending fixing ben's comments ).

          Show
          Mahadev konar added a comment - Looks good. +1 for the patch (pending fixing ben's comments ).
          Hide
          Benjamin Reed added a comment -

          +1 looks good pat. my one small comment is that you comment one of the methods in TestableZooKeeper, but not the other. since it is kind of a public interface it might be good to comment both.

          Show
          Benjamin Reed added a comment - +1 looks good pat. my one small comment is that you comment one of the methods in TestableZooKeeper, but not the other. since it is kind of a public interface it might be good to comment both.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12518091/ZOOKEEPER-1412_trunk.patch
          against trunk revision 1297740.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/990//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/990//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/990//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12518091/ZOOKEEPER-1412_trunk.patch against trunk revision 1297740. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/990//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/990//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/990//console This message is automatically generated.
          Hide
          Patrick Hunt added a comment -

          On second thought it does look like this there's a problem with this approach. If the follower sees a commit subsequent to doing the read op, but prior to getting the lastzxid in FRP the client will end up getting an invalid lastzxid. I'll have to look at how to handle this a different way. Canceling for now. Any ideas? For the write case the leader is updating the request zxid, perhaps we need to do that here. The alternative is to have the read ops return the lastzxid, but that's going to be pretty disruptive...

          Show
          Patrick Hunt added a comment - On second thought it does look like this there's a problem with this approach. If the follower sees a commit subsequent to doing the read op, but prior to getting the lastzxid in FRP the client will end up getting an invalid lastzxid. I'll have to look at how to handle this a different way. Canceling for now. Any ideas? For the write case the leader is updating the request zxid, perhaps we need to do that here. The alternative is to have the read ops return the lastzxid, but that's going to be pretty disruptive...
          Hide
          Patrick Hunt added a comment -

          These three patches fix the issue on the respective branches. I verified the tests failed before the fix, and passed subsequent to the fix. All tests are passing for me on the three branches.

          I believe accessing/returning the lastzxid as I do in FinalRequestProcessor is valid, but it took me some time to convince myself. Please double check that the lastzxid I send back to the client is the one corresponding to the zxid at the time the read was performed

          Show
          Patrick Hunt added a comment - These three patches fix the issue on the respective branches. I verified the tests failed before the fix, and passed subsequent to the fix. All tests are passing for me on the three branches. I believe accessing/returning the lastzxid as I do in FinalRequestProcessor is valid, but it took me some time to convince myself. Please double check that the lastzxid I send back to the client is the one corresponding to the zxid at the time the read was performed
          Hide
          Patrick Hunt added a comment -

          This is a serious issue, although it does have a pretty straightforward workaround. I'll respin a new 3.3.5 rc that includes a fix for this (I'm working on it).

          Show
          Patrick Hunt added a comment - This is a serious issue, although it does have a pretty straightforward workaround. I'll respin a new 3.3.5 rc that includes a fix for this (I'm working on it).

            People

            • Assignee:
              Patrick Hunt
              Reporter:
              Botond Hejj
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development