ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1057

zookeeper c-client, connection to offline server fails to successfully fallback to second zk host

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.3.1, 3.3.2, 3.3.3
    • Fix Version/s: 3.4.6, 3.5.0
    • Component/s: c client
    • Labels:
      None
    • Environment:

      Description

      Hello, I'm a contributor for the node.js zookeeper module: https://github.com/yfinkelstein/node-zookeeper
      i'm using zk 3.3.3 for the purposes of this issue, but i have validated it fails on 3.3.1 and 3.3.2

      i'm having an issue when trying to connect when one of my zookeeper servers is offline.
      if the first server attempted is online, all is good.

      if the offline server is attempted first, then the client is never able to connect to any server.
      inside zookeeper.c a connection loss (-4) is received, the socket is closed and buffers are cleaned up, it then attempts the next server in the list, creates a new socket (which gets the same fd as the previously closed socket) and connecting fails, and it continues to fail seemingly forever.
      The nature of this "fail" is not that it gets -4 connection loss errors, but that zookeeper_interest doesn't find anything going on on the socket before the user provided timeout kicks things out. I don't want to have to wait 5 minutes, even if i could make myself.

      this is the message that follows the connection loss:
      2011-04-27 23:18:28,355:13485:ZOO_ERROR@handle_socket_error_msg@1530: Socket [127.0.0.1:5020] zk retcode=-7, errno=60(Operation timed out): connection timed out (exceeded timeout by 3ms)
      2011-04-27 23:18:28,355:13485:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout

      While investigating, i decided to comment out close(zh->fd) in handle_error (zookeeper.c#1153)
      now everything works (obviously i'm leaking an fd). Connection the the second host works immediately.
      this is the behavior i'm looking for, though i clearly don't want to leak the fd, so i'm wondering why the fd re-use is causing this issue.
      close() is not returning an error (i checked even though current code assumes success).

      i'm on osx 10.6.7
      i tried adding a setsockopt so_linger (though i didn't want that to be a solution), it didn't work.

      full debug traces are included in issue here: https://github.com/yfinkelstein/node-zookeeper/issues/6

      1. ZOOKEEPER-1057.patch
        1 kB
        Germán Blanco
      2. ZOOKEEPER-1057.patch
        1 kB
        Germán Blanco
      3. ZOOKEEPER-1057.patch
        1 kB
        Germán Blanco
      4. ZOOKEEPER-1057.patch
        4 kB
        Germán Blanco
      5. ZOOKEEPER-1057-b3.4.patch
        4 kB
        Germán Blanco
      6. ZOOKEEPER-1057.patch
        4 kB
        Michi Mutsuzaki
      7. ZOOKEEPER-1057.patch
        4 kB
        Michi Mutsuzaki

        Activity

        Hide
        Mahadev konar added a comment -

        Woody,
        Would you be uploading a patch for this?

        Show
        Mahadev konar added a comment - Woody, Would you be uploading a patch for this?
        Hide
        Mahadev konar added a comment -

        Not a blocker.

        Show
        Mahadev konar added a comment - Not a blocker.
        Hide
        Mahadev konar added a comment -

        Ben,
        Can you take a look at this? Do you think this should be a blocker for 3.4?

        Show
        Mahadev konar added a comment - Ben, Can you take a look at this? Do you think this should be a blocker for 3.4?
        Hide
        Patrick Hunt added a comment -

        Is this still and issue? Sounds pretty serious.

        Show
        Patrick Hunt added a comment - Is this still and issue? Sounds pretty serious.
        Hide
        Flavio Junqueira added a comment -

        It would be good indeed if someone could have a look at this. The description gives some pretty good hint of where the problem is, so it might not be too hard.

        Show
        Flavio Junqueira added a comment - It would be good indeed if someone could have a look at this. The description gives some pretty good hint of where the problem is, so it might not be too hard.
        Hide
        Flavio Junqueira added a comment -

        I'm not sure what the description mean with reusing the file descriptor. We set zh->fd to -1 after the close in handle_error and when we create a socket it should get a new fd, no?

        Show
        Flavio Junqueira added a comment - I'm not sure what the description mean with reusing the file descriptor. We set zh->fd to -1 after the close in handle_error and when we create a socket it should get a new fd, no?
        Hide
        Javier Manteiga added a comment -

        On holidays. Back on December 17th.

        I will read mails from time to time.

        Show
        Javier Manteiga added a comment - On holidays. Back on December 17th. I will read mails from time to time.
        Hide
        Michi Mutsuzaki added a comment -

        I'll take a look.

        Show
        Michi Mutsuzaki added a comment - I'll take a look.
        Hide
        Michi Mutsuzaki added a comment -

        This patch adds a test to validate that the c client gets connected to the second server in the list if the first server is down when zookeeper_init is called.

        Show
        Michi Mutsuzaki added a comment - This patch adds a test to validate that the c client gets connected to the second server in the list if the first server is down when zookeeper_init is called.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12618115/ZOOKEEPER-1057.patch
        against trunk revision 1549960.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 11 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1819//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1819//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1819//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12618115/ZOOKEEPER-1057.patch against trunk revision 1549960. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 11 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1819//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1819//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1819//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12618115/ZOOKEEPER-1057.patch
        against trunk revision 1549960.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 11 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1820//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1820//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1820//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12618115/ZOOKEEPER-1057.patch against trunk revision 1549960. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 11 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1820//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1820//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1820//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Core tests are failing...

        Show
        Flavio Junqueira added a comment - Core tests are failing...
        Hide
        Michi Mutsuzaki added a comment -

        Trying again.

        Show
        Michi Mutsuzaki added a comment - Trying again.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12618139/ZOOKEEPER-1057.patch
        against trunk revision 1549960.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 8 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1823//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1823//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1823//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12618139/ZOOKEEPER-1057.patch against trunk revision 1549960. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1823//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1823//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1823//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12618139/ZOOKEEPER-1057.patch
        against trunk revision 1549960.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 8 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1824//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1824//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1824//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12618139/ZOOKEEPER-1057.patch against trunk revision 1549960. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 8 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1824//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1824//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1824//console This message is automatically generated.
        Hide
        Germán Blanco added a comment -

        Do we know if this is a problem in the test case or it might be an error in the implementation?
        Because if there is any chance that this is a bug, then I think this JIRA should be marked as a Blocker.

        Show
        Germán Blanco added a comment - Do we know if this is a problem in the test case or it might be an error in the implementation? Because if there is any chance that this is a bug, then I think this JIRA should be marked as a Blocker.
        Hide
        Michi Mutsuzaki added a comment -

        This is a problem with the test case, and we still don't know if there is a bug in the c client. The test is failing because zoo_deterministic_conn_order() is broken with the dynamic reconfig change. I'll mark this as a blocker until we know for sure there is no bug in the client code.

        Show
        Michi Mutsuzaki added a comment - This is a problem with the test case, and we still don't know if there is a bug in the c client. The test is failing because zoo_deterministic_conn_order() is broken with the dynamic reconfig change. I'll mark this as a blocker until we know for sure there is no bug in the client code.
        Hide
        Flavio Junqueira added a comment -

        If this is a change due to reconfig, do we really need to block 3.4.6?

        Show
        Flavio Junqueira added a comment - If this is a change due to reconfig, do we really need to block 3.4.6?
        Hide
        Germán Blanco added a comment -

        ZOOKEEPER-1057-b3.4 is the port of Michi's test case to 3.4 branch. It fails for me.
        It uses a standalone server, instead of TestQuorumServer, since I figured that we just need a server listening on one port to test this.

        Show
        Germán Blanco added a comment - ZOOKEEPER-1057 -b3.4 is the port of Michi's test case to 3.4 branch. It fails for me. It uses a standalone server, instead of TestQuorumServer, since I figured that we just need a server listening on one port to test this.
        Hide
        Germán Blanco added a comment -

        The trunk version of the b3.4 patch, for whatever it is worth.
        I guess it will work just as Michi's patch.

        Show
        Germán Blanco added a comment - The trunk version of the b3.4 patch, for whatever it is worth. I guess it will work just as Michi's patch.
        Hide
        Germán Blanco added a comment -

        The test is simpler and looks better if integrated into TestClient.cc.
        The attached patch can be applied both to trunk and branch 3.4.
        With this version, the test case passes for the single threaded version, but for the multithreaded version it hangs forever (or at least more than a few minutes).

        Show
        Germán Blanco added a comment - The test is simpler and looks better if integrated into TestClient.cc. The attached patch can be applied both to trunk and branch 3.4. With this version, the test case passes for the single threaded version, but for the multithreaded version it hangs forever (or at least more than a few minutes).
        Hide
        Germán Blanco added a comment -

        The attached patch has a proposed test case that passes both in trunk and 3.4.
        It was my mistake, one zookeeper_close too many in the last patch.

        Show
        Germán Blanco added a comment - The attached patch has a proposed test case that passes both in trunk and 3.4. It was my mistake, one zookeeper_close too many in the last patch.
        Hide
        Germán Blanco added a comment -

        ... and now with the deterministic connection order, to make sure it is not just luck that it was working.
        I am very sorry for the spam, I think I need the holidays.

        Show
        Germán Blanco added a comment - ... and now with the deterministic connection order, to make sure it is not just luck that it was working. I am very sorry for the spam, I think I need the holidays.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12619990/ZOOKEEPER-1057.patch
        against trunk revision 1552946.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1856//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1856//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1856//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12619990/ZOOKEEPER-1057.patch against trunk revision 1552946. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1856//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1856//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1856//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        +1, it is good for me. Michi Mutsuzaki, could you give it a look, please?

        Show
        Flavio Junqueira added a comment - +1, it is good for me. Michi Mutsuzaki , could you give it a look, please?
        Hide
        Michi Mutsuzaki added a comment -

        +1 The patch looks good to me, though I thought zoo_deterministic_conn_order was broken in trunk. I'll commit this patch and open a separate jira when I have a test case that shows zoo_deterministic_conn_order is broken.

        Show
        Michi Mutsuzaki added a comment - +1 The patch looks good to me, though I thought zoo_deterministic_conn_order was broken in trunk. I'll commit this patch and open a separate jira when I have a test case that shows zoo_deterministic_conn_order is broken.
        Show
        Michi Mutsuzaki added a comment - trunk: http://svn.apache.org/viewvc?view=revision&revision=1556948 3.4: http://svn.apache.org/viewvc?view=revision&revision=1556949
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in ZooKeeper-trunk #2181 (See https://builds.apache.org/job/ZooKeeper-trunk/2181/)
        ZOOKEEPER-1057. zookeeper c-client, connection to offline server fails to successfully fallback to second zk host (Germán Blanco via michim) (michim: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1556948)

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/c/tests/TestClient.cc
        Show
        Hudson added a comment - SUCCESS: Integrated in ZooKeeper-trunk #2181 (See https://builds.apache.org/job/ZooKeeper-trunk/2181/ ) ZOOKEEPER-1057 . zookeeper c-client, connection to offline server fails to successfully fallback to second zk host (Germán Blanco via michim) (michim: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1556948 ) /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/c/tests/TestClient.cc
        Hide
        Flavio Junqueira added a comment -

        Closing issues after releasing 3.4.6.

        Show
        Flavio Junqueira added a comment - Closing issues after releasing 3.4.6.

          People

          • Assignee:
            Michi Mutsuzaki
            Reporter:
            Woody Anderson
          • Votes:
            2 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development