ZooKeeper / ZOOKEEPER-542

c-client can spin when server unresponsive

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0, 3.2.1
    • Fix Version/s: 3.3.0
    • Component/s: c client
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Due to a mismatch between zookeeper_interest() and zookeeper_process(), the client can spin (busy-loop) while reconnecting when the ZooKeeper server is unresponsive.

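      In particular, zookeeper_interest() requests ZOOKEEPER_WRITE whenever there is data to be sent, but flush_send_queue() only writes that data when the state is ZOO_CONNECTED_STATE. In ZOO_ASSOCIATING_STATE, write interest therefore stays set while nothing is ever written; the socket remains writable, so every poll() returns immediately and the client spins.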

      This probably doesn't affect production, but I had a runaway process in a development deployment that caused performance issues on the node. This is easy to reproduce in a single node environment by doing a kill -STOP on the server and waiting for the session timeout.

      Patch to be added.

      1. ZOOKEEPER-542.patch
        0.5 kB
        Christian Wiedmann
      2. ZOOKEEPER-542.patch
        2 kB
        Benjamin Reed

        Activity

        Benjamin Reed added a comment -

        +1, good catch and good fix. I'm going to extend the patch slightly by adding a comment that documents how we handle the non-blocking connect (somehow that got deleted long ago).

        Benjamin Reed added a comment -

        added comments

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12421444/ZOOKEEPER-542.patch
        against trunk revision 822065.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/17/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/17/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/17/console

        This message is automatically generated.

        Patrick Hunt added a comment -

        Is a test possible here? It would be great to have one to verify the fix.

        Christian Wiedmann added a comment -

        I don't really know how to write an automated test for this, since the spinning is not observable through the API. The manual test I used is to kill -STOP the server and then wait until the client tries to reconnect, while running strace on the I/O thread (I'm using the Python bindings, by the way). Before the patch, strace shows repeated calls to poll with POLLOUT set on the server fd. After the patch, POLLOUT is not set and there is no spinning.

        Mahadev konar added a comment -

        +1 for the patch. It would be really hard to write a test for this, since we would need a server on which a connect does not complete (and also does not error out quickly). That can be done via SIGSTOP, but it would be rather hard to do in an automated test.

        Mahadev konar added a comment -

        Making it PA (Patch Available) for now; Pat agrees that it's a hard-to-test JIRA.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12421444/ZOOKEEPER-542.patch
        against trunk revision 822065.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/18/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/18/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/18/console

        This message is automatically generated.

        Mahadev konar added a comment -

        I just committed this. Thanks, Christian!

        Hudson added a comment -

        Integrated in ZooKeeper-trunk #491 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/491/)
        . c-client can spin when server unresponsive (Christian Wiedmann via mahadev)


          People

          • Assignee: Christian Wiedmann
          • Reporter: Christian Wiedmann
          • Votes: 0
          • Watchers: 0
