ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-770

Slow add_auth calls with multi-threaded client

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0, 3.3.3, 3.4.0
    • Fix Version/s: 3.5.0
    • Component/s: c client, contrib-bindings
    • Labels:
      None
    • Environment:

      ubuntu lucid (10.04), zk trunk (3.4)

      Description

      Calls to add_auth are a bit slow from the c client library. The auth callback typically takes multiple seconds to fire. I instrumented the java, c binding, and python binding with a few log statements to find out where the slowness was occuring ( http://bazaar.launchpad.net/~hazmat/zookeeper/fast-auth-instrumented/revision/647). It looks like when the io thread polls, it doesn't register interest in the incoming packet, so the auth success message from the server and the auth callback are only processed when the poll timeouts. I tried modifying mt_adapter.c so the poll registers interest in both events, this causes a considerably more wakeups but it does address the issue of making add_auth fast. I think the ideal solution would be some sort of additional auth handshake state on the handle, that zookeeper_interest could utilize to suggest both POLLIN|POLLOUT are wanted for subsequent calls to poll during the auth handshake handle state.

      i'm attaching a script that takes 13s or 1.6s for the auth callback depending on the session time out value (which in turn figures into the calculation of the poll timeout).

      1. authtest.py
        1 kB
        Kapil Thangavelu
      2. ZOOKEEPER-770-FIX.patch
        0.4 kB
        Craig Calef
      3. ZOOKEEPER-770.patch
        2 kB
        Craig Calef

        Activity

        Hide
        Kapil Thangavelu added a comment -

        script that demonstrates that time for auth callbacks is dependent on session timeout, which is used to calculate poll timeout.

        Show
        Kapil Thangavelu added a comment - script that demonstrates that time for auth callbacks is dependent on session timeout, which is used to calculate poll timeout.
        Hide
        Craig Calef added a comment -

        This bug was also causing me some problems. After digging into the C code it occured to me that the problem was down through add_auth() to send_info_packet(), which was putting the auth packet on the to_send queue but not waking up the IO thread via adaptor_send_queue() like send_ping() does. The result was the auth packet didn't go out until the zoo_interest() timeout occured (which internally looks to be like 1/3rd the timeout specified in the zookeeper_init, but I don't 100% grok what zoo_interest is doing) Attached is a very simple patch which remedies this problem, and does not have any other appreciable impact. For me it cut the time it took for the add_auth watch to fire to practically instantaneous.

        Show
        Craig Calef added a comment - This bug was also causing me some problems. After digging into the C code it occured to me that the problem was down through add_auth() to send_info_packet(), which was putting the auth packet on the to_send queue but not waking up the IO thread via adaptor_send_queue() like send_ping() does. The result was the auth packet didn't go out until the zoo_interest() timeout occured (which internally looks to be like 1/3rd the timeout specified in the zookeeper_init, but I don't 100% grok what zoo_interest is doing) Attached is a very simple patch which remedies this problem, and does not have any other appreciable impact. For me it cut the time it took for the add_auth watch to fire to practically instantaneous.
        Hide
        Craig Calef added a comment -

        Patch for a 1 line change to src/c/src/zookeeper.c

        Show
        Craig Calef added a comment - Patch for a 1 line change to src/c/src/zookeeper.c
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12482761/ZOOKEEPER-770-FIX.patch
        against trunk revision 1135515.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/323//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/323//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/323//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12482761/ZOOKEEPER-770-FIX.patch against trunk revision 1135515. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/323//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/323//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/323//console This message is automatically generated.
        Hide
        Patrick Hunt added a comment -

        @craig jenkins is almost happy – can you update the patch to include the test from Kapil? Possible to add a c test that verifies this as well? Thanks!

        Show
        Patrick Hunt added a comment - @craig jenkins is almost happy – can you update the patch to include the test from Kapil? Possible to add a c test that verifies this as well? Thanks!
        Hide
        Craig Calef added a comment -

        I have a patch I made to TestClient.cc to test for this condition but I can't get zktest-mt to build on MacOSX.

        Show
        Craig Calef added a comment - I have a patch I made to TestClient.cc to test for this condition but I can't get zktest-mt to build on MacOSX.
        Hide
        Patrick Hunt added a comment -

        this one – ZOOKEEPER-463 ? I'm not on a mac, perhaps you could take that up at some point?

        If you want to add your test to the patch jenkins will balk if it has problems, I'd be willing to look into it if that happens.

        Show
        Patrick Hunt added a comment - this one – ZOOKEEPER-463 ? I'm not on a mac, perhaps you could take that up at some point? If you want to add your test to the patch jenkins will balk if it has problems, I'd be willing to look into it if that happens.
        Hide
        Craig Calef added a comment -

        ZOOKEEPER-463 is the one preventing me from building the cppunit tests on MacOSX-- I'll give trying to fix that one a shot. Going to compile & finalize the tests the ZOOKEEPER-770 fix on Linux if I can't.

        Show
        Craig Calef added a comment - ZOOKEEPER-463 is the one preventing me from building the cppunit tests on MacOSX-- I'll give trying to fix that one a shot. Going to compile & finalize the tests the ZOOKEEPER-770 fix on Linux if I can't.
        Hide
        Craig Calef added a comment -

        Without the fix the test I created raises the expected assertion:
        {{
        tests/TestClient.cc:574: Assertion: assertion failed [Expression: calculate_interval(&started, &completion_timestamp) < 2]
        Failures !!!
        Run: 46 Failure total: 1 Failures: 1 Errors: 0
        }}

        With the fix, the tests complete successfully.
        {{
        OK (46)
        }}

        Show
        Craig Calef added a comment - Without the fix the test I created raises the expected assertion: {{ tests/TestClient.cc:574: Assertion: assertion failed [Expression: calculate_interval(&started, &completion_timestamp) < 2] Failures !!! Run: 46 Failure total: 1 Failures: 1 Errors: 0 }} With the fix, the tests complete successfully. {{ OK (46) }}
        Hide
        Craig Calef added a comment -

        Fix + tests

        Show
        Craig Calef added a comment - Fix + tests
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12482882/ZOOKEEPER-770.patch
        against trunk revision 1136740.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/324//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/324//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/324//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12482882/ZOOKEEPER-770.patch against trunk revision 1136740. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/324//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/324//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/324//console This message is automatically generated.
        Hide
        Patrick Hunt added a comment -

        I'm going to add the py auth test to the patch.

        Show
        Patrick Hunt added a comment - I'm going to add the py auth test to the patch.
        Hide
        Patrick Hunt added a comment -

        Hm, I'm held up because I see this error with the patch applied:

        tests/TestClient.cc:574: Assertion: assertion failed [Expression: calculate_interval(&started, &completion_timestamp) < 2]
        Failures !!!
        Run: 46 Failure total: 1 Failures: 1 Errors: 0

        I verified that the change to zookeeper.c is applied in this case. (linux natty w/quad core, 8gb and ssd) I verified that all tests pass with trunk code...

        Show
        Patrick Hunt added a comment - Hm, I'm held up because I see this error with the patch applied: tests/TestClient.cc:574: Assertion: assertion failed [Expression: calculate_interval(&started, &completion_timestamp) < 2] Failures !!! Run: 46 Failure total: 1 Failures: 1 Errors: 0 I verified that the change to zookeeper.c is applied in this case. (linux natty w/quad core, 8gb and ssd) I verified that all tests pass with trunk code...
        Hide
        Patrick Hunt added a comment -

        I updated the code to output the calculate_interval result - over 3 runs it was 6,5,4

        Also, it would be a bad idea to encode this type of test - we stay away (typ) from tests that rely on timing as an indication of pass/fail. The reason for this is we never know what type hardware it might be run on - in particular jenkins environments are often virtualized.

        Is there another way to verify this?

        Does this indicate that the patch is not working on my system?

        Show
        Patrick Hunt added a comment - I updated the code to output the calculate_interval result - over 3 runs it was 6,5,4 Also, it would be a bad idea to encode this type of test - we stay away (typ) from tests that rely on timing as an indication of pass/fail. The reason for this is we never know what type hardware it might be run on - in particular jenkins environments are often virtualized. Is there another way to verify this? Does this indicate that the patch is not working on my system?
        Hide
        Craig Calef added a comment -

        When I was writing the test I had the same concern about the subjectivity of the timing, but couldn't determine how to test for 'auth was send and received immediately' vs. 'took much longer than it should since it wasn't sent immediately but when the api decided to later on after a poll call timed out'

        I'm also not sure why it didn't work for you (the test shows the obviously bad behavior) I'd suggest perhaps doing a 'make clean' and then 'make zktest-mt' but I'm sure you thought of that. Were you running zktest-mt or zktest-st?

        Show
        Craig Calef added a comment - When I was writing the test I had the same concern about the subjectivity of the timing, but couldn't determine how to test for 'auth was send and received immediately' vs. 'took much longer than it should since it wasn't sent immediately but when the api decided to later on after a poll call timed out' I'm also not sure why it didn't work for you (the test shows the obviously bad behavior) I'd suggest perhaps doing a 'make clean' and then 'make zktest-mt' but I'm sure you thought of that. Were you running zktest-mt or zktest-st?
        Hide
        Patrick Hunt added a comment -

        We might just need to ensure that we test the code path(s), but not verify timing in the test (we could output the information to the test log). It would be up to the person fixing this jira to verify, we wouldn't have automatic checking of regression. (the python code is doing the same thing, so that's out as well (for the patch))

        Another option, can we make the typical time (pre patch) very very long (60sec say) and anything less than that (post patch test) success? That's typically fine even for the slowest of test machines...

        I did try clean (etc...) This was with zktest-mt, I'm not sure why my machine is seeing this... I'm kinda stumped. I might need to look at it in more depth, but unfortunately I don't have the time right now. Any ideas?

        Show
        Patrick Hunt added a comment - We might just need to ensure that we test the code path(s), but not verify timing in the test (we could output the information to the test log). It would be up to the person fixing this jira to verify, we wouldn't have automatic checking of regression. (the python code is doing the same thing, so that's out as well (for the patch)) Another option, can we make the typical time (pre patch) very very long (60sec say) and anything less than that (post patch test) success? That's typically fine even for the slowest of test machines... I did try clean (etc...) This was with zktest-mt, I'm not sure why my machine is seeing this... I'm kinda stumped. I might need to look at it in more depth, but unfortunately I don't have the time right now. Any ideas?
        Hide
        Craig Calef added a comment -

        Without looking at your setup it would be hard to say, since it worked perfectly for me.

        The timing threshold could be set to 1/3rd the connection timeout. This is how long it typically takes without the patch. This is how long it takes when it is 'slow' based on what is going on inside zookeeper_interest() with its "deadlines." The tests have no configuration to account for 'slow' machines, regardless, and the timeout is always set to 10 seconds (and why I thought 2 seconds was reasonable)

        Show
        Craig Calef added a comment - Without looking at your setup it would be hard to say, since it worked perfectly for me. The timing threshold could be set to 1/3rd the connection timeout. This is how long it typically takes without the patch. This is how long it takes when it is 'slow' based on what is going on inside zookeeper_interest() with its "deadlines." The tests have no configuration to account for 'slow' machines, regardless, and the timeout is always set to 10 seconds (and why I thought 2 seconds was reasonable)
        Hide
        Mahadev konar added a comment -

        Moving this out to 3.5. Not a blocker.

        Show
        Mahadev konar added a comment - Moving this out to 3.5. Not a blocker.
        Hide
        Kapil Thangavelu added a comment -

        anyway we can get this one line patch in for zookeeper 3.4 series? its basically preventing writing units tests for apps that make use of the c bindings and auth (a test suite takes forever without the patch).

        Show
        Kapil Thangavelu added a comment - anyway we can get this one line patch in for zookeeper 3.4 series? its basically preventing writing units tests for apps that make use of the c bindings and auth (a test suite takes forever without the patch).

          People

          • Assignee:
            Craig Calef
            Reporter:
            Kapil Thangavelu
          • Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:

              Development