ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-458

connect_index in zookeeper handle might get out of bound.

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 3.6.0
    • Component/s: c client
    • Labels:
      None

      Description

      connect_index in zookeeper handle might get out of bound. the zokoeeper_init method checks for index == count and sets it to zero. If the index becomes greater than count, then it will go out of bounds.

      1. ZOOKEEPER-458.patch
        0.4 kB
        Steven Cheng
      2. ZOOKEEPER-458.patch
        3 kB
        Steven Cheng
      3. ZOOKEEPER-458.patch
        3 kB
        Steven Cheng
      4. ZOOKEEPER-458.patch
        3 kB
        Steven Cheng
      5. ZOOKEEPER-458.patch
        3 kB
        Steven Cheng
      6. ZOOKEEPER-458.patch
        3 kB
        Steven Cheng
      7. ZOOKEEPER-458.patch
        3 kB
        Steven Cheng

        Activity

        Hide
        Mahadev konar added a comment -

        moving it to 3.3 since its not a regression and the connect_index will always be sane but the code just needs to put in extra checks for that.

        Show
        Mahadev konar added a comment - moving it to 3.3 since its not a regression and the connect_index will always be sane but the code just needs to put in extra checks for that.
        Hide
        Steven Cheng added a comment -

        This patch adds a check that resets connect_index at the only point where connect_index gets non-zero values.

        More checks could be added if more places will be added that update connect_index or to handle addrs_count changing.

        Show
        Steven Cheng added a comment - This patch adds a check that resets connect_index at the only point where connect_index gets non-zero values. More checks could be added if more places will be added that update connect_index or to handle addrs_count changing.
        Hide
        Mahadev konar added a comment -

        +1 the patch looks good ...

        steven, would it be possible to add tests to it?

        Show
        Mahadev konar added a comment - +1 the patch looks good ... steven, would it be possible to add tests to it?
        Hide
        Steven Cheng added a comment -

        I think so... I would need to force handle_error to get called to bump up connect_index. Any tips on the best way to do this?

        Looks like handle_error gets called from either a socket error or if there's an auth fail.

        Show
        Steven Cheng added a comment - I think so... I would need to force handle_error to get called to bump up connect_index. Any tips on the best way to do this? Looks like handle_error gets called from either a socket error or if there's an auth fail.
        Hide
        Steven Cheng added a comment -

        Tried to make some test cases that would trigger just one handle_error(), seemed like the easiest way would be to have zk connect to a non-existing server and try an operation.

        The first test gives me a glibc error when I run it

            • glibc detected *** ./zktest-st: corrupted double-linked list: 0x08dc5d70 ***
              ======= Backtrace: =========
              /lib/tls/i686/cmov/libc.so.6[0x401d9ff1]
              /lib/tls/i686/cmov/libc.so.6[0x401db899]
              /lib/tls/i686/cmov/libc.so.6(cfree+0x6d)[0x401de79d]
              [snip]
              [exec] make: *** [run-check] Aborted
              [exec] Zookeeper_operations::testOperationConnectionLoss1 : assertion

        This seems like a bug. I'll look into it some more later.

        Show
        Steven Cheng added a comment - Tried to make some test cases that would trigger just one handle_error(), seemed like the easiest way would be to have zk connect to a non-existing server and try an operation. The first test gives me a glibc error when I run it glibc detected *** ./zktest-st: corrupted double-linked list: 0x08dc5d70 *** ======= Backtrace: ========= /lib/tls/i686/cmov/libc.so.6 [0x401d9ff1] /lib/tls/i686/cmov/libc.so.6 [0x401db899] /lib/tls/i686/cmov/libc.so.6(cfree+0x6d) [0x401de79d] [snip] [exec] make: *** [run-check] Aborted [exec] Zookeeper_operations::testOperationConnectionLoss1 : assertion This seems like a bug. I'll look into it some more later.
        Hide
        Steven Cheng added a comment -

        This patch ensures that 0 <= connect_index < addrs_count whenever connect_index is set to a non-zero value.

        Includes two tests that this property is maintained over disconnects.

        Show
        Steven Cheng added a comment - This patch ensures that 0 <= connect_index < addrs_count whenever connect_index is set to a non-zero value. Includes two tests that this property is maintained over disconnects.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12425744/ZOOKEEPER-458.patch
        against trunk revision 882744.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/72/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/72/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/72/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12425744/ZOOKEEPER-458.patch against trunk revision 882744. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/72/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/72/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/72/console This message is automatically generated.
        Hide
        Mahadev konar added a comment -

        +1... good to see the tests. the test failure with hudson is a known issue.

        Show
        Mahadev konar added a comment - +1... good to see the tests. the test failure with hudson is a known issue.
        Hide
        Mahadev konar added a comment -

        steven,
        I tried running the ccpunit tests with your patch and it core dumps at testOperationConnectionLoss1 (the test you added)....

        Show
        Mahadev konar added a comment - steven, I tried running the ccpunit tests with your patch and it core dumps at testOperationConnectionLoss1 (the test you added)....
        Hide
        Mahadev konar added a comment -

        pressed enter too soon. is it passing for you? Its failing for me consistently. I can help debugging if its passing for you. let me know.

        Show
        Mahadev konar added a comment - pressed enter too soon. is it passing for you? Its failing for me consistently. I can help debugging if its passing for you. let me know.
        Hide
        Steven Cheng added a comment -

        Sorry Mahadev, for some reason an old patch got uploaded instead. I'll make sure to double-check next time.

        I changed the tests so that a server was available and then disconnects, these ones pass for me.

        Show
        Steven Cheng added a comment - Sorry Mahadev, for some reason an old patch got uploaded instead. I'll make sure to double-check next time. I changed the tests so that a server was available and then disconnects, these ones pass for me.
        Hide
        Steven Cheng added a comment -

        Now I'm wondering how the old patch passed? Strange.

        Show
        Steven Cheng added a comment - Now I'm wondering how the old patch passed? Strange.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12426121/ZOOKEEPER-458.patch
        against trunk revision 882744.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/76/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/76/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/76/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12426121/ZOOKEEPER-458.patch against trunk revision 882744. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/76/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/76/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/76/console This message is automatically generated.
        Hide
        Mahadev konar added a comment -

        the old patch didnt run the cppunit tests since it had already failed at the java tests. the patch you just uploaded failed at the testConnectConnectIndex1 (segfaulted). you can take a look at

        http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/76/console

        are the cppunit tests passing for you?

        Show
        Mahadev konar added a comment - the old patch didnt run the cppunit tests since it had already failed at the java tests. the patch you just uploaded failed at the testConnectConnectIndex1 (segfaulted). you can take a look at http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/76/console are the cppunit tests passing for you?
        Hide
        Steven Cheng added a comment -

        These ones are passing for me, and still pass after svn update.

        Can't tell where it segfaults, it should have connected to the server if the zoo_exists returns ZOK?

        Show
        Steven Cheng added a comment - These ones are passing for me, and still pass after svn update. Can't tell where it segfaults, it should have connected to the server if the zoo_exists returns ZOK?
        Hide
        Steven Cheng added a comment -

        Any debugging help you can give would be great Mahadev.

        Show
        Steven Cheng added a comment - Any debugging help you can give would be great Mahadev.
        Hide
        Mahadev konar added a comment -

        the tests pass for me on linux. ill try hudson again to see if the failure is consistent or not.

        Show
        Mahadev konar added a comment - the tests pass for me on linux. ill try hudson again to see if the failure is consistent or not.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12426121/ZOOKEEPER-458.patch
        against trunk revision 888216.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/20/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/20/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/20/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12426121/ZOOKEEPER-458.patch against trunk revision 888216. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/20/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/20/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/20/console This message is automatically generated.
        Hide
        Mahadev konar added a comment -

        I spent some time debugging this. This is the stack trace from the core dump:

        #0 0x00002b101d67f6e3 in ?? () from /lib/libc.so.6
        #1 0x00002b101d680e98 in ?? () from /lib/libc.so.6
        #2 0x00002b101d681276 in free () from /lib/libc.so.6
        #3 0x0000000000407b4e in __wrap_free (p=0xc03d80) at /homes/mahadev/zookeeper-trunk/src/c/tests/LibCMocks.cc:197
        #4 0x0000000000441197 in free_buffer (b=0xbff8d0) at /homes/mahadev/zookeeper-trunk/src/c/src/zookeeper.c:763
        #5 0x0000000000441aa5 in destroy_completion_entry (c=0xc02dd0) at /homes/mahadev/zookeeper-trunk/src/c/src/zookeeper.c:2048
        #6 0x0000000000441dc2 in process_completions (zh=0xc02920) at /homes/mahadev/zookeeper-trunk/src/c/src/zookeeper.c:1727
        #7 0x00000000004457a8 in zookeeper_process (zh=0xc02920, events=-1912149552) at /homes/mahadev/zookeeper-trunk/src/c/src/zookeeper.c:1974
        #8 0x00000000004316dd in yield (zh=0xc02920, seconds=1) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestClient.cc:107
        #9 0x0000000000433406 in watchCtx::waitForConnected (this=0x7fff8e06f100, zh=0xc02920) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestClient.cc:165
        #10 0x000000000043448d in Zookeeper_simpleSystem::testConnectIndex1 (this=0xbfdaa0) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestClient.cc:846
        #11 0x0000000000431a02 in CppUnit::TestCaller<Zookeeper_simpleSystem>::runTest (this=0xbfdd00) at /usr/include/cppunit/TestCaller.h:166
        #12 0x00000000004566fa in CppUnit::TestCaseMethodFunctor::operator() ()
        #13 0x00000000004618e4 in CppUnit::DefaultProtector::protect ()
        #14 0x000000000046315f in CppUnit::ProtectorChain::protect ()
        #15 0x000000000045e1b2 in CppUnit::TestResult::protect ()
        #16 0x00000000004564ba in CppUnit::TestCase::run ()
        #17 0x0000000000463d03 in CppUnit::TestComposite::doRunChildTests ()
        #18 0x0000000000463c26 in CppUnit::TestComposite::run ()
        #19 0x0000000000463d03 in CppUnit::TestComposite::doRunChildTests ()
        #20 0x0000000000463c26 in CppUnit::TestComposite::run ()
        #21 0x000000000045da0a in CppUnit::TestResult::runTest ()
        #22 0x000000000045ff62 in CppUnit::TestRunner::run ()

        I tried debugging but could not find much time.

        Steven,
        can you take a look and see if you find something obvious?

        Show
        Mahadev konar added a comment - I spent some time debugging this. This is the stack trace from the core dump: #0 0x00002b101d67f6e3 in ?? () from /lib/libc.so.6 #1 0x00002b101d680e98 in ?? () from /lib/libc.so.6 #2 0x00002b101d681276 in free () from /lib/libc.so.6 #3 0x0000000000407b4e in __wrap_free (p=0xc03d80) at /homes/mahadev/zookeeper-trunk/src/c/tests/LibCMocks.cc:197 #4 0x0000000000441197 in free_buffer (b=0xbff8d0) at /homes/mahadev/zookeeper-trunk/src/c/src/zookeeper.c:763 #5 0x0000000000441aa5 in destroy_completion_entry (c=0xc02dd0) at /homes/mahadev/zookeeper-trunk/src/c/src/zookeeper.c:2048 #6 0x0000000000441dc2 in process_completions (zh=0xc02920) at /homes/mahadev/zookeeper-trunk/src/c/src/zookeeper.c:1727 #7 0x00000000004457a8 in zookeeper_process (zh=0xc02920, events=-1912149552) at /homes/mahadev/zookeeper-trunk/src/c/src/zookeeper.c:1974 #8 0x00000000004316dd in yield (zh=0xc02920, seconds=1) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestClient.cc:107 #9 0x0000000000433406 in watchCtx::waitForConnected (this=0x7fff8e06f100, zh=0xc02920) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestClient.cc:165 #10 0x000000000043448d in Zookeeper_simpleSystem::testConnectIndex1 (this=0xbfdaa0) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestClient.cc:846 #11 0x0000000000431a02 in CppUnit::TestCaller<Zookeeper_simpleSystem>::runTest (this=0xbfdd00) at /usr/include/cppunit/TestCaller.h:166 #12 0x00000000004566fa in CppUnit::TestCaseMethodFunctor::operator() () #13 0x00000000004618e4 in CppUnit::DefaultProtector::protect () #14 0x000000000046315f in CppUnit::ProtectorChain::protect () #15 0x000000000045e1b2 in CppUnit::TestResult::protect () #16 0x00000000004564ba in CppUnit::TestCase::run () #17 0x0000000000463d03 in CppUnit::TestComposite::doRunChildTests () #18 0x0000000000463c26 in CppUnit::TestComposite::run () #19 0x0000000000463d03 in CppUnit::TestComposite::doRunChildTests () #20 0x0000000000463c26 in CppUnit::TestComposite::run () #21 0x000000000045da0a in CppUnit::TestResult::runTest () #22 0x000000000045ff62 in CppUnit::TestRunner::run () I tried debugging but could not find much time. Steven, can you take a look and see if you find something obvious?
        Hide
        Steven Cheng added a comment -

        Maybe free_buffer is getting called twice on the same structure, changed patch to null out the buffer field.

        If free_buffer is getting called twice, we should see a __wrap_free ... p = 0 in the backtrace.

        It's also possible that the buffer is getting free'd since buffer is shared with the iarchive zookeeper.c:1781 but I couldn't find any paths where the iarchive buffer is free'd by manually tracing through.

        One thing I am confused about is that the segfault happens at the end of the testConnectIndex1 test, but the path that it is taking is processing outstanding synchronous completions. The only synchronous completion that could be there is the zoo_exists call, but this was completed at the beginning of the test, before the server was stopped.

        Show
        Steven Cheng added a comment - Maybe free_buffer is getting called twice on the same structure, changed patch to null out the buffer field. If free_buffer is getting called twice, we should see a __wrap_free ... p = 0 in the backtrace. It's also possible that the buffer is getting free'd since buffer is shared with the iarchive zookeeper.c:1781 but I couldn't find any paths where the iarchive buffer is free'd by manually tracing through. One thing I am confused about is that the segfault happens at the end of the testConnectIndex1 test, but the path that it is taking is processing outstanding synchronous completions. The only synchronous completion that could be there is the zoo_exists call, but this was completed at the beginning of the test, before the server was stopped.
        Hide
        Mahadev konar added a comment -

        I tried running the new patch on the hudson machine. I get a core dump but this time there is no trace of zookeeper internal calls. This looks very fishy!

        #0 0x00002ab7e7ead225 in free () from /lib/libc.so.6
        #1 0x00000000004053fb in __gnu_cxx::new_allocator<std::string>::deallocate (this=0x7fffc3845648, __p=0x3e700000000) at /usr/include/c++/4.3/ext/new_allocator.h:98
        #2 0x0000000000405855 in std::_Deque_base<std::string, std::allocator<std::string> >::_M_deallocate_node (this=0x7fffc3845648, __p=0x3e700000000) at /usr/include/c++/4.3/bits/stl_deque.h:454
        #3 0x0000000000405886 in std::_Deque_base<std::string, std::allocator<std::string> >::_M_destroy_nodes (this=0x7fffc3845648, __nstart=0x13d3368, __nfinish=0x13d3370) at /usr/include/c++/4.3/bits/stl_deque.h:557
        #4 0x000000000040592b in ~_Deque_base (this=0x7fffc3845648) at /usr/include/c++/4.3/bits/stl_deque.h:480
        #5 0x00000000004059a8 in ~deque (this=0x7fffc3845648) at /usr/include/c++/4.3/bits/stl_deque.h:776
        #6 0x00000000004059fa in ~Message (this=0x7fffc3845640) at /usr/include/cppunit/Message.h:39
        #7 0x00000000004351bc in Zookeeper_simpleSystem::testConnectIndex1 (this=0x13cede0) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestClient.cc:846
        #8 0x0000000000431cc6 in CppUnit::TestCaller<Zookeeper_simpleSystem>::runTest (this=0x13cf040) at /usr/include/cppunit/TestCaller.h:166
        #9 0x000000000045bb4a in CppUnit::TestCaseMethodFunctor::operator() ()
        #10 0x0000000000466d34 in CppUnit::DefaultProtector::protect ()
        #11 0x00000000004685af in CppUnit::ProtectorChain::protect ()
        #12 0x0000000000463602 in CppUnit::TestResult::protect ()
        #13 0x000000000045b90a in CppUnit::TestCase::run ()
        #14 0x0000000000469153 in CppUnit::TestComposite::doRunChildTests ()
        #15 0x0000000000469076 in CppUnit::TestComposite::run ()
        #16 0x0000000000469153 in CppUnit::TestComposite::doRunChildTests ()
        #17 0x0000000000469076 in CppUnit::TestComposite::run ()
        #18 0x0000000000462e5a in CppUnit::TestResult::runTest ()
        #19 0x00000000004653b2 in CppUnit::TestRunner::run ()
        #20 0x00000000004047ea in main (argc=1, argv=0x7fffc38462d8) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestDriver.cc:152

        Show
        Mahadev konar added a comment - I tried running the new patch on the hudson machine. I get a core dump but this time there is no trace of zookeeper internal calls. This looks very fishy! #0 0x00002ab7e7ead225 in free () from /lib/libc.so.6 #1 0x00000000004053fb in __gnu_cxx::new_allocator<std::string>::deallocate (this=0x7fffc3845648, __p=0x3e700000000) at /usr/include/c++/4.3/ext/new_allocator.h:98 #2 0x0000000000405855 in std::_Deque_base<std::string, std::allocator<std::string> >::_M_deallocate_node (this=0x7fffc3845648, __p=0x3e700000000) at /usr/include/c++/4.3/bits/stl_deque.h:454 #3 0x0000000000405886 in std::_Deque_base<std::string, std::allocator<std::string> >::_M_destroy_nodes (this=0x7fffc3845648, __nstart=0x13d3368, __nfinish=0x13d3370) at /usr/include/c++/4.3/bits/stl_deque.h:557 #4 0x000000000040592b in ~_Deque_base (this=0x7fffc3845648) at /usr/include/c++/4.3/bits/stl_deque.h:480 #5 0x00000000004059a8 in ~deque (this=0x7fffc3845648) at /usr/include/c++/4.3/bits/stl_deque.h:776 #6 0x00000000004059fa in ~Message (this=0x7fffc3845640) at /usr/include/cppunit/Message.h:39 #7 0x00000000004351bc in Zookeeper_simpleSystem::testConnectIndex1 (this=0x13cede0) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestClient.cc:846 #8 0x0000000000431cc6 in CppUnit::TestCaller<Zookeeper_simpleSystem>::runTest (this=0x13cf040) at /usr/include/cppunit/TestCaller.h:166 #9 0x000000000045bb4a in CppUnit::TestCaseMethodFunctor::operator() () #10 0x0000000000466d34 in CppUnit::DefaultProtector::protect () #11 0x00000000004685af in CppUnit::ProtectorChain::protect () #12 0x0000000000463602 in CppUnit::TestResult::protect () #13 0x000000000045b90a in CppUnit::TestCase::run () #14 0x0000000000469153 in CppUnit::TestComposite::doRunChildTests () #15 0x0000000000469076 in CppUnit::TestComposite::run () #16 0x0000000000469153 in CppUnit::TestComposite::doRunChildTests () #17 0x0000000000469076 in CppUnit::TestComposite::run () #18 0x0000000000462e5a in CppUnit::TestResult::runTest () #19 0x00000000004653b2 in CppUnit::TestRunner::run () #20 0x00000000004047ea in main (argc=1, argv=0x7fffc38462d8) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestDriver.cc:152
        Hide
        Steven Cheng added a comment -

        Really looks like something with free... maybe this patch will be able to catch it better... marks the buffer node by setting len=-1 and checks it later.

        Show
        Steven Cheng added a comment - Really looks like something with free... maybe this patch will be able to catch it better... marks the buffer node by setting len=-1 and checks it later.
        Hide
        Steven Cheng added a comment -

        This patch, rather.

        Show
        Steven Cheng added a comment - This patch, rather.
        Hide
        Mahadev konar added a comment -

        it still fails in the same place and with the same back trace:

        0 0x00002adc6886c225 in free () from /lib/libc.so.6
        #1 0x00000000004053fb in __gnu_cxx::new_allocator<std::string>::deallocate (this=0x7fff42e84c88, __p=0x3e700000000) at /usr/include/c++/4.3/ext/new_allocator.h:98
        #2 0x0000000000405855 in std::_Deque_base<std::string, std::allocator<std::string> >::_M_deallocate_node (this=0x7fff42e84c88, __p=0x3e700000000) at /usr/include/c++/4.3/bits/stl_deque.h:454
        #3 0x0000000000405886 in std::_Deque_base<std::string, std::allocator<std::string> >::_M_destroy_nodes (this=0x7fff42e84c88, __nstart=0x1854368, __nfinish=0x1854370)
        at /usr/include/c++/4.3/bits/stl_deque.h:557
        #4 0x000000000040592b in ~_Deque_base (this=0x7fff42e84c88) at /usr/include/c++/4.3/bits/stl_deque.h:480
        #5 0x00000000004059a8 in ~deque (this=0x7fff42e84c88) at /usr/include/c++/4.3/bits/stl_deque.h:776
        #6 0x00000000004059fa in ~Message (this=0x7fff42e84c80) at /usr/include/cppunit/Message.h:39
        #7 0x00000000004351bc in Zookeeper_simpleSystem::testConnectIndex1 (this=0x184fde0) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestClient.cc:846
        #8 0x0000000000431cc6 in CppUnit::TestCaller<Zookeeper_simpleSystem>::runTest (this=0x1850040) at /usr/include/cppunit/TestCaller.h:166
        #9 0x000000000045bb6a in CppUnit::TestCaseMethodFunctor::operator() ()
        #10 0x0000000000466d54 in CppUnit::DefaultProtector::protect ()
        #11 0x00000000004685cf in CppUnit::ProtectorChain::protect ()
        #12 0x0000000000463622 in CppUnit::TestResult::protect ()
        #13 0x000000000045b92a in CppUnit::TestCase::run ()
        #14 0x0000000000469173 in CppUnit::TestComposite::doRunChildTests ()
        #15 0x0000000000469096 in CppUnit::TestComposite::run ()
        #16 0x0000000000469173 in CppUnit::TestComposite::doRunChildTests ()
        #17 0x0000000000469096 in CppUnit::TestComposite::run ()
        #18 0x0000000000462e7a in CppUnit::TestResult::runTest ()
        #19 0x00000000004653d2 in CppUnit::TestRunner::run ()
        #20 0x00000000004047ea in main (argc=1, argv=0x7fff42e85918) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestDriver.cc:152

        Show
        Mahadev konar added a comment - it still fails in the same place and with the same back trace: 0 0x00002adc6886c225 in free () from /lib/libc.so.6 #1 0x00000000004053fb in __gnu_cxx::new_allocator<std::string>::deallocate (this=0x7fff42e84c88, __p=0x3e700000000) at /usr/include/c++/4.3/ext/new_allocator.h:98 #2 0x0000000000405855 in std::_Deque_base<std::string, std::allocator<std::string> >::_M_deallocate_node (this=0x7fff42e84c88, __p=0x3e700000000) at /usr/include/c++/4.3/bits/stl_deque.h:454 #3 0x0000000000405886 in std::_Deque_base<std::string, std::allocator<std::string> >::_M_destroy_nodes (this=0x7fff42e84c88, __nstart=0x1854368, __nfinish=0x1854370) at /usr/include/c++/4.3/bits/stl_deque.h:557 #4 0x000000000040592b in ~_Deque_base (this=0x7fff42e84c88) at /usr/include/c++/4.3/bits/stl_deque.h:480 #5 0x00000000004059a8 in ~deque (this=0x7fff42e84c88) at /usr/include/c++/4.3/bits/stl_deque.h:776 #6 0x00000000004059fa in ~Message (this=0x7fff42e84c80) at /usr/include/cppunit/Message.h:39 #7 0x00000000004351bc in Zookeeper_simpleSystem::testConnectIndex1 (this=0x184fde0) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestClient.cc:846 #8 0x0000000000431cc6 in CppUnit::TestCaller<Zookeeper_simpleSystem>::runTest (this=0x1850040) at /usr/include/cppunit/TestCaller.h:166 #9 0x000000000045bb6a in CppUnit::TestCaseMethodFunctor::operator() () #10 0x0000000000466d54 in CppUnit::DefaultProtector::protect () #11 0x00000000004685cf in CppUnit::ProtectorChain::protect () #12 0x0000000000463622 in CppUnit::TestResult::protect () #13 0x000000000045b92a in CppUnit::TestCase::run () #14 0x0000000000469173 in CppUnit::TestComposite::doRunChildTests () #15 0x0000000000469096 in CppUnit::TestComposite::run () #16 0x0000000000469173 in CppUnit::TestComposite::doRunChildTests () #17 0x0000000000469096 in CppUnit::TestComposite::run () #18 0x0000000000462e7a in CppUnit::TestResult::runTest () #19 0x00000000004653d2 in CppUnit::TestRunner::run () #20 0x00000000004047ea in main (argc=1, argv=0x7fff42e85918) at /homes/mahadev/zookeeper-trunk/src/c/tests/TestDriver.cc:152
        Hide
        Steven Cheng added a comment -

        I'm out of ideas for now, can't find anything obvious Mahadev, I will think about it some more.

        Show
        Steven Cheng added a comment - I'm out of ideas for now, can't find anything obvious Mahadev, I will think about it some more.
        Hide
        Mahadev konar added a comment -

        I tried increasing the timer on waitForConnected to 20 seconds on so and it seems to pass the tests 4-5 times in a row. So does look like a race condition somewhere.

        
             bool waitForConnected(zhandle_t *zh) {
        -        time_t expires = time(0) + 10;
        +        time_t expires = time(0) + 20;
                 while(!connected && time(0) < expires) {
                     yield(zh, 1);
        
        

        But all the other tests work with a timeout of 10 seconds, so I am really suspicious of the timer being the problem.

        Show
        Mahadev konar added a comment - I tried increasing the timer on waitForConnected to 20 seconds on so and it seems to pass the tests 4-5 times in a row. So does look like a race condition somewhere. bool waitForConnected(zhandle_t *zh) { - time_t expires = time(0) + 10; + time_t expires = time(0) + 20; while (!connected && time(0) < expires) { yield(zh, 1); But all the other tests work with a timeout of 10 seconds, so I am really suspicious of the timer being the problem.
        Hide
        Steven Cheng added a comment -

        Hmm.. I tried with a lower timeout on my computer and couldn't get the failure. I'll look around the timer stuff.

        Show
        Steven Cheng added a comment - Hmm.. I tried with a lower timeout on my computer and couldn't get the failure. I'll look around the timer stuff.
        Hide
        Patrick Hunt added a comment -

        Have you tried running with valgrind?

        Show
        Patrick Hunt added a comment - Have you tried running with valgrind?
        Hide
        Steven Cheng added a comment -

        Valgrind found some interesting things:

        ==2357==
        ==2357== Invalid write of size 4
        ==2357== at 0x8080C9C: zookeeper_process (zookeeper.c:1900)
        ==2357== by 0x806CAD3: yield(_zhandle*, int) (TestClient.cc:107)
        ==2357== by 0x806CF72: watchCtx::waitForConnected(_zhandle*) (TestClient.cc:165)
        ==2357== by 0x8070402: Zookeeper_simpleSystem::testConnectIndex1() (TestClient.cc:846)
        ==2357== by 0x80727BF: CppUnit::TestCaller<Zookeeper_simpleSystem>::runTest() (TestCaller.h:166)
        ==2357== by 0x8091528: CppUnit::TestCaseMethodFunctor::operator()() const (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st)
        ==2357== by 0x809BAAC: CppUnit::DefaultProtector::protect(CppUnit::Functor const&, CppUnit::ProtectorContext const&) (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st)
        ==2357== by 0x809D402: CppUnit::ProtectorChain::ProtectFunctor::operator()() const (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st)
        ==2357== by 0x809D0C1: CppUnit::ProtectorChain::protect(CppUnit::Functor const&, CppUnit::ProtectorContext const&) (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st)
        ==2357== by 0x8098705: CppUnit::TestResult::protect(CppUnit::Functor const&, CppUnit::Test*, std::string const&) (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st)
        ==2357== by 0x809131E: CppUnit::TestCase::run(CppUnit::TestResult*) (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st)
        ==2357== by 0x809DBB2: CppUnit::TestComposite::doRunChildTests(CppUnit::TestResult*) (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st)
        ==2357== Address 0x44db5f4 is 4 bytes inside a block of size 84 free'd
        ==2357== at 0x4024836: free (vg_replace_malloc.c:325)
        ==2357== by 0x804E04B: __wrap_free (LibCMocks.cc:197)
        ==2357== by 0x8087B5C: free_sync_completion (st_adaptor.c:56)
        ==2357== by 0x807E1FC: zoo_wexists (zookeeper.c:2909)
        ==2357== by 0x807E289: zoo_exists (zookeeper.c:2890)
        ==2357== by 0x806FDF6: Zookeeper_simpleSystem::testConnectIndex1() (TestClient.cc:840)

        There are a number that looks like this. In this case it looks like the sync_completion allocated by zoo_wexists is sticking around somewhere after zoo_wexists finishes and free()s it, and gets written to by the zookeeper_process() that gets called by the yield() in waitForConnected() .

        Show
        Steven Cheng added a comment - Valgrind found some interesting things: ==2357== ==2357== Invalid write of size 4 ==2357== at 0x8080C9C: zookeeper_process (zookeeper.c:1900) ==2357== by 0x806CAD3: yield(_zhandle*, int) (TestClient.cc:107) ==2357== by 0x806CF72: watchCtx::waitForConnected(_zhandle*) (TestClient.cc:165) ==2357== by 0x8070402: Zookeeper_simpleSystem::testConnectIndex1() (TestClient.cc:846) ==2357== by 0x80727BF: CppUnit::TestCaller<Zookeeper_simpleSystem>::runTest() (TestCaller.h:166) ==2357== by 0x8091528: CppUnit::TestCaseMethodFunctor::operator()() const (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st) ==2357== by 0x809BAAC: CppUnit::DefaultProtector::protect(CppUnit::Functor const&, CppUnit::ProtectorContext const&) (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st) ==2357== by 0x809D402: CppUnit::ProtectorChain::ProtectFunctor::operator()() const (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st) ==2357== by 0x809D0C1: CppUnit::ProtectorChain::protect(CppUnit::Functor const&, CppUnit::ProtectorContext const&) (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st) ==2357== by 0x8098705: CppUnit::TestResult::protect(CppUnit::Functor const&, CppUnit::Test*, std::string const&) (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st) ==2357== by 0x809131E: CppUnit::TestCase::run(CppUnit::TestResult*) (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st) ==2357== by 0x809DBB2: CppUnit::TestComposite::doRunChildTests(CppUnit::TestResult*) (in /home/steven/workspace/zookeeper-trunk-working/build/test/test-cppunit/zktest-st) ==2357== Address 0x44db5f4 is 4 bytes inside a block of size 84 free'd ==2357== at 0x4024836: free (vg_replace_malloc.c:325) ==2357== by 0x804E04B: __wrap_free (LibCMocks.cc:197) ==2357== by 0x8087B5C: free_sync_completion (st_adaptor.c:56) ==2357== by 0x807E1FC: zoo_wexists (zookeeper.c:2909) ==2357== by 0x807E289: zoo_exists (zookeeper.c:2890) ==2357== by 0x806FDF6: Zookeeper_simpleSystem::testConnectIndex1() (TestClient.cc:840) There are a number that looks like this. In this case it looks like the sync_completion allocated by zoo_wexists is sticking around somewhere after zoo_wexists finishes and free()s it, and gets written to by the zookeeper_process() that gets called by the yield() in waitForConnected() .
        Hide
        Mahadev konar added a comment -

        steve,
        sorry was out on vacation. Did you get a chacne to take a look at it? did you find something?

        Show
        Mahadev konar added a comment - steve, sorry was out on vacation. Did you get a chacne to take a look at it? did you find something?
        Hide
        Steven Cheng added a comment -

        Hi Mahadev, haven't been able to look at it until now, I'll take a look and update on what I find.

        Show
        Steven Cheng added a comment - Hi Mahadev, haven't been able to look at it until now, I'll take a look and update on what I find.
        Hide
        Steven Cheng added a comment -

        Couldn't figure out where it's happening and I don't get quite the same error.

        Show
        Steven Cheng added a comment - Couldn't figure out where it's happening and I don't get quite the same error.
        Hide
        Patrick Hunt added a comment -

        What do we want to do here guys, push to 3.4.0? I'd really like to see this one get closed out, esp after all the work you've done.

        Show
        Patrick Hunt added a comment - What do we want to do here guys, push to 3.4.0? I'd really like to see this one get closed out, esp after all the work you've done.
        Hide
        Mahadev konar added a comment -

        this is quite critical but we wont be able to fix this before 3.3 deadline. Moving it to 3.4.

        Show
        Mahadev konar added a comment - this is quite critical but we wont be able to fix this before 3.3 deadline. Moving it to 3.4.
        Hide
        Mahadev konar added a comment -

        Pushing to 3.5. Not a blocker.

        Show
        Mahadev konar added a comment - Pushing to 3.5. Not a blocker.
        Hide
        Steven Cheng added a comment -

        Assigning this back to Mahadev, I have tried one more run adding the line setting the connect_index back to 0 and the tests pass on a fresh trunk checkout on a different machine.

        Unfortunately, I can't think of anything more to debug this problem.

        Show
        Steven Cheng added a comment - Assigning this back to Mahadev, I have tried one more run adding the line setting the connect_index back to 0 and the tests pass on a fresh trunk checkout on a different machine. Unfortunately, I can't think of anything more to debug this problem.

          People

          • Assignee:
            Mahadev konar
            Reporter:
            Mahadev konar
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development