ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-333

helgrind thread issues identified in mt c client code

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.1, 3.2.0
    • Component/s: c client
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      helgrind generated a number of issues, I pulled a bunch of them. Most are related to the test, some are really issues with the mt zk client code though:

      valgrind --tool=helgrind --log-file=helgrind_mt.out ./zktest-mt

      ==31294== Thread #2: pthread_cond_

      {timed}wait called with un-held mutex
      ==31294== at 0x4027F8F: pthread_cond_wait@* (hg_intercepts.c:560)
      ==31294== by 0x404D881: pthread_cond_wait@GLIBC_2.0 (in /lib/tls/i686/cmov/libpthread-2.8.90.so)
      ==31294== by 0x4028037: pthread_cond_wait@* (hg_intercepts.c:574)
      ==31294== by 0x809EBB7: pthread_cond_wait (PthreadMocks.cc:54)
      ==31294== by 0x80ABCF6: notify_thread_ready (mt_adaptor.c:136)
      ==31294== by 0x80ABE90: do_io (mt_adaptor.c:277)


      ==31294== Possible data race during write of size 4 at 0x42E9A58
      ==31294== at 0x8050D83: terminateZookeeperThreads(_zhandle*) (ZKMocks.cc:518)
      ==31294== by 0x805543B: DeliverWatchersWrapper::call(_zhandle*, int, int, char const*, watcher_object_list**) (ZKMocks.cc:261)
      ==31294== by 0x80520F7: __wrap_deliverWatchers (ZKMocks.cc:220)
      ==31294== by 0x80A287B: process_completions (zookeeper.c:1393)
      ==31294== by 0x80ABDAA: do_completion (mt_adaptor.c:332)

      ==31294== Possible data race during write of size 4 at 0xBEFF5F30
      ==31294== at 0x80589AF: Zookeeper_watchers::ConnectionWatcher::~ConnectionWatcher() (TestWatchers.cc:54)
      ==31294== by 0x805D062: Zookeeper_watchers::testDefaultSessionWatcher1() (TestWatchers.cc:438)
      ==31294== by 0x805608C: CppUnit::TestCaller<Zookeeper_watchers>::runTest() (TestCaller.h:166)


      ==31294== Possible data race during write of size 4 at 0x42EB104
      ==31294== at 0x80A03EE: queue_completion (zookeeper.c:1776)
      ==31294== by 0x80A3A44: zookeeper_process (zookeeper.c:1598)
      ==31294== by 0x80AC00B: do_io (mt_adaptor.c:309)


      ==31294== Thread #29: pthread_cond_{timed}

      wait called with un-held mutex
      ==31294== at 0x4027F8F: pthread_cond_wait@* (hg_intercepts.c:560)
      ==31294== by 0x404D881: pthread_cond_wait@GLIBC_2.0 (in /lib/tls/i686/cmov/libpthread-2.8.90.so)
      ==31294== by 0x4028037: pthread_cond_wait@* (hg_intercepts.c:574)
      ==31294== by 0x809EBB7: pthread_cond_wait (PthreadMocks.cc:54)
      ==31294== by 0x80AB9B3: wait_sync_completion (mt_adaptor.c:82)
      ==31294== by 0x80A1E82: zoo_wget (zookeeper.c:2517)
      ==31294== by 0x80A1F13: zoo_get (zookeeper.c:2497)

      1. helgrind_mt.out
        22 kB
        Mahadev konar
      2. helgrind_mt.out.gz
        16 kB
        Patrick Hunt
      3. ZOOKEEPER-333.patch
        0.6 kB
        Mahadev konar

        Activity

        Hide
        Mahadev konar added a comment -

        i found some more while running helgrind... attacching the filtered output.

        Show
        Mahadev konar added a comment - i found some more while running helgrind... attacching the filtered output.
        Hide
        Chris Darroch added a comment -

        Some of these look to be innocuous on a first glance, such as "wait called with un-held mutex" in wait_sync_completion(). I suppose the issue here might be that wait_sync_completion() doesn't check the return value from pthread_mutex_lock() before proceeding, but in practice I believe one can reasonably assume it always succeeds.

        One thing that might help a bit would be to have a reference to the C code which produced this output because it refers to line numbers in the code. Was the run against 3.1.0 or against a snapshot from trunk? If the latter, do you know the SVN revision number?

        Show
        Chris Darroch added a comment - Some of these look to be innocuous on a first glance, such as "wait called with un-held mutex" in wait_sync_completion(). I suppose the issue here might be that wait_sync_completion() doesn't check the return value from pthread_mutex_lock() before proceeding, but in practice I believe one can reasonably assume it always succeeds. One thing that might help a bit would be to have a reference to the C code which produced this output because it refers to line numbers in the code. Was the run against 3.1.0 or against a snapshot from trunk? If the latter, do you know the SVN revision number?
        Hide
        Mahadev konar added a comment - - edited

        chris,
        the run was from the trunk. I dont have a revision number but i think their hasnt been any code change to the source files after I uploaded the traces. So the line numbers should match.

        Show
        Mahadev konar added a comment - - edited chris, the run was from the trunk. I dont have a revision number but i think their hasnt been any code change to the source files after I uploaded the traces. So the line numbers should match.
        Hide
        Patrick Hunt added a comment -

        This latest gzip'd helgrind output file is from svn trunk:

        At revision 751008.

        Note: the tests hang at the following step:

        Zookeeper_operations::testOperationsAndDisconnectConcurrently1^CKilled

        I had to ctrl-c the test to stop. Perhaps this means helgrind is tickling a bad problem(s)? We should re run helgrind after resolving the basic issues currently in the log, and see if this hang still occurrs. Perhaps you guys could also look at the test in question to see if anything catches your eye? The test should not be failing.

        Show
        Patrick Hunt added a comment - This latest gzip'd helgrind output file is from svn trunk: At revision 751008. Note: the tests hang at the following step: Zookeeper_operations::testOperationsAndDisconnectConcurrently1^CKilled I had to ctrl-c the test to stop. Perhaps this means helgrind is tickling a bad problem(s)? We should re run helgrind after resolving the basic issues currently in the log, and see if this hang still occurrs. Perhaps you guys could also look at the test in question to see if anything catches your eye? The test should not be failing.
        Hide
        Patrick Hunt added a comment -

        ps. I'm on ubuntu intrepid ibex latest, single core cpu.

        Show
        Patrick Hunt added a comment - ps. I'm on ubuntu intrepid ibex latest, single core cpu.
        Hide
        Mahadev konar added a comment -

        Helgrind: Fatal internal error – cannot continue.
        Helgrind: mk_SHVAL_ShR(tset=8192,lset=1): FAILED
        Helgrind: max allowed tset=8191, lset=131071
        Helgrind: program has too many thread sets or lock sets to track.

        this is the error generated by helgrind when it hangs with the tests and is not able to run all of the tests.

        Show
        Mahadev konar added a comment - Helgrind: Fatal internal error – cannot continue. Helgrind: mk_SHVAL_ShR(tset=8192,lset=1): FAILED Helgrind: max allowed tset=8191, lset=131071 Helgrind: program has too many thread sets or lock sets to track. this is the error generated by helgrind when it hangs with the tests and is not able to run all of the tests.
        Hide
        Mahadev konar added a comment -

        most of the traces in the helgrind output seem harmless to me. This patch has some minor changes (declaring the head pointers of lists as volatile, which we keep checking in different threads without locking), which might cause some problems. the others I think is fine.

        Show
        Mahadev konar added a comment - most of the traces in the helgrind output seem harmless to me. This patch has some minor changes (declaring the head pointers of lists as volatile, which we keep checking in different threads without locking), which might cause some problems. the others I think is fine.
        Hide
        Patrick Hunt added a comment -

        Looks good, +1.

        Committed revision 752711.
        Committed revision 752726.

        Show
        Patrick Hunt added a comment - Looks good, +1. Committed revision 752711. Committed revision 752726.
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #253 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/253/)

        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #253 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/253/ )

          People

          • Assignee:
            Mahadev konar
            Reporter:
            Patrick Hunt
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development