ZooKeeper / ZOOKEEPER-740

zkpython leading to segfault on zookeeper

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The program that we are implementing uses the Python binding for ZooKeeper, but it sometimes crashes with a segfault. Here is the backtrace from gdb:

      Program received signal SIGSEGV, Segmentation fault.
      [Switching to Thread 0xad244b70 (LWP 28216)]
      0x080611d5 in PyObject_Call (func=0x862fab0, arg=0x8837194, kw=0x0)
      at ../Objects/abstract.c:2488
      2488 ../Objects/abstract.c: No such file or directory.
      in ../Objects/abstract.c
      (gdb) bt
      #0 0x080611d5 in PyObject_Call (func=0x862fab0, arg=0x8837194, kw=0x0)
      at ../Objects/abstract.c:2488
      #1 0x080d6ef2 in PyEval_CallObjectWithKeywords (func=0x862fab0,
      arg=0x8837194, kw=0x0) at ../Python/ceval.c:3575
      #2 0x080612a0 in PyObject_CallObject (o=0x862fab0, a=0x8837194)
      at ../Objects/abstract.c:2480
      #3 0x0047af42 in watcher_dispatch (zzh=0x86174e0, type=-1, state=1,
      path=0x86337c8 "", context=0x8588660) at src/c/zookeeper.c:314
      #4 0x00496559 in do_foreach_watcher (zh=0x86174e0, type=-1, state=1,
      path=0x86337c8 "", list=0xa5354140) at src/zk_hashtable.c:275
      #5 deliverWatchers (zh=0x86174e0, type=-1, state=1, path=0x86337c8 "",
      list=0xa5354140) at src/zk_hashtable.c:317
      #6 0x0048ae3c in process_completions (zh=0x86174e0) at src/zookeeper.c:1766
      #7 0x0049706b in do_completion (v=0x86174e0) at src/mt_adaptor.c:333
      #8 0x0013380e in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
      #9 0x002578de in clone () from /lib/tls/i686/cmov/libc.so.6

        Issue Links

          Activity

          Federico created issue -
          Lei Zhang made changes -
          Field Original Value New Value
          Link This issue duplicates ZOOKEEPER-670 [ ZOOKEEPER-670 ]
          Henry Robinson added a comment -

          Thanks for the bug report, Federico. There are a few similar issues with zkpython that are surely related - I must make sure to address them soon.

          Can you let me know what version of Python and what operating system you are running?

          Federico added a comment -

          Thanks. We use servers running Ubuntu 9.10 (32-bit) and the Python from the repository (version 2.6.4).

          Henry Robinson made changes -
          Link This issue relates to ZOOKEEPER-631 [ ZOOKEEPER-631 ]
          Henry Robinson added a comment -

          Adding link - 631 should hopefully resolve this.

          Henry Robinson made changes -
          Assignee Henry Robinson [ henryr ]
          Henry Robinson made changes -
          Fix Version/s 3.4.0 [ 12314469 ]
          Patrick Hunt added a comment -

          Fixed as part of ZOOKEEPER-631

          Patrick Hunt made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Federico added a comment -

          I applied the patch, but I got a segfault again:

          Program received signal SIGSEGV, Segmentation fault.
          [Switching to Thread 0xad244b70 (LWP 18205)]
          0x080611d5 in PyObject_Call (func=0x87fa8a0, arg=0xb53d9b4, kw=0x0) at ../Objects/abstract.c:2488
          2488 ../Objects/abstract.c: No such file or directory.
          in ../Objects/abstract.c
          (gdb) bt
          #0 0x080611d5 in PyObject_Call (func=0x87fa8a0, arg=0xb53d9b4, kw=0x0) at ../Objects/abstract.c:2488
          #1 0x080d6ef2 in PyEval_CallObjectWithKeywords (func=0x87fa8a0, arg=0xb53d9b4, kw=0x0) at ../Python/ceval.c:3575
          #2 0x080612a0 in PyObject_CallObject (o=0x87fa8a0, a=0xb53d9b4) at ../Objects/abstract.c:2480
          #3 0x0047afb2 in watcher_dispatch (zzh=0x8617038, type=-1, state=1, path=0x8631028 "", context=0x8617258) at src/c/zookeeper.c:419
          #4 0x00497559 in do_foreach_watcher (zh=0x8617038, type=-1, state=1, path=0x8631028 "", list=0x8a71088) at src/zk_hashtable.c:275
          #5 deliverWatchers (zh=0x8617038, type=-1, state=1, path=0x8631028 "", list=0x8a71088) at src/zk_hashtable.c:317
          #6 0x0048be3c in process_completions (zh=0x8617038) at src/zookeeper.c:1766
          #7 0x0049806b in do_completion (v=0x8617038) at src/mt_adaptor.c:333
          #8 0x0013380e in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
          #9 0x002578de in clone () from /lib/tls/i686/cmov/libc.so.6

          I'm not sure it's resolved.

          Henry Robinson added a comment -

          OK, thanks for the update. Can you share the code that produces the segfault? That will make it much easier for me to diagnose.

          Henry Robinson made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Mahadev konar added a comment -

          Henry,

          It looks like we can move this out of the 3.3.1 release. Do you want to assign this to just 3.4?

          Henry Robinson added a comment -

          Can't reproduce or diagnose without code; moving to 3.4.0.

          Henry Robinson made changes -
          Fix Version/s 3.3.1 [ 12314846 ]
          Mike Solomon added a comment -

          The common case is when you supply a watcher for a get().

          Start a zk server on localhost and create a node /zk.

          Connect a python process like this:

          import sys
          import zookeeper

          zh = zookeeper.init('localhost:2181')

          def _zk_callback(*args):
              print >> sys.stderr, "_zk_callback", args

          zookeeper.get(zh, '/zk', _zk_callback)

          Kill the zk server. The client will idle fine. When restarting the zk server, I get a SIGSEGV on reconnect 100% of the time.

          This is fixed by the following patch:

          [msolomon]yuriko:~/src/zookeeper-3.3.1/src/contrib/zkpython> svn di
          Index: src/c/zookeeper.c
          ===================================================================
          --- src/c/zookeeper.c (revision 951628)
          +++ src/c/zookeeper.c (working copy)
          @@ -436,7 +436,9 @@
             if (PyObject_CallObject((PyObject*)callback, arglist) == NULL) {
               PyErr_Print();
             }
          -  if (pyw->permanent == 0) {
          +  // msolomon: when a session event happens, watchers get dispatched,
          +  // but they are retained in the C client for dispatch again.
          +  if (pyw->permanent == 0 && type != ZOO_SESSION_EVENT) {
               free_pywatcher(pyw);
             }
             PyGILState_Release(gstate);
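
To illustrate why the extra `type != ZOO_SESSION_EVENT` check matters, here is a toy Python model of the watcher lifetime described in the patch comment. It is only a sketch: `PyWatcher`, `ToyClient` and `run` are hypothetical names that do not exist in zkpython, and the model assumes (as the comment states) that the C client keeps a one-shot watcher registered after a session event and drops it only after a node-level event.

```python
# Toy model of one-shot watcher lifetime; every name here is hypothetical.
SESSION_EVENT = -1   # stands in for ZOO_SESSION_EVENT
CHANGED_EVENT = 3    # stands in for any node-level event


class PyWatcher:
    """Stand-in for the binding's wrapper around a Python callback."""

    def __init__(self, callback, permanent=False):
        self.callback = callback
        self.permanent = permanent
        self.freed = False


class ToyClient:
    """Stand-in for the C client's table of registered watchers."""

    def __init__(self, patched):
        self.patched = patched
        self.watchers = []

    def dispatch(self, event_type):
        for pyw in list(self.watchers):
            if pyw.freed:
                # In C this would be a call through freed memory.
                raise RuntimeError("dispatch to freed watcher")
            pyw.callback(event_type)
            # Assumed client behaviour: a one-shot watcher is dropped only
            # after a node event; after a session event it stays registered.
            if event_type != SESSION_EVENT:
                self.watchers.remove(pyw)
            # Binding-side cleanup, without and with the patch.
            should_free = not pyw.permanent
            if self.patched:
                should_free = should_free and event_type != SESSION_EVENT
            if should_free:
                pyw.freed = True


def run(patched):
    """Replay the kill/restart scenario and return the events delivered."""
    seen = []
    client = ToyClient(patched)
    client.watchers.append(PyWatcher(seen.append))
    client.dispatch(SESSION_EVENT)  # server killed: connection lost
    client.dispatch(SESSION_EVENT)  # server restarted: reconnect
    client.dispatch(CHANGED_EVENT)  # the watch finally fires
    return seen
```

In this model `run(False)` fails on the second session event, which corresponds to the reconnect crash, while `run(True)` delivers all three events and releases the wrapper only after the node event.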

          Henry Robinson added a comment -

          Mike -

          Great catch, thanks for figuring this out.

          Am I correct in saying that this doesn't prevent watchers from eventually being correctly freed?

          If so, then it would be great if you could submit this patch formally so that we can get it into trunk. See http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute for details.

          Thanks,
          Henry

          Andrei Savu added a comment -

          I've created a patch as Mike suggested. I've been able to reproduce the issue and the change seems to fix it. I don't know whether it prevents watchers from eventually being correctly freed. I've also run all the tests and everything seems to work fine.

          Andrei Savu made changes -
          Attachment ZOOKEEPER-740.patch [ 12453927 ]
          Andrei Savu made changes -
          Status Reopened [ 4 ] Patch Available [ 10002 ]
          Austin Shoemaker made changes -
          Link This issue is duplicated by ZOOKEEPER-888 [ ZOOKEEPER-888 ]
          Austin Shoemaker added a comment -

          ZOOKEEPER-740.patch fixes the crash, though it looks like the pywatcher_t will be leaked on an unrecoverable session state change (EXPIRED_SESSION_STATE or AUTH_FAILED_STATE). Attached a proposed revision to ZOOKEEPER-888 for your review.
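
One way to read the leak concern is as a rule that still releases the wrapper on a session event when the session can no longer recover. The sketch below is an illustration of that rule only, not the ZOOKEEPER-888 patch: `should_free_watcher` is a hypothetical helper, the numeric constants are assumed to match the C client's header, and never freeing permanent watchers is a simplifying assumption.

```python
# Hypothetical constants; the numeric values are assumed to match the
# C client's zookeeper.h and should be checked against it.
ZOO_SESSION_EVENT = -1
ZOO_CONNECTING_STATE = 1
ZOO_CONNECTED_STATE = 3
ZOO_EXPIRED_SESSION_STATE = -112
ZOO_AUTH_FAILED_STATE = -113

# States after which no further events can arrive for this session.
UNRECOVERABLE_STATES = frozenset(
    [ZOO_EXPIRED_SESSION_STATE, ZOO_AUTH_FAILED_STATE]
)


def should_free_watcher(permanent, event_type, state):
    """Decide whether a watcher wrapper can be released after a dispatch."""
    if permanent:
        # Assumption: permanent watchers are released elsewhere.
        return False
    if event_type != ZOO_SESSION_EVENT:
        # A node-level event consumes a one-shot watcher.
        return True
    # Session event: keep the wrapper unless the session is gone for good.
    return state in UNRECOVERABLE_STATES
```

With this rule, the reconnect case from the backtraces (type=-1, state=1) keeps the wrapper alive, while an expired or auth-failed session releases it.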

          Andrei Savu added a comment -

          It seems like this issue is fixed on the trunk by patch ZOOKEEPER-888 but unfortunately the 3.3 branch still gives a segfault. I've tested using the code sample provided by Mike Solomon.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12453927/ZOOKEEPER-740.patch
          against trunk revision 1033155.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/18//console

          This message is automatically generated.

          Patrick Hunt added a comment -

          Looks like the patch is failing to apply. Could someone update and resubmit?

          Patrick Hunt made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Austin Shoemaker added a comment -

          I had also uploaded a ZOOKEEPER-888 patch based on the 3.3 branch (ZOOKEEPER-888-3.3.patch). Maybe that helps?

          Mahadev konar added a comment -

          Any update on this? Should we try to get this into the 3.4 release?

          Mahadev konar added a comment -

          Moving this out to 3.5 release.

          Mahadev konar made changes -
          Fix Version/s 3.5.0 [ 12316644 ]
          Fix Version/s 3.4.0 [ 12314469 ]
          iceman made changes -
          Priority Critical [ 2 ] Major [ 3 ]
          Michi Mutsuzaki added a comment -

          This issue has been fixed in 3.4 by ZOOKEEPER-888.

          Michi Mutsuzaki made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 3.5.0 [ 12316644 ]
          Resolution Fixed [ 1 ]
          Transition                   Time In Source Status   Execution Times   Last Executer     Last Execution Date
          Resolved -> Reopened         1d 8h 52m               1                 Henry Robinson    23/Apr/10 16:35
          Reopened -> Patch Available  135d 16h 52m            1                 Andrei Savu       06/Sep/10 09:27
          Patch Available -> Open      64d 17h 18m             1                 Patrick Hunt      10/Nov/10 01:46
          Open -> Resolved             1270d 22h 33m           2                 Michi Mutsuzaki   25/Apr/14 02:45

            People

            • Assignee: Henry Robinson
            • Reporter: Federico
            • Votes: 0
            • Watchers: 4
