ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-763

Deadlock on close w/ zkpython / c client

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.3.1, 3.4.0
    • Component/s: contrib-bindings
    • Labels:
      None
    • Environment:

      ubuntu 10.04, zookeeper 3.3.0 and trunk

    • Hadoop Flags:
      Reviewed

      Description

      deadlocks occur if we attempt to close a handle while there are any outstanding async requests (aget, acreate, etc). Normally on close both the io thread terminates and the completion thread are terminated and joined, however w\ith outstanding async requests, the completion thread won't be in a joinable state, and we effectively hang when the main thread does the join.

      afaics ideal behavior would be on close of a handle, to effectively clear out any remaining callbacks and let the completion thread terminate.

      i've tried adding some bookkeeping to within a python client to guard against closing while there is an outstanding async completion request, but its an imperfect solution since even after the python callback is executed there is still a window for deadlock before the completion thread finishes the callback.

      a simple example to reproduce the deadlock is attached.

      1. ZOOKEEPER-763.patch
        3 kB
        Henry Robinson
      2. ZOOKEEPER-763.patch
        3 kB
        Henry Robinson
      3. stack-trace-deadlock.txt
        0.7 kB
        Kapil Thangavelu
      4. deadlock.py
        0.4 kB
        Kapil Thangavelu
      5. deadlock_v2.py
        0.4 kB
        Kapil Thangavelu

        Activity

        Kapil Thangavelu created issue -
        Hide
        Kapil Thangavelu added a comment -

        a simple example that exhibits deadlock behavior on close.

        Show
        Kapil Thangavelu added a comment - a simple example that exhibits deadlock behavior on close.
        Kapil Thangavelu made changes -
        Field Original Value New Value
        Attachment deadlock.py [ 12443579 ]
        Mahadev konar made changes -
        Assignee Mahadev konar [ mahadev ]
        Fix Version/s 3.4.0 [ 12314469 ]
        Affects Version/s 3.4.0 [ 12314469 ]
        Hide
        Kapil Thangavelu added a comment -

        A gdb stack trace on the process after it deadlocks.

        Show
        Kapil Thangavelu added a comment - A gdb stack trace on the process after it deadlocks.
        Kapil Thangavelu made changes -
        Attachment stack-trace-deadlock.txt [ 12443734 ]
        Hide
        Henry Robinson added a comment -

        Hi Kapil -

        As seems to be the norm for me this week, I'm struggling to reproduce It does seem like your python script explicitly waits for a completion to be called before closing a handle. Is this enough to leave an outstanding completion on the queue?

        Can you capture the stacktrace for the completion thread? I think it must be getting stuck in process_completions but it would be very valuable to know where - if it's stuck on the callback into zkpython then that means the deadlock is in the python bindings and not solely in C-land.

        cheers,
        Henry

        Show
        Henry Robinson added a comment - Hi Kapil - As seems to be the norm for me this week, I'm struggling to reproduce It does seem like your python script explicitly waits for a completion to be called before closing a handle. Is this enough to leave an outstanding completion on the queue? Can you capture the stacktrace for the completion thread? I think it must be getting stuck in process_completions but it would be very valuable to know where - if it's stuck on the callback into zkpython then that means the deadlock is in the python bindings and not solely in C-land. cheers, Henry
        Hide
        Kapil Thangavelu added a comment -

        Hi Henry

        The issue with the example i sent is that when the condition notify happens in the python callback, the main process thread can start running before the callback has exited and the completion thread will still be running. It could probably make be made more explicit for reproducing by inserting a time.sleep(1) line into the callback after the notify.

        This is the stack trace for the completion thread on deadlock.

        #0 0x00cb8422 in __kernel_vsyscall ()
        #1 0x00387245 in sem_wait@@GLIBC_2.1 () from /lib/tls/i686/cmov/libpthread.so.0
        #2 0x0810abe8 in PyThread_acquire_lock ()
        #3 0x080dcc11 in PyEval_EvalFrameEx ()
        #4 0x080e2807 in PyEval_EvalCodeEx ()
        #5 0x080e0c8b in PyEval_EvalFrameEx ()
        #6 0x080e1bb0 in PyEval_EvalFrameEx ()
        #7 0x080e2807 in PyEval_EvalCodeEx ()
        #8 0x0816b2ac in ?? ()
        #9 0x0806245a in PyObject_Call ()
        #10 0x080db892 in PyEval_CallObjectWithKeywords ()
        #11 0x080624f0 in PyObject_CallObject ()
        #12 0x00a8dd95 in data_completion_dispatch (rc=0, value=0xa01dcd8 "8\334\001\n\370\263N", value_len=0, stat=0xb773828c, data=0xa0035d0) at src/c/zookeeper.c:410
        #13 0x00f33ecc in process_completions (zh=0xa024ab0) at src/zookeeper.c:1778
        #14 0x00f4005b in do_completion (v=0xa024ab0) at src/mt_adaptor.c:333
        #15 0x0038096e in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
        #16 0x00461a0e in clone () from /lib/tls/i686/cmov/libc.so.6

        thanks,
        Kapil

        Show
        Kapil Thangavelu added a comment - Hi Henry The issue with the example i sent is that when the condition notify happens in the python callback, the main process thread can start running before the callback has exited and the completion thread will still be running. It could probably make be made more explicit for reproducing by inserting a time.sleep(1) line into the callback after the notify. This is the stack trace for the completion thread on deadlock. #0 0x00cb8422 in __kernel_vsyscall () #1 0x00387245 in sem_wait@@GLIBC_2.1 () from /lib/tls/i686/cmov/libpthread.so.0 #2 0x0810abe8 in PyThread_acquire_lock () #3 0x080dcc11 in PyEval_EvalFrameEx () #4 0x080e2807 in PyEval_EvalCodeEx () #5 0x080e0c8b in PyEval_EvalFrameEx () #6 0x080e1bb0 in PyEval_EvalFrameEx () #7 0x080e2807 in PyEval_EvalCodeEx () #8 0x0816b2ac in ?? () #9 0x0806245a in PyObject_Call () #10 0x080db892 in PyEval_CallObjectWithKeywords () #11 0x080624f0 in PyObject_CallObject () #12 0x00a8dd95 in data_completion_dispatch (rc=0, value=0xa01dcd8 "8\334\001\n\370\263N", value_len=0, stat=0xb773828c, data=0xa0035d0) at src/c/zookeeper.c:410 #13 0x00f33ecc in process_completions (zh=0xa024ab0) at src/zookeeper.c:1778 #14 0x00f4005b in do_completion (v=0xa024ab0) at src/mt_adaptor.c:333 #15 0x0038096e in start_thread () from /lib/tls/i686/cmov/libpthread.so.0 #16 0x00461a0e in clone () from /lib/tls/i686/cmov/libc.so.6 thanks, Kapil
        Hide
        Kapil Thangavelu added a comment -

        new version of the deadlock script that uses an explicit sleep to get a thread switch, instead of looping around the function and hoping it happens.

        Show
        Kapil Thangavelu added a comment - new version of the deadlock script that uses an explicit sleep to get a thread switch, instead of looping around the function and hoping it happens.
        Kapil Thangavelu made changes -
        Attachment deadlock_v2.py [ 12443751 ]
        Hide
        Henry Robinson added a comment -

        Kapil -

        Thanks! Adding that sleep helped me understand what was going on.

        pyzoo_close has the GIL but blocks inside zookeeper_close, waiting for the completion thread to finish. However, if a completion is still inside Python, but has been pre-empted by the main thread which calls pyzoo_close, the completion can't get the GIL back to finish up executing, blocking the completions_thread for ever more. The fix is simple - relinquish the GIL during the zookeeper_close call, and then reacquire it straight after. There are even handy macros to do this:

        Py_BEGIN_ALLOW_THREADS
        ret = zookeeper_close(zhandles[zkhid]);
        Py_END_ALLOW_THREADS

        This same issue will affect any part of zkpython where a call to the C client is blocked on some work being completed in another Python thread - in practice, I think this means from callbacks. I'll audit the code to see if any other API calls are affected. Patch to fix this issue is following shortly - Kapil, I'd be very grateful if you could help us by testing it.

        cheers,
        Henry

        Show
        Henry Robinson added a comment - Kapil - Thanks! Adding that sleep helped me understand what was going on. pyzoo_close has the GIL but blocks inside zookeeper_close, waiting for the completion thread to finish. However, if a completion is still inside Python, but has been pre-empted by the main thread which calls pyzoo_close, the completion can't get the GIL back to finish up executing, blocking the completions_thread for ever more. The fix is simple - relinquish the GIL during the zookeeper_close call, and then reacquire it straight after. There are even handy macros to do this: Py_BEGIN_ALLOW_THREADS ret = zookeeper_close(zhandles [zkhid] ); Py_END_ALLOW_THREADS This same issue will affect any part of zkpython where a call to the C client is blocked on some work being completed in another Python thread - in practice, I think this means from callbacks. I'll audit the code to see if any other API calls are affected. Patch to fix this issue is following shortly - Kapil, I'd be very grateful if you could help us by testing it. cheers, Henry
        Henry Robinson made changes -
        Assignee Mahadev konar [ mahadev ] Henry Robinson [ henryr ]
        Fix Version/s 3.3.1 [ 12314846 ]
        Component/s c client [ 12312380 ]
        Hide
        Henry Robinson added a comment -

        Patch attached, with new test (note that test will hang if it fails, no obvious way to fix that - will open a JIRA to track).

        Also included is fix for type in acl_test.py; not sure how I allowed that to creep in :/

        Show
        Henry Robinson added a comment - Patch attached, with new test (note that test will hang if it fails, no obvious way to fix that - will open a JIRA to track). Also included is fix for type in acl_test.py; not sure how I allowed that to creep in :/
        Henry Robinson made changes -
        Attachment ZOOKEEPER-763.patch [ 12443765 ]
        Henry Robinson made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12443765/ZOOKEEPER-763.patch
        against trunk revision 941473.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/83/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12443765/ZOOKEEPER-763.patch against trunk revision 941473. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/83/console This message is automatically generated.
        Hide
        Henry Robinson added a comment -

        Forgot --no-prefix again :/

        Show
        Henry Robinson added a comment - Forgot --no-prefix again :/
        Henry Robinson made changes -
        Attachment ZOOKEEPER-763.patch [ 12443769 ]
        Henry Robinson made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Henry Robinson made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Patrick Hunt added a comment -

        This addressed the problem on my machine (ubuntu karmic)

        Show
        Patrick Hunt added a comment - This addressed the problem on my machine (ubuntu karmic)
        Hide
        Kapil Thangavelu added a comment -

        works for me on a couple of different zkpython apps (ubuntu lucid) , thanks!

        Show
        Kapil Thangavelu added a comment - works for me on a couple of different zkpython apps (ubuntu lucid) , thanks!
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12443769/ZOOKEEPER-763.patch
        against trunk revision 941473.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 5 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/149/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/149/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/149/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12443769/ZOOKEEPER-763.patch against trunk revision 941473. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/149/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/149/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/149/console This message is automatically generated.
        Hide
        Patrick Hunt added a comment -

        updated patch for 3.3 branch

        Show
        Patrick Hunt added a comment - updated patch for 3.3 branch
        Patrick Hunt made changes -
        Attachment ZOOKEEPER-763_3_3_1.patch [ 12443781 ]
        Hide
        Patrick Hunt added a comment -

        +1, thanks Henry (and Kapil!)

        Show
        Patrick Hunt added a comment - +1, thanks Henry (and Kapil!)
        Patrick Hunt made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Patrick Hunt made changes -
        Attachment ZOOKEEPER-763_3_3_1.patch [ 12443781 ]
        Hide
        Patrick Hunt added a comment -

        For some reason I got confused on the 3.3 branch (may not have been up to date), the main patch applies to both just fine. Fixed this in svn.

        Show
        Patrick Hunt added a comment - For some reason I got confused on the 3.3 branch (may not have been up to date), the main patch applies to both just fine. Fixed this in svn.
        Mahadev konar made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Henry Robinson
            Reporter:
            Kapil Thangavelu
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development