Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 3.3.2, 3.4.0
    • c client
    • None
    • Reviewed

    Description

      We observed a crash while closing our c client. It was in the do_io() thread that was processing as during the close() call.

      #0 queue_buffer (list=0x6bd4f8, b=0x0, add_to_front=0) at src/zookeeper.c:969
      #1 0x000000000046234e in check_events (zh=0x6bd480, events=<value optimized out>) at src/zookeeper.c:1687
      #2 0x0000000000462d74 in zookeeper_process (zh=0x6bd480, events=2) at src/zookeeper.c:1971
      #3 0x0000000000469c34 in do_io (v=0x6bd480) at src/mt_adaptor.c:311
      #4 0x00007ffff7bc59ca in start_thread () from /lib/libpthread.so.0
      #5 0x00007ffff6f706fd in clone () from /lib/libc.so.6
      #6 0x0000000000000000 in ?? ()

      We tracked down the sequence of events, and the cause is that input_buffer is being freed from a thread other than the do_io thread that relies on it:

      1. do_io() call check_events()
      2. if(events&ZOOKEEPER_READ) branch executes
      3. if (rc > 0) branch executes
      4. if (zh->input_buffer != &zh->primer_buffer) branch executes
      .....in the meantime......
      5. zookeeper_close() called
      6. if (inc_ref_counter(zh,0)!=0) branch executes
      7. cleanup_bufs() is called
      8. input_buffer is freed at the end
      ..... back to check_events().........
      9. queue_events() is called on a NULL buffer.

      I believe the patch is to only call free_completions() in zookeeper_close() and not cleanup_bufs(). The original reason cleanup_bufs() was added was to call any outstanding synhcronous completions, so only free_completions (which is guarded) is needed. I will submit a patch for review with this change.

      Attachments

        1. ZOOKEEEPER-897.diff
          0.4 kB
          Jared Cantwell
        2. ZOOKEEPER-897.patch
          0.5 kB
          Jared Cantwell

        Activity

          jaredc Jared Cantwell added a comment -

          Patch that only calls free_completions in zookeeper_close() instead of cleanup_bufs().

          jaredc Jared Cantwell added a comment - Patch that only calls free_completions in zookeeper_close() instead of cleanup_bufs().
          jaredc Jared Cantwell added a comment -

          Updated patch format and spelling.

          jaredc Jared Cantwell added a comment - Updated patch format and spelling.
          mahadev Mahadev Konar added a comment -

          jared,
          The patch that you provided leaks memory for the zookeeper client. We have to clean up the tosend and to process buffers on close and free them. Did you observe the problem with which release?

          I had tried to fix all the issues with zookeeper_close() in ZOOKEEPER-591. Also, michi has fixed a couple other issues in ZOOKEEPER-804.

          what version of code are you running? Also, can you provide some test case which causes this issue? (I know its hard to reproduce but even a test that reproduces it once in 10-20 times is good enough).

          mahadev Mahadev Konar added a comment - jared, The patch that you provided leaks memory for the zookeeper client. We have to clean up the tosend and to process buffers on close and free them. Did you observe the problem with which release? I had tried to fix all the issues with zookeeper_close() in ZOOKEEPER-591 . Also, michi has fixed a couple other issues in ZOOKEEPER-804 . what version of code are you running? Also, can you provide some test case which causes this issue? (I know its hard to reproduce but even a test that reproduces it once in 10-20 times is good enough).
          jaredc Jared Cantwell added a comment -

          we are using the 3.3.2 release. i don't think the patch leaks memory because destroy() will eventually get called (by the reentrant call to zookeeper_close()), which calls cleanup_bufs() and frees those buffers, right? Also, i had a test that reproduced this error, but it was easiest to reproduce if i injected artificial sleeps into the zookeeper.c file. If that's ok, then I can submit that. Otherwise, i'll see if i can devise a test that can reproduce it otherwise.

          jaredc Jared Cantwell added a comment - we are using the 3.3.2 release. i don't think the patch leaks memory because destroy() will eventually get called (by the reentrant call to zookeeper_close()), which calls cleanup_bufs() and frees those buffers, right? Also, i had a test that reproduced this error, but it was easiest to reproduce if i injected artificial sleeps into the zookeeper.c file. If that's ok, then I can submit that. Otherwise, i'll see if i can devise a test that can reproduce it otherwise.
          mahadev Mahadev Konar added a comment -

          You are right Jared. The patch looks good to me. I am trying to rerun the cpp unit tests a couple of times to reverify that the client close doesnt create any problems. It would be great if you could rerun the cppunit tests couple of times to endorse the fix.

          mahadev Mahadev Konar added a comment - You are right Jared. The patch looks good to me. I am trying to rerun the cpp unit tests a couple of times to reverify that the client close doesnt create any problems. It would be great if you could rerun the cppunit tests couple of times to endorse the fix.
          jaredc Jared Cantwell added a comment -

          I wasn't able to figure out how to write a test for this scenario without injecting some helper code into zookeeper.c. I ran the cpp unittests ~20 times locally and no issues. Also, I've temporarily been using this in our test environment for ~2 weeks and no issues there either.

          jaredc Jared Cantwell added a comment - I wasn't able to figure out how to write a test for this scenario without injecting some helper code into zookeeper.c. I ran the cpp unittests ~20 times locally and no issues. Also, I've temporarily been using this in our test environment for ~2 weeks and no issues there either.

          perhaps we should rely on existing testing for this one, but enter a new jira to refactor the client, specifically to allow testing? (ie a way to inject the helper code w/o needing to edit zookeeper.c directly)

          phunt Patrick D. Hunt added a comment - perhaps we should rely on existing testing for this one, but enter a new jira to refactor the client, specifically to allow testing? (ie a way to inject the helper code w/o needing to edit zookeeper.c directly)
          mahadev Mahadev Konar added a comment -

          jared, pat,
          I am ok without a test case for this one, because its a quite hard to create one. I just wanted someone else to run the tests on there machines just to verify (since I rarely see any problems in c tests on my machine). I will go ahead and commit this patch for now.

          mahadev Mahadev Konar added a comment - jared, pat, I am ok without a test case for this one, because its a quite hard to create one. I just wanted someone else to run the tests on there machines just to verify (since I rarely see any problems in c tests on my machine). I will go ahead and commit this patch for now.
          mahadev Mahadev Konar added a comment -

          I just committed this. thanks jared!

          mahadev Mahadev Konar added a comment - I just committed this. thanks jared!
          hudson Hudson added a comment -

          Integrated in ZooKeeper-trunk #983 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/983/)
          ZOOKEEPER-897. C Client seg faults during close (jared cantwell via mahadev)

          hudson Hudson added a comment - Integrated in ZooKeeper-trunk #983 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/983/ ) ZOOKEEPER-897 . C Client seg faults during close (jared cantwell via mahadev)

          People

            jaredc Jared Cantwell
            jaredc Jared Cantwell
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: