CouchDB
  1. CouchDB
  2. COUCHDB-275

couch crashes erlang vm under heavy load

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      Linux melkjug.com 2.6.23-gentoo-r8 #1 SMP Wed Feb 13 14:28:49 EST 2008 x86_64 QEMU Virtual CPU version 0.9.1 GenuineIntel GNU/Linux

      Description

      I clicked "Compact" in futon for my 11G database at 9:04 AM EST:

      [Mon, 02 Mar 2009 14:04:32 GMT] [info] [<0.59.0>] Starting compaction for db "melkjug"

      An hour and a half later it was 85% finished and then the following was output to stderr:

      heart: Mon Mar 2 10:33:20 2009: heart-beat time-out.
      /usr/bin/couchdb: line 255: echo: write error: Broken pipe
      heart: Mon Mar 2 10:33:22 2009: Executed "/usr/bin/couchdb -k". Terminating.

      I am retaining my 4.3G melkjug.couch.compact file in case it's useful in debugging this.

      1. term_to_binary_fix.diff
        17 kB
        Damien Katz
      2. 2009-03-05-couch.log.snippet
        1.24 MB
        Joshua Bronson

        Activity

        Hide
        Noah Slater added a comment -

        Just for clarification, the issue here is related to the Erlang VM crashing.

        The weird broken pipe error is a benign artefact of the Erlang heart process trying to restart CouchDB.

        Show
        Noah Slater added a comment - Just for clarification, the issue here is related to the Erlang VM crashing. The weird broken pipe error is a benign artefact of the Erlang heart process trying to restart CouchDB.
        Hide
        Joshua Bronson added a comment -

        This happened again last night, except not during compaction but rather during one of my application's update cycles (which slams couch with requests). Should have mentioned earlier, I am running off of trunk r746690. This did not used to happen when we were running off of r745993. Could running off a different revision of trunk help us avoid this? If so could you recommend one?

        Thanks,
        Josh

        Show
        Joshua Bronson added a comment - This happened again last night, except not during compaction but rather during one of my application's update cycles (which slams couch with requests). Should have mentioned earlier, I am running off of trunk r746690. This did not used to happen when we were running off of r745993. Could running off a different revision of trunk help us avoid this? If so could you recommend one? Thanks, Josh
        Hide
        Jan Lehnardt added a comment -

        What kind of load?

        Can you paste more relevant couch/log entries?

        Show
        Jan Lehnardt added a comment - What kind of load? Can you paste more relevant couch/log entries?
        Hide
        Joshua Bronson added a comment -

        See http://paste.pocoo.org/show/106370/ for the last 100 lines of couch.log. The last logged request is at 03 Mar 2009 08:11:48 GMT and at 08:12:04 GMT couch output the broken pipe error to stderr (see http://paste.pocoo.org/show/106368/ for the last 100 lines of stderr output and http://paste.pocoo.org/show/106367/ for the last 100 lines of stdout).

        Show
        Joshua Bronson added a comment - See http://paste.pocoo.org/show/106370/ for the last 100 lines of couch.log. The last logged request is at 03 Mar 2009 08:11:48 GMT and at 08:12:04 GMT couch output the broken pipe error to stderr (see http://paste.pocoo.org/show/106368/ for the last 100 lines of stderr output and http://paste.pocoo.org/show/106367/ for the last 100 lines of stdout).
        Hide
        Joshua Bronson added a comment -

        updated to r749851 as jan___ in #couchdb told me about a file descriptor leak prior to r748299, and retried compaction. this time compaction appears to have successfully completed, at least the following was output to couch.log:

        [Wed, 04 Mar 2009 23:19:23 GMT] [debug] [<0.8952.0>] CouchDB swapping files /var/lib/couchdb/melkjug.couch and /var/lib/couchdb/melkjug.couch.compact.

        however, 11 seconds later (adjusting for my local time of GMT-5) the erlang vm appears to have crashed again: the following was output to stderr:

        heart: Wed Mar 4 18:19:34 2009: heart-beat time-out.
        heart: Wed Mar 4 18:19:39 2009: Unable to kill old process, kill failed (tried multiple times).
        heart: Wed Mar 4 18:19:40 2009: Executed "/usr/bin/couchdb -k". Terminating.

        Show
        Joshua Bronson added a comment - updated to r749851 as jan___ in #couchdb told me about a file descriptor leak prior to r748299, and retried compaction. this time compaction appears to have successfully completed, at least the following was output to couch.log: [Wed, 04 Mar 2009 23:19:23 GMT] [debug] [<0.8952.0>] CouchDB swapping files /var/lib/couchdb/melkjug.couch and /var/lib/couchdb/melkjug.couch.compact. however, 11 seconds later (adjusting for my local time of GMT-5) the erlang vm appears to have crashed again: the following was output to stderr: heart: Wed Mar 4 18:19:34 2009: heart-beat time-out. heart: Wed Mar 4 18:19:39 2009: Unable to kill old process, kill failed (tried multiple times). heart: Wed Mar 4 18:19:40 2009: Executed "/usr/bin/couchdb -k". Terminating.
        Hide
        Joshua Bronson added a comment -

        couch crashed again while under heavy load early this morning. attaching an excerpt of the last 26 seconds of couch.log before the crash, 10:34:00 GMT to 10:34:26 GMT, in case it's more convenient than the pastebin link, all it indicates is the amount of load we're putting on couch – 3422 requests in 26 seconds (~130req/sec) is typical load over the course of the 2.5 hours we update the database with people's news. The 15:10:43 GMT timestamp of the last few lines of the excerpt reflects that couch was down for the 5 hours between when it crashed and when I started it back up.

        Show
        Joshua Bronson added a comment - couch crashed again while under heavy load early this morning. attaching an excerpt of the last 26 seconds of couch.log before the crash, 10:34:00 GMT to 10:34:26 GMT, in case it's more convenient than the pastebin link, all it indicates is the amount of load we're putting on couch – 3422 requests in 26 seconds (~130req/sec) is typical load over the course of the 2.5 hours we update the database with people's news. The 15:10:43 GMT timestamp of the last few lines of the excerpt reflects that couch was down for the 5 hours between when it crashed and when I started it back up.
        Hide
        Joshua Bronson added a comment -

        after attaching gdb to beam and triggering a crash by invoking a compaction, I got a backtrace with many lines like "#2073 0x00000000004a535b in enc_term ()". from what i gathered from damienkatz in #couchdb the problem is the term_to_binary function calls itself recursively for nested lists and tuples which can cause a stack overflow. i tried compacting after doubling stack size via "ulimit -s" and it completed before crashing (yay!) but then crashed and came back a few seconds later.

        Show
        Joshua Bronson added a comment - after attaching gdb to beam and triggering a crash by invoking a compaction, I got a backtrace with many lines like "#2073 0x00000000004a535b in enc_term ()". from what i gathered from damienkatz in #couchdb the problem is the term_to_binary function calls itself recursively for nested lists and tuples which can cause a stack overflow. i tried compacting after doubling stack size via "ulimit -s" and it completed before crashing (yay!) but then crashed and came back a few seconds later.
        Hide
        Damien Katz added a comment -

        This is patch to the Erlang vm for this crash. It fixes a problem with Erlang's term_to_binary code where it blows the C stack on deeply nested terms (ie deep trees). Example, this will crash any unpatched erlang VM:

        > term_to_binary(lists:foldl(fun(E,A) -> [E, A] end, [], lists:seq(1, 100000))).

        This patch fixes the Erlang vm by changing the the term_to_binary code from a recursive C implementation to one using it's own stack.

        Show
        Damien Katz added a comment - This is patch to the Erlang vm for this crash. It fixes a problem with Erlang's term_to_binary code where it blows the C stack on deeply nested terms (ie deep trees). Example, this will crash any unpatched erlang VM: > term_to_binary(lists:foldl(fun(E,A) -> [E, A] end, [], lists:seq(1, 100000))). This patch fixes the Erlang vm by changing the the term_to_binary code from a recursive C implementation to one using it's own stack.
        Hide
        Jan Lehnardt added a comment -

        Björn Gustavsson from the OTP Team replies:

        We will not include the suggested patch in OTP.

        However, we do intend to fix the problem as soon as possible,
        in a way that degrades performance as little as possible,
        possibly already in the R13B release.

        /Bjorn


        Björn Gustavsson, Erlang/OTP, Ericsson AB

        Show
        Jan Lehnardt added a comment - Björn Gustavsson from the OTP Team replies: We will not include the suggested patch in OTP. However, we do intend to fix the problem as soon as possible, in a way that degrades performance as little as possible, possibly already in the R13B release. /Bjorn – Björn Gustavsson, Erlang/OTP, Ericsson AB
        Hide
        Joshua Bronson added a comment -

        I have installed erlang R12B-5 on a development server, applied the attached patch, and can confirm that it prevents the term_to_binary snippet above from crashing erlang. did bjorn give any idea as to why the OTP team will not include this patch? would be good to know before we roll this out on our production server.

        Show
        Joshua Bronson added a comment - I have installed erlang R12B-5 on a development server, applied the attached patch, and can confirm that it prevents the term_to_binary snippet above from crashing erlang. did bjorn give any idea as to why the OTP team will not include this patch? would be good to know before we roll this out on our production server.
        Hide
        Joshua Bronson added a comment -

        Though the patch prevented the term_to_binary snippet from crashing Erlang, compacting still caused a crash. I had gdb attached while it happened but it didn't help:

        1. gdb program `pidof beam.smp`
          ...
          (gdb) continue
          Continuing.

        [hours pass while compaction progresses]

        Apache CouchDB crashed, restarting in 3 seconds.
        Program terminated with signal SIGKILL, Killed.
        The program no longer exists.
        (gdb) bt
        No stack.
        (gdb) # d'oh

        For some reason it didn't catch Erlang in the act of crashing so there was no opportunity to get a backtrace.

        Using ulimit to increase stack size for now.

        Show
        Joshua Bronson added a comment - Though the patch prevented the term_to_binary snippet from crashing Erlang, compacting still caused a crash. I had gdb attached while it happened but it didn't help: gdb program `pidof beam.smp` ... (gdb) continue Continuing. [hours pass while compaction progresses] Apache CouchDB crashed, restarting in 3 seconds. Program terminated with signal SIGKILL, Killed. The program no longer exists. (gdb) bt No stack. (gdb) # d'oh For some reason it didn't catch Erlang in the act of crashing so there was no opportunity to get a backtrace. Using ulimit to increase stack size for now.
        Hide
        Joshua Bronson added a comment -

        looks like this may have been fixed in the erlang/otp R13B release (just last week). from the release notes[1]:

        OTP-7894 The term_to_binary/1 BIF used to be implemented with
        recursive C code, which could cause the Erlang emulator to
        terminate because of a stack overflow.

        Also fixed some minor issues in term_to_binary/1 and
        binary_to_term/1 pointed out by Matthew Dempsky.

        Has anyone tried R13B yet?

        [1] http://erlang.org/download/otp_src_R13B.readme

        Show
        Joshua Bronson added a comment - looks like this may have been fixed in the erlang/otp R13B release (just last week). from the release notes [1] : OTP-7894 The term_to_binary/1 BIF used to be implemented with recursive C code, which could cause the Erlang emulator to terminate because of a stack overflow. Also fixed some minor issues in term_to_binary/1 and binary_to_term/1 pointed out by Matthew Dempsky. Has anyone tried R13B yet? [1] http://erlang.org/download/otp_src_R13B.readme
        Hide
        Joshua Bronson added a comment -

        We are running Gentoo system installs of CouchDB 0.9 and Erlang R13B on a new production server and are no longer seeing the crashes. The term_to_binary snippet:

        1> term_to_binary(lists:foldl(fun(E,A) -> [E, A] end, [], lists:seq(1, 100000))).

        also no longer causes Erlang to crash.

        Show
        Joshua Bronson added a comment - We are running Gentoo system installs of CouchDB 0.9 and Erlang R13B on a new production server and are no longer seeing the crashes. The term_to_binary snippet: 1> term_to_binary(lists:foldl(fun(E,A) -> [E, A] end, [], lists:seq(1, 100000))). also no longer causes Erlang to crash.
        Hide
        Jan Lehnardt added a comment -

        Erlang R13B fixes this issue.

        Show
        Jan Lehnardt added a comment - Erlang R13B fixes this issue.

          People

          • Assignee:
            Unassigned
            Reporter:
            Joshua Bronson
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development