COUCHDB-536

CouchDB HTTP server stops accepting connections

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10, 1.1
    • Fix Version/s: None
    • Component/s: HTTP Interface
    • Labels:
      None
    • Environment:

      Ubuntu Linux 8.04 32bit and 64bit with Erlang R13B01
      or Ubuntu Linux 8.04 64bit with Erlang R14B02

    • Skill Level:
      Committers Level (Medium to Hard)

      Description

      With 3 Couches all replicating a couple of databases to each other (pull replication with an update notification process), the HTTP service on any of the Couches stops working at some point (after running for a couple of hours with constant changes on all databases and servers).

      This is the error when a new HTTP request comes in:

      =ERROR REPORT==== 19-Oct-2009::10:18:55 ===
      application: mochiweb
      "Accept failed error"
      "{error,enfile}"
      [error] [<0.21619.12>] {error_report,<0.24.0>,
          {<0.21619.12>,crash_report,
           [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
             {pid,<0.21619.12>},
             {registered_name,[]},
             {error_info,
              {exit,
               {error,accept_failed},
               [{mochiweb_socket_server,acceptor_loop,1},
                {proc_lib,init_p_do_apply,3}]}},
             {ancestors,[couch_httpd,couch_secondary_services,couch_server_sup,<0.1.0>]},
             {messages,[]},
             {links,[<0.66.0>]},
             {dictionary,[]},
             {trap_exit,false},
             {status,running},
             {heap_size,233},
             {stack_size,24},
             {reductions,202}],
            []]}}
      [error] [<0.66.0>] {error_report,<0.24.0>,
          {<0.66.0>,std_error,
           {mochiweb_socket_server,225,{acceptor_error,{error,accept_failed}}}}}

      To me it looks like the server runs out of threads or sockets to handle the new connection, or something along those lines.
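For readers unfamiliar with the error atom: `enfile` in the trace is the POSIX errno ENFILE (the system-wide open file table is full), which is distinct from EMFILE (one process has reached its own descriptor limit). Sockets count against both limits, just like regular files. The following Python sketch is not part of the original report; it only prints the two codes side by side, and the helper name `describe` is made up for illustration.

```python
import errno
import os


def describe(code: int) -> str:
    """Return 'NAME (number): message' for an errno value."""
    return f"{errno.errorcode[code]} ({code}): {os.strerror(code)}"


# ENFILE: the whole system has too many open files.
# EMFILE: this one process has hit its own descriptor limit.
for code in (errno.ENFILE, errno.EMFILE):
    print(describe(code))
```

Which of the two appears in the log hints at where to look: a per-process limit (ulimit) for EMFILE, or a system-wide limit for ENFILE.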

      Also, in this setup, if I push lots of changes in a short time, at some point the replication process hangs (never finishes). Restarting the same replication is then not possible and results in a timeout.

      1. couchdb_httpd_request_methods-week.png
        27 kB
        Simon Eisenmann
      2. couchdb_database_open-week.png
        19 kB
        Simon Eisenmann
      3. couchdb_httpd_response_codes-week.png
        19 kB
        Simon Eisenmann

        Activity

        James Howe added a comment -

        Think I've hit the same thing in 1.2.0. After making lots of view queries (triggering a reindex on every one on the server with a script) the server got into a state where it would abort a random selection of requests (made at a more leisurely rate), including a number of those used by Futon. Didn't have a chance to investigate socket states on the server unfortunately.

        The first error was this:
        [Wed, 09 Jan 2013 16:00:03 GMT] [error] [<0.1528.3109>] {error_report,<0.32.0>,
            {<0.1528.3109>,std_error,
             [{application,mochiweb},
              "Accept failed error",
              "{error,enfile}"]}}

        Lots of these:
        [Wed, 09 Jan 2013 16:43:46 GMT] [error] [<0.617.3110>] {error_report,<0.32.0>,
            {<0.617.3110>,crash_report,
             [[{initial_call,
                {mochiweb_acceptor,init,
                 ['Argument__1','Argument__2','Argument__3']}},
               {pid,<0.617.3110>},
               {registered_name,[]},
               {error_info,
                {exit,
                 {error,accept_failed},
                 [{mochiweb_acceptor,init,3},
                  {proc_lib,init_p_do_apply,3}]}},
               {ancestors,[couch_httpd,couch_secondary_services,couch_server_sup,<0.33.0>]},
               {messages,[]},
               {links,[<0.124.0>]},
               {dictionary,[]},
               {trap_exit,false},
               {status,running},
               {heap_size,233},
               {stack_size,24},
               {reductions,209}],
              []]}}

        And then eventually lots of these as well:
        [Wed, 09 Jan 2013 16:43:46 GMT] [error] [<0.124.0>] {error_report,<0.32.0>,
            {<0.124.0>,std_error,
             {mochiweb_socket_server,254,
              {acceptor_error,{error,accept_failed}}}}}

        Simon Eisenmann made changes -
        Attachment couchdb_httpd_request_methods-week.png [ 12487163 ]
        Simon Eisenmann made changes -
        Attachment couchdb_database_open-week.png [ 12487162 ]
        Simon Eisenmann made changes -
        Attachment couchdb_httpd_response_codes-week.png [ 12487161 ]
        Simon Eisenmann made changes -
        Comment [ Response code statistic. ]
        Simon Eisenmann made changes -
        Attachment couchdb_httpd_response_codes-week.png [ 12487161 ]
        Simon Eisenmann made changes -
        Attachment couchdb_httpd_response_codes-week.png [ 12487160 ]
        Simon Eisenmann added a comment - edited

        All right, I got some statistics data. It looks like some service crashed and was restarted. After that point it ran wild for a couple of hours and then stopped responding. The image counts the various HTTP responses of the Couch as reported by _stats: the first big drop in the numbers was the crash, the second drop was the manual restart. Between these two it seems to lose file pointers.

        I have uploaded several graphs showing a single crash event, after which it starts to act crazy and seems to run out of file pointers.

        Derek Perez added a comment -

        I have this error as well, couchdb 1.0.2 proxying SSL through stunnel:

        [Tue, 19 Jul 2011 20:04:00 GMT] [error] [<0.103.0>] {error_report,<0.32.0>,
            {<0.103.0>,std_error,
             {mochiweb_socket_server,225,{acceptor_error,{error,accept_failed}}}}}

        [Tue, 19 Jul 2011 20:04:00 GMT] [error] [<0.20541.14>] {error_report,<0.32.0>,
            {<0.20541.14>,std_error,
             [{application,mochiweb},
              "Accept failed error","{error,emfile}"]}}

        [Tue, 19 Jul 2011 20:04:00 GMT] [error] [<0.20541.14>] {error_report,<0.32.0>,
            {<0.20541.14>,crash_report,
             [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
               {pid,<0.20541.14>},
               {registered_name,[]},
               {error_info,
                {exit,
                 {error,accept_failed},
                 [{mochiweb_socket_server,acceptor_loop,1},
                  {proc_lib,init_p_do_apply,3}]}},
               {ancestors,[couch_httpd,couch_secondary_services,couch_server_sup,<0.33.0>]},
               {messages,[]},
               {links,[<0.103.0>]},
               {dictionary,[]},
               {trap_exit,false},
               {status,running},
               {heap_size,233},
               {stack_size,24},
               {reductions,197}],
              []]}}

        Simon Eisenmann added a comment -

        I had this again last night, did some further checks, and found around 800 sockets in CLOSE_WAIT state for the couchdb process. This does not seem to be related to COUCHDB-1100, though, as I do not have any large views that time out while updating.
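A socket in CLOSE_WAIT means the peer has closed the connection but the local application has not yet closed its end, so the file descriptor stays allocated. As an illustration only (the comment does not say how the count was obtained), the Python sketch below tallies TCP states for one local port from text in the Linux `/proc/net/tcp` format. The function name `count_states` and the sample rows are made up; 5984 is CouchDB's default HTTP port and may differ on a given installation.

```python
from collections import Counter

# TCP state codes as used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp: str, local_port: int) -> Counter:
    """Count sockets per TCP state whose local port equals local_port."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:1760" (hex IP : hex port)
        port_hex = fields[1].rsplit(":", 1)[1]
        if int(port_hex, 16) == local_port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


# Made-up sample in the /proc/net/tcp layout (trailing columns omitted).
# 0x1760 == 5984, 0x0016 == 22.
SAMPLE = """\
  sl  local_address rem_address   st
   0: 00000000:1760 00000000:0000 0A
   1: 0100007F:1760 0100007F:C001 08
   2: 0100007F:1760 0100007F:C002 08
   3: 0100007F:0016 0100007F:C003 01
"""

print(count_states(SAMPLE, 5984))
```

On a live Linux host the same function could be fed the contents of `/proc/net/tcp` (IPv4 only; IPv6 sockets are listed separately in `/proc/net/tcp6`).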

        Pasi Eronen added a comment -

        If netstat showed lots of connections (which consume file handles, too), could this be related to COUCHDB-1100?

        Simon Eisenmann added a comment -

        Just did some further digging and found these in the logs as well:
        [Fri, 08 Jul 2011 04:06:22 GMT] [error] [<0.20.0>] {error_report,<0.9.0>,
            {<0.20.0>,std_error,
             "File operation error: system_limit. Target: ./lib.beam. Function: get_file. Process: code_server."}}

        Looks like it ran out of file handles?
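One way to check this suspicion on Linux is to compare the number of descriptors a process holds against its limit. The Python sketch below is an assumption-laden illustration, not something taken from the thread: `count_open_fds` and `headroom` are hypothetical helper names, and the script inspects its own process rather than CouchDB.

```python
import os
import resource


def count_open_fds(fd_dir: str) -> int:
    """Count entries in a /proc/<pid>/fd style directory."""
    return len(os.listdir(fd_dir))


def headroom(open_fds: int, soft_limit: int) -> float:
    """Fraction of the descriptor limit still free (1.0 = nothing in use)."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return max(0.0, 1.0 - open_fds / soft_limit)


# Inspect the current process when /proc is available (Linux).
# The count includes the descriptor briefly used to read the directory.
if os.path.isdir("/proc/self/fd"):
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = count_open_fds("/proc/self/fd")
    print(f"open descriptors: {used}, soft limit: {soft}, hard limit: {hard}")
```

To look at a running CouchDB instead, the directory would be `/proc/<pid>/fd` of its Erlang VM process (readable by the same user or root), and that process's limits are listed in `/proc/<pid>/limits`.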

        Simon Eisenmann made changes -
        Affects Version/s: (none) -> 1.1 [ 12314933 ]
        Environment: Ubuntu Linux 8.04 32bit and 64bit with Erlang R13B01 -> Ubuntu Linux 8.04 32bit and 64bit with Erlang R13B01 or Ubuntu Linux 8.04 64bit with Erlang R14B02
        Priority: Critical [ 2 ] -> Major [ 3 ]
        Simon Eisenmann made changes -
        Resolution: Incomplete [ 4 ] -> (none)
        Status: Closed [ 6 ] -> Reopened [ 4 ]
        Skill Level: (none) -> Committers Level (Medium to Hard)
        Simon Eisenmann added a comment -

        All right, I got this issue again on one of the nodes in the cluster. The software is now CouchDB 1.1.0 with Erlang R14B02.

        After a couple of hours replicating from 3 other nodes, with constant changes on the local node, it stops accepting HTTP connections (see error below).

        I checked with netstat and again saw lots of connections using the CouchDB port.

        It only happens on one node in the cluster, though. I will keep monitoring whether it happens every day. I had a similar issue before (replication hung at some point) but thought it was related to stunnel, as there was no trace in the Couch. Yesterday I switched to native CouchDB SSL, and now there is this trace.

        [Fri, 08 Jul 2011 04:06:22 GMT] [error] [<0.10266.14>] {error_report,<0.31.0>,
            {<0.10266.14>,std_error,
             [{application,mochiweb},
              "Accept failed error",
              "{error,enfile}"]}}
        [Fri, 08 Jul 2011 04:06:22 GMT] [error] [<0.10266.14>] {error_report,<0.31.0>,
            {<0.10266.14>,crash_report,
             [[{initial_call,
                {mochiweb_acceptor,init,
                 ['Argument__1','Argument__2','Argument__3']}},
               {pid,<0.10266.14>},
               {registered_name,[]},
               {error_info,
                {exit,
                 {error,accept_failed},
                 [{mochiweb_acceptor,init,3},
                  {proc_lib,init_p_do_apply,3}]}},
               {ancestors,[https,couch_secondary_services,couch_server_sup,<0.32.0>]},
               {messages,[]},
               {links,[<0.136.0>]},
               {dictionary,[]},
               {trap_exit,false},
               {status,running},
               {heap_size,233},
               {stack_size,24},
               {reductions,372}],
              []]}}

        Stefan Kögl added a comment -

        The same just happened to me with version 1.0.2.

        [Sun, 10 Apr 2011 14:08:33 GMT] [error] [<0.7746.77>] {error_report,<0.30.0>,
            {<0.7746.77>,crash_report,
             [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
               {pid,<0.7746.77>},
               {registered_name,[]},
               {error_info,
                {exit,
                 {error,accept_failed},
                 [{mochiweb_socket_server,acceptor_loop,1},
                  {proc_lib,init_p_do_apply,3}]}},
               {ancestors,[couch_httpd,couch_secondary_services,couch_server_sup,<0.31.0>]},
               {messages,[]},
               {links,[<0.102.0>]},
               {dictionary,[]},
               {trap_exit,false},
               {status,running},
               {heap_size,233},
               {stack_size,24},
               {reductions,200}],
              []]}}

        Jan Lehnardt made changes -
        Field: Original Value -> New Value
        Status: Open [ 1 ] -> Closed [ 6 ]
        Resolution: (none) -> Incomplete [ 4 ]
        Jan Lehnardt added a comment -

        waiting for feedback, please reopen if this still persists.

        Jan Lehnardt added a comment -

        Does this still happen with 0.11.0 or trunk?

        Simon Eisenmann created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Simon Eisenmann
          • Votes:
            0
            Watchers:
            4
