CouchDB / COUCHDB-963

Erlang processes crash when running the delayed_commits test on Windows Server 2008

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.1
    • Fix Version/s: None
    • Component/s: Infrastructure
    • Labels:
      None
    • Environment:

      This Windows box is a virtual machine
      Windows Server 2008 Standard without Hyper-V Service Pack 2 64-bit
      2 GB RAM
      2 Core Intel Xeon CPU @ 2.53GHz each

      Description

      The debugging I've done points to this being an erlsrv.exe bug. Here are my steps to reproduce it.

      Install CouchDB 1.0.1 as a service using the Windows Binary Installer. I did not select "Start service after installation".
      Edit the local.ini to set the logging level to "debug".
      Go to the service control panel and start the Apache CouchDB service.
      Go to Test Suite in Futon and run the "delayed_commits" test.

      After about 15-20 seconds, refresh the service control panel to see that the service is no longer running. Process Explorer confirms that the erlsrv.exe and erl.exe processes are not running. The last message in the log is a _restart command that returns 200.
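
The failure can also be probed without Futon. The following is a minimal sketch, assuming the crash is triggered by the POST to /_restart that appears last in the log, that the server listens on 127.0.0.1:5984, and that no admin credentials are required. The function names and the base URL are illustrative and not part of CouchDB.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "http://127.0.0.1:5984"  # assumption: default local CouchDB address


def post_restart(base_url=BASE_URL):
    """POST /_restart with the JSON content type that the handler validates."""
    req = urllib.request.Request(
        base_url + "/_restart",
        data=b"",
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def is_up(base_url=BASE_URL):
    """Return True if GET / gets any HTTP answer, False if the connection fails."""
    try:
        with urllib.request.urlopen(base_url + "/", timeout=2):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except OSError:
        return False  # refused, reset or timed out: nothing is listening


def came_back(check=is_up, timeout=30.0, interval=1.0, sleep=time.sleep):
    """Poll `check` until it succeeds or roughly `timeout` seconds have passed."""
    waited = 0.0
    while waited <= timeout:
        if check():
            return True
        sleep(interval)
        waited += interval
    return False
```

Calling `post_restart()` and then, a couple of seconds later, `came_back()` against a service install should show whether the VM returns on its own. If the behaviour described above occurs, `came_back()` would be expected to return False until the service is started again by hand.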

      When I run CouchDB using CouchDB.bat, the test completes without crashing.

      When I set the DebugType in the registry HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Ericsson\Erlang\ErlSrv\1.1\Apache CouchDB to 1 (DEBUG_TYPE_NEW) to get an erlsrv.exe log, the test completes without crashing.

      I will attach the log files from CouchDB and erlsrv.exe.

      1. Apache CouchDB.debug.2
        120 kB
        Terry Smith
      2. couch.log
        655 kB
        Terry Smith
      3. couchdb_init_restart_OTP-9139_minimal.patch
        14 kB
        Dave Cottlehuber
      4. COUCHDB-963_workaround_and_improved_uninstall.patch
        2 kB
        Dave Cottlehuber
      5. dch_couch.log
        1.27 MB
        Dave Cottlehuber

        Activity

        Paul Joseph Davis added a comment -

        Applied to trunk and backported to 1.1.x and 1.0.x

        Dave Cottlehuber added a comment -

        committed to trunk r1083714 | rnewson | 2011-03-21 22:28:57 +1300 (Mon, 21 Mar 2011)

        Dave Cottlehuber added a comment -

        FYI: the upstream fix for COUCHDB-963 (OTP-9139) is attached, thanks to pan@erlang.

        It applies cleanly to Erlang/OTP R14B02 & will most likely be included in R14B03.

        Dave Cottlehuber added a comment - edited

        Workaround for COUCHDB-963, since a permanent fix is needed from upstream.

        The patch contains 5 tweaks:

        • add "-onfail restart_always" to service parameters to increase reliability
        • bring service Erlang VM parameters in line with those in couchdb.bat
        • improved uninstall by killing epmd first; this allows all binaries to be removed cleanly
        • improved install in service mode by hiding popup erlsrv.exe consoles
        • use full commandline options for erlsrv to make installer more readable
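
As an illustration of the first tweak, here is a minimal sketch of setting the failure action on an already installed service. It assumes erlsrv.exe is on the PATH and that the service is named "Apache CouchDB" as in the service control panel. The helper names are hypothetical, and the attached patch itself changes the installer template rather than running a script like this.

```python
import subprocess
import sys

SERVICE_NAME = "Apache CouchDB"  # name shown in the service control panel
ONFAIL_ACTIONS = {"reboot", "restart", "restart_always", "ignore"}


def erlsrv_onfail_command(service=SERVICE_NAME, onfail="restart_always", erlsrv="erlsrv"):
    """Build the argv for `erlsrv set <service> -onfail <action>`."""
    if onfail not in ONFAIL_ACTIONS:
        raise ValueError("unknown onfail action: %r" % (onfail,))
    return [erlsrv, "set", service, "-onfail", onfail]


def apply_onfail(service=SERVICE_NAME, onfail="restart_always"):
    """Run the command; only meaningful on Windows with Erlang installed."""
    if sys.platform != "win32":
        raise RuntimeError("erlsrv is only available on Windows")
    subprocess.run(erlsrv_onfail_command(service, onfail), check=True)
```

Building the argument list separately from running it keeps the command easy to inspect before it touches the service configuration.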
        Adam Kocoloski added a comment -

        Summarizing our IRC conversation, I'm +1 on setting the autorestart bit on the daemonized version. Also +1 on submitting a bug report to OTP regarding init:restart() and erl.exe/erlsrv.exe

        Dave Cottlehuber added a comment -

        It looks like a suitable workaround will be to adapt the erlsrv.exe build process (in windows.iss.tpl) to set the onfail action to restart the service.

        Dave Cottlehuber added a comment -

        Last note for the night: there is a difference between the service and the .bat file. The batch file uses werl.exe, while the service uses erlsrv.exe and erl.exe. There are 2 commits that might help resolve this:

        Jan's parameters in couchdb.bat are not (yet) included in the service config:
        https://github.com/apache/couchdb/commit/bbd0703f769bac618c0dec22ebf2d14b0a5df5b8
        needs to be included into https://github.com/apache/couchdb/blob/trunk/etc/windows/couchdb.iss.tpl#L77

        Damien's older "use werl.exe not erl.exe":
        https://github.com/apache/couchdb/blob/trunk/bin/couchdb.bat.tpl.in#L23

        Some things to try:

        • running init:restart() from the erl or werl console directly
        • changing the service parameters to match Jan's commit
        • using psexec -s .... (from sysinternals.com) to launch an interactive prompt under the system account, and trying couchdb.bat from there
        • using the config from the service instead of the batch file: does it fail under a normal user account as well?

        Cheers
        Dave

        Dave Cottlehuber added a comment -

        couch log from DaveCottlehuber

        Dave Cottlehuber added a comment -

        Also reproduced on an EC2 large instance running Windows Server 2008 R1 SP2 Datacenter Edition.

        My reading is that this is an Erlang VM problem.

        Not all restarts fail:
        [Tue, 08 Feb 2011 08:16:25 GMT] [info] [<0.362.0>] 125.236.236.206 - - 'POST' /_restart 200

        [Tue, 08 Feb 2011 08:16:27 GMT] [info] [<0.398.0>] Apache CouchDB has started on http://0.0.0.0:5984/

        but others do. The 7-minute delay below ends with my manual restart coming in:

        [Tue, 08 Feb 2011 08:23:37 GMT] [debug] [<0.2056.0>] 'POST' /_restart {1,1}
        Headers: [{'Accept',"application/json"},
                  {'Accept-Charset',"ISO-8859-1,utf-8;q=0.7,*;q=0.7"},
                  {'Accept-Encoding',"gzip, deflate"},
                  {'Accept-Language',"en-us,en;q=0.5"},
                  {'Cache-Control',"no-cache"},
                  {'Connection',"keep-alive"},
                  {'Content-Length',"0"},
                  {'Content-Type',"application/json; charset=UTF-8"},
                  {'Cookie',"AuthSession="},
                  {'Host',"ec2-204-236-204-144.compute-1.amazonaws.com:5984"},
                  {'Keep-Alive',"115"},
                  {'Pragma',"no-cache"},
                  {'Referer',"http://ec2-204-236-204-144.compute-1.amazonaws.com:5984/_utils/couch_tests.html?script/couch_tests.js"},
                  {'User-Agent',"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b10) Gecko/20100101 Firefox/4.0b10"}]

        [Tue, 08 Feb 2011 08:23:37 GMT] [debug] [<0.2056.0>] OAuth Params: []

        [Tue, 08 Feb 2011 08:23:37 GMT] [info] [<0.2056.0>] 125.236.236.206 - - 'POST' /_restart 200

        [Tue, 08 Feb 2011 08:30:30 GMT] [info] [<0.34.0>] Apache CouchDB has started on http://0.0.0.0:5984/

        So it looks as if CouchDB shuts down cleanly through:

        handle_restart_req(#httpd{method='POST'}=Req) ->
            couch_httpd:validate_ctype(Req, "application/json"),
            ok = couch_httpd:verify_is_server_admin(Req),
            couch_server_sup:restart_core_server(),

        which is short & sweet:

        restart_core_server() ->
            init:restart().

        leaving couch now & into (I think) the vm:

        restart() -> init ! {stop,restart}, ok.
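
The difference between a good and a bad restart in the log excerpts above is the gap between the `/_restart 200` entry and the next "has started" entry. A minimal sketch for measuring that gap across a whole couch.log follows, assuming the line format shown in the excerpts. The function name and regular expression are illustrative only.

```python
import re
from datetime import datetime

# Matches lines such as:
# [Tue, 08 Feb 2011 08:16:25 GMT] [info] [<0.362.0>] ... 'POST' /_restart 200
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] \[(?P<pid>[^\]]+)\] (?P<msg>.*)$"
)
TS_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"  # assumes an English/C locale


def restart_gaps(lines):
    """Return (restart_time, seconds_until_next_start) for each /_restart 200.

    The second element is None if no later "has started" line was found.
    """
    gaps = []
    pending = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # header dumps and other continuation lines
        ts = datetime.strptime(m.group("ts"), TS_FORMAT)
        msg = m.group("msg")
        if m.group("level") == "info" and msg.endswith("/_restart 200"):
            pending = ts
        elif "Apache CouchDB has started" in msg and pending is not None:
            gaps.append((pending, (ts - pending).total_seconds()))
            pending = None
    if pending is not None:
        gaps.append((pending, None))
    return gaps
```

On the four [info] lines quoted in this comment, the sketch reports a gap of 2 seconds for the restart that worked and 413 seconds for the one that needed a manual service start.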

        Terry Smith added a comment -

        I forgot to mention that I tested running the service as an administrator user and had also set the permissions on the CouchDB database_dir and view_index_dir to allow "Everyone" full access. I had also moved the database_dir and view_index_dir to be located in a folder other than c:\Program Files(x86)\Apache Software Foundation\CouchDB\var\lib\couchdb in case there may be a problem with access to folders in c:\Program Files(x86). These changes made no difference.

        Only having the console output going to the console or to a file allowed the test to complete without crashing.

        Terry Smith added a comment -

        couch.log is the CouchDB debug log and Apache CouchDB.debug.2 is the erlsrv.exe log


          People

          • Assignee: Unassigned
          • Reporter: Terry Smith
          • Votes: 0
          • Watchers: 1
