CouchDB
  1. CouchDB
  2. COUCHDB-1346

CouchDB hangs during start of view indexing

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.3
    • Component/s: View Server Support
    • Labels:
    • Environment:

      Windows 7 Enterprise only, not able to replicate on Mac OS X.
      Erlang R14B03 + crypto patches.
      Mozilla Javascript 1.8.5

    • Skill Level:
      Guru Level (Everyone buy this person a beer at the next conference!)

      Description

      [info] [<0.20499.0>] Opening index for db: test_suite_db idx: f4421bf4e9c9bf2acb3db91bca9e9adc sig: "d5c87ad33242b181f86be2139cbccd96"
      [info] [<0.20504.0>] Starting index update for db: test_suite_db idx: f4421bf4e9c9bf2acb3db91bca9e9adc
      [info] [<0.20334.0>] 172.16.40.1 - - POST /test_suite_db/_temp_view 500
      [info] [<0.20513.0>] 172.16.40.1 - - GET /_utils/couch_tests.html?script/couch_tests.js 200
      [info] [<0.20514.0>] 172.16.40.1 - - GET /_utils/index.html 200
      [info] [<0.20060.0>] 172.16.40.1 - - DELETE /test_suite_db_a/ 200
      [info] [<0.20407.0>] 172.16.40.1 - - GET /test_suite_reports/ 404
      [info] [<0.20058.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20071.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20069.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20484.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20364.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20062.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20388.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20345.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20072.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20059.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20061.0>] 172.16.40.1 - - DELETE /test_suite_db/ 404
      [info] [<0.20472.0>] 172.16.40.1 - - DELETE /test_suite_db/ 200
      [error] [<0.20050.0>] ** Generic server couch_index_server terminating

        • Last message in was {'$gen_cast',
          Unknown macro: {reset_indexes,<<"test_suite_db">>}

          }

        • When Server state == {st,"../var/lib/couchdb"}
          ** Reason for termination ==
          ** case_clause,{error,eacces,
          [{couch_file,'-nuke_dir/2-fun-0-',3},
          {lists,foreach,2},
          {couch_file,nuke_dir,2},
          {couch_index_server,handle_cast,2},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]}
          =ERROR REPORT==== 23-Nov-2011::21:17:14 ===
          ** Generic server couch_index_server terminating
          ** Last message in was {'$gen_cast',{reset_indexes,<<"test_suite_db">>}}
          ** When Server state == {st,"../var/lib/couchdb"}
        • Reason for termination ==
        • case_clause,{error,eacces,
          [ {couch_file,'-nuke_dir/2-fun-0-',3},
          {lists,foreach,2},
          {couch_file,nuke_dir,2},
          {couch_index_server,handle_cast,2},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]}
          [error] [<0.20050.0>] {error_report,<0.19957.0>,
          {<0.20050.0>,crash_report,
          [[{initial_call,
          {couch_index_server,init,['Argument__1']}},
          {pid,<0.20050.0>},
          {registered_name,couch_index_server},
          {error_info,
          {exit,
          case_clause,{error,eacces,
          [{couch_file,'-nuke_dir/2-fun-0-',3}

          ,

          {lists,foreach,2},
          {couch_file,nuke_dir,2},
          {couch_index_server,handle_cast,2},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]},
          [{gen_server,terminate,6},
          {proc_lib,init_p_do_apply,3}]}},
          {ancestors, [couch_secondary_services,couch_server_sup, <0.19958.0>]},
          {messages,
          [{'$gen_cast',
          {reset_indexes,<<"test_suite_db_a">>}}]},
          {links,[<0.20051.0>,<0.20026.0>]},
          {dictionary,[]},
          {trap_exit,true},
          {status,running},
          {heap_size,1597},
          {stack_size,24},
          {reductions,12211}],
          [{neighbour,
          [{pid,<0.20051.0>},
          {registered_name,[]},
          {initial_call,
          {couch_event_sup,init,['Argument__1']}},
          {current_function,{gen_server,loop,6}},
          {ancestors, [couch_index_server, couch_secondary_services, couch_server_sup,<0.19958.0>]},
          {messages,[]},
          {links,[<0.20050.0>,<0.20018.0>]},
          {dictionary,[]},
          {trap_exit,false},
          {status,waiting},
          {heap_size,233},
          {stack_size,9},
          {reductions,32}]}]]}}
          =CRASH REPORT==== 23-Nov-2011::21:17:14 ===
          crasher:
          initial call: couch_index_server:init/1
          pid: <0.20050.0>
          registered_name: couch_index_server
          exception exit: case_clause,{error,eacces,
          [{couch_file,'-nuke_dir/2-fun-0-',3},
          {lists,foreach,2}

          ,

          {couch_file,nuke_dir,2},
          {couch_index_server,handle_cast,2},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]}
          in function gen_server:terminate/6
          ancestors: [couch_secondary_services,couch_server_sup,<0.19958.0>]
          messages: [{'$gen_cast',{reset_indexes,<<"test_suite_db_a">>}}]
          links: [<0.20051.0>,<0.20026.0>]
          dictionary: []
          trap_exit: true
          status: running
          heap_size: 1597
          stack_size: 24
          reductions: 12211
          neighbours:
          neighbour: [{pid,<0.20051.0>},
          {registered_name,[]},
          {initial_call,{couch_event_sup,init,['Argument__1']}},
          {current_function,{gen_server,loop,6}},
          {ancestors,[couch_index_server,couch_secondary_services, couch_server_sup,<0.19958.0>]},
          {messages,[]},
          {links,[<0.20050.0>,<0.20018.0>]},
          {dictionary,[]},
          {trap_exit,false},
          {status,waiting},
          {heap_size,233},
          {stack_size,9},
          {reductions,32}]
          [error] [<0.20026.0>] {error_report,<0.19957.0>,
          {<0.20026.0>,supervisor_report,
          [{supervisor,{local,couch_secondary_services}},
          {errorContext,child_terminated},
          {reason,
          case_clause,{error,eacces,
          [{couch_file,'-nuke_dir/2-fun-0-',3},
          {lists,foreach,2},
          {couch_file,nuke_dir,2}

          ,

          {couch_index_server,handle_cast,2}

          ,

          {gen_server,handle_msg,5}

          ,

          {proc_lib,init_p_do_apply,3}

          ]}},
          {offender,
          [

          {pid,<0.20050.0>}

          ,

          {name,index_server}

          ,
          {mfargs,{couch_index_server,start_link,[]}},

          {restart_type,permanent}

          ,

          {shutdown,brutal_kill}

          ,

          {child_type,worker}

          ]}]}}

      OS process tree at this time is:
      Process information for SENDAI:

      Name Pid Pri Thd Hnd VM WS Priv
      Idle 0 0 2 0 0 24 0
      System 4 8 79 477 3380 304 108
      explorer 1984 8 21 664 213732 46340 21540
      cmd 2104 8 1 25 48132 3304 2144
      pslist 2776 13 1 133 63584 4976 2000
      cmd 2504 8 1 26 44980 3512 3012
      werl 2680 8 16 390 196232 40064 28628
      win32sysinfo 1152 8 1 21 12624 2124 640
      couchspawnkillable 1444 8 1 30 12992 2284 688
      couchjs 1468 8 1 39 55900 6572 4056
      couchspawnkillable 2740 8 1 30 12992 2280 684
      couchjs 2756 8 1 39 55900 7108 4444

      Erlang resumes running CouchDB when couchjs procs are terminated with extreme
      prejudice. The hang still occurs after reverting fdmanana's COUCHDB-1334
      commit. This could be a race condition during invalidation of the views, and
      subsequent deletion of the related ddoc view directory prior to reindexing.
      On Windows a filesystem object cannot be deleted if there are open handles
      remaining.

        Issue Links

          Activity

          Dave Cottlehuber created issue -
          Hide
          Dave Cottlehuber added a comment -

          Likely this issue was introduced in COUCHDB-1270 when nuke_dir appeared, git commit c66ebaab.

          Show
          Dave Cottlehuber added a comment - Likely this issue was introduced in COUCHDB-1270 when nuke_dir appeared, git commit c66ebaab.
          Hide
          Joan Touzet added a comment -

          As discussed with rnewson on IRC, this is the classic Windows vs. Unix file removal issue - not an attempt to remote a directory using file:delete/1.

          UNIX allows you to delete a file that is still in use. The directory entry is removed, but the inode is still marked as in use until the last file handle is closed, at which time the disk space is freed. Windows doesn't allow removal of a file with open file handles until they are all closed, returning EACCES upon delete request instead.

          A quick fix would be to progressively back off and retry the delete a few times, if this is just timing-related. If not, more care is going to have to be taken to avoid the race/etc.

          Show
          Joan Touzet added a comment - As discussed with rnewson on IRC, this is the classic Windows vs. Unix file removal issue - not an attempt to remote a directory using file:delete/1 . UNIX allows you to delete a file that is still in use. The directory entry is removed, but the inode is still marked as in use until the last file handle is closed, at which time the disk space is freed. Windows doesn't allow removal of a file with open file handles until they are all closed, returning EACCES upon delete request instead. A quick fix would be to progressively back off and retry the delete a few times, if this is just timing-related. If not, more care is going to have to be taken to avoid the race/etc.
          Dave Cottlehuber made changes -
          Field Original Value New Value
          Assignee Dave Cottlehuber [ dch ]
          Hide
          Dave Cottlehuber added a comment -

          Sorry, lost this ticket by not assigning it to myself.

          Atomic moves of filenames are allowed within the same filesystem
          on Windows, even if the file is open.

          Modifying nuke_dir to relocate all files into .delete & then remove
          looks like a reasonable approach to resolving this cross-platform,
          that is crash-safe and can use a backoff to clean up if reqd on
          Window when view files are still held open.

          Erlang R15B01 (erts-5.9.1) [source] [smp:2:2] [async-threads:0]

          Eshell V5.9.1 (abort with ^G)
          1> pwd(),
          1> file:make_dir("nested"),
          1> file:make_dir(".delete"),
          1>

          {ok, File}

          = file:open("nested/file", [append]),
          1> file:write(File, <<"fill me up!">>).
          C:/tmp
          ok

          %% can't move a directory that has open handles
          2> file:rename("nested", ".delete
          nested").

          {error,einval}
          3> file:rename("nested", ".delete/nested").{error,einval}

          %% but we can move files that have open handles
          4> file:rename("nested/file", ".delete/file").
          ok

          5> file:write(File, <<"Moar Data">>).
          ok
          6>

          Show
          Dave Cottlehuber added a comment - Sorry, lost this ticket by not assigning it to myself. Atomic moves of filenames are allowed within the same filesystem on Windows, even if the file is open. Modifying nuke_dir to relocate all files into .delete & then remove looks like a reasonable approach to resolving this cross-platform, that is crash-safe and can use a backoff to clean up if reqd on Window when view files are still held open. Erlang R15B01 (erts-5.9.1) [source] [smp:2:2] [async-threads:0] Eshell V5.9.1 (abort with ^G) 1> pwd(), 1> file:make_dir( "nested"), 1> file:make_dir( ".delete"), 1> {ok, File} = file:open( "nested/file", [append] ), 1> file:write(File , <<"fill me up!">>). C:/tmp ok %% can't move a directory that has open handles 2> file:rename( "nested", ".delete nested"). {error,einval} 3> file:rename( "nested", ".delete/nested").{error,einval} %% but we can move files that have open handles 4> file:rename( "nested/file", ".delete/file"). ok 5> file:write(File , <<"Moar Data">>). ok 6>
          Hide
          Dave Cottlehuber added a comment -

          Refer to pull request https://github.com/apache/couchdb/pull/33 from Adam Lofts.

          Show
          Dave Cottlehuber added a comment - Refer to pull request https://github.com/apache/couchdb/pull/33 from Adam Lofts.
          Hide
          Dave Cottlehuber added a comment -

          @davisp advises we should be using a refcounter on open view files, and switching when these expire. Going to need some help on this!

          Show
          Dave Cottlehuber added a comment - @davisp advises we should be using a refcounter on open view files, and switching when these expire. Going to need some help on this!
          Hide
          Benoit Chesneau added a comment -

          @davisp since couch_index know who is waiting for the index couldn't we use that information for such purpose?

          Show
          Benoit Chesneau added a comment - @davisp since couch_index know who is waiting for the index couldn't we use that information for such purpose?
          Adam Kocoloski made changes -
          Assignee Dave Cottlehuber [ dch ] Adam Kocoloski [ kocolosk ]
          Hide
          Adam Kocoloski added a comment - - edited

          I took a closer look at this ticket today and came up with the patch below to ensure that view index files are closed before we attempt to nuke the directory when executing a reset_indexes call. Dave Cottlehuber helped me load the patch on a test Windows instance at EC2, and we confirmed that the patch allows the "basics" entry in the test suite to run to completion when it hadn't before. So, yay for that.

          Unfortunately, we're noticing frequent hangs in other portions of the test suite, including but not limited to the "design_options" test. My observations from an evening running tests:

          • The design_options test always passes with the log level set to "debug".
          • When the level is set to "info" the test often hangs, and the Erlang VM seems to hang as well.
          • The hang often occurs on the second invocation of the test.
          • The hang occurs with and without my patch.

          That last point is rather crucial. Dave Cottlehuber indicated that the test had been passing, but I can't seem to make that happen even with a stock build of the HEAD of 1.3.x. My current recommendation is to review and apply this patch as it only improves matters during my testing. That being said, I'd love to understand the root cause of these hangs when debug logging is disabled.

          diff --git a/src/couch_index/src/couch_index_server.erl b/src/couch_index/src/couch_index_server.erl
          index 48fa8e4..bc1fce7 100644
          — a/src/couch_index/src/couch_index_server.erl
          +++ b/src/couch_index/src/couch_index_server.erl
          @@ -160,7 +160,9 @@ reset_indexes(DbName, Root) ->
          % shutdown all the updaters and clear the files, the db got changed
          Fun = fun({_, {DDocId, Sig}}) ->
          [{_, Pid}] = ets:lookup(?BY_SIG,

          {DbName, Sig}

          ),

          • couch_util:shutdown_sync(Pid),
            + MRef = erlang:monitor(process, Pid),
            + gen_server:cast(Pid, delete),
            + receive {'DOWN', MRef, _, _, _}

            -> ok end,
            rem_from_ets(DbName, Sig, DDocId, Pid)
            end,
            lists:foreach(Fun, ets:lookup(?BY_DB, DbName)),

          Show
          Adam Kocoloski added a comment - - edited I took a closer look at this ticket today and came up with the patch below to ensure that view index files are closed before we attempt to nuke the directory when executing a reset_indexes call. Dave Cottlehuber helped me load the patch on a test Windows instance at EC2, and we confirmed that the patch allows the "basics" entry in the test suite to run to completion when it hadn't before. So, yay for that. Unfortunately, we're noticing frequent hangs in other portions of the test suite, including but not limited to the "design_options" test. My observations from an evening running tests: The design_options test always passes with the log level set to "debug". When the level is set to "info" the test often hangs, and the Erlang VM seems to hang as well. The hang often occurs on the second invocation of the test. The hang occurs with and without my patch. That last point is rather crucial. Dave Cottlehuber indicated that the test had been passing, but I can't seem to make that happen even with a stock build of the HEAD of 1.3.x. My current recommendation is to review and apply this patch as it only improves matters during my testing. That being said, I'd love to understand the root cause of these hangs when debug logging is disabled. diff --git a/src/couch_index/src/couch_index_server.erl b/src/couch_index/src/couch_index_server.erl index 48fa8e4..bc1fce7 100644 — a/src/couch_index/src/couch_index_server.erl +++ b/src/couch_index/src/couch_index_server.erl @@ -160,7 +160,9 @@ reset_indexes(DbName, Root) -> % shutdown all the updaters and clear the files, the db got changed Fun = fun({_, {DDocId, Sig}}) -> [{_, Pid}] = ets:lookup(?BY_SIG, {DbName, Sig} ), couch_util:shutdown_sync(Pid), + MRef = erlang:monitor(process, Pid), + gen_server:cast(Pid, delete), + receive {'DOWN', MRef, _, _, _} -> ok end, rem_from_ets(DbName, Sig, DDocId, Pid) end, lists:foreach(Fun, ets:lookup(?BY_DB, DbName)),
          Hide
          Dave Cottlehuber added a comment -

          Thanks Adam for your help on this. The most important point above is that the VM is hanging. It's possible that the IO during debug logging is too fast for the VM to keep up with, and that we end up blocking other IO. I will re-do these tests with +A 4 (which is what the Windows daemon & also unix uses also) and see if this helps. I am +1 on moving ahead with this patch & the release, and to see if the OTP team have any suggestions on troubleshooting the VM hang. I should make some good progress on this today also.

          Show
          Dave Cottlehuber added a comment - Thanks Adam for your help on this. The most important point above is that the VM is hanging. It's possible that the IO during debug logging is too fast for the VM to keep up with, and that we end up blocking other IO. I will re-do these tests with +A 4 (which is what the Windows daemon & also unix uses also) and see if this helps. I am +1 on moving ahead with this patch & the release, and to see if the OTP team have any suggestions on troubleshooting the VM hang. I should make some good progress on this today also.
          Hide
          Adam Kocoloski added a comment -

          I did some more reading and log-based debugging this morning. We're doing something rather screwy. It turns out the the individual view index processes are already monitoring their parent DB; when we delete the DB, the index is automatically closed:

          https://github.com/apache/couchdb/blob/bde29b/src/couch_index/src/couch_index.erl#L80
          https://github.com/apache/couchdb/blob/bde29b/src/couch_index/src/couch_index.erl#L296-L300

          The handler for that 'DOWN' message will invoke the gen_server's terminate function, which closes the file descriptor cleanly. So that's A Good Thing. Unfortunately, we don't always let that cleanup run to completion, because we've got this separate DB update notification listener:

          https://github.com/apache/couchdb/blob/bde29b/src/couch_index/src/couch_index_server.erl#L159-L168

          The couch_index processes do not trap exits, so when the reset_indexes function calls shutdown_sync it terminates the couch_index process immediately, bypassing any additional cleanup that we wanted to do. The patch I wrote allows for the termination to finish cleanly.

          It seems to me that even with the existing shutdown_sync invocation we'd eventually close all the file descriptors because of exit signal propagation, but at that point we may be racing the process that tries to delete them (and when we lose, we hang). The clean shutdown avoids that race.

          Show
          Adam Kocoloski added a comment - I did some more reading and log-based debugging this morning. We're doing something rather screwy. It turns out the the individual view index processes are already monitoring their parent DB; when we delete the DB, the index is automatically closed: https://github.com/apache/couchdb/blob/bde29b/src/couch_index/src/couch_index.erl#L80 https://github.com/apache/couchdb/blob/bde29b/src/couch_index/src/couch_index.erl#L296-L300 The handler for that 'DOWN' message will invoke the gen_server's terminate function, which closes the file descriptor cleanly. So that's A Good Thing. Unfortunately, we don't always let that cleanup run to completion, because we've got this separate DB update notification listener: https://github.com/apache/couchdb/blob/bde29b/src/couch_index/src/couch_index_server.erl#L159-L168 The couch_index processes do not trap exits, so when the reset_indexes function calls shutdown_sync it terminates the couch_index process immediately, bypassing any additional cleanup that we wanted to do. The patch I wrote allows for the termination to finish cleanly. It seems to me that even with the existing shutdown_sync invocation we'd eventually close all the file descriptors because of exit signal propagation, but at that point we may be racing the process that tries to delete them (and when we lose, we hang). The clean shutdown avoids that race.
          Hide
          Jan Lehnardt added a comment -

          Adam, how are you running the `design_options` tests?

          Show
          Jan Lehnardt added a comment - Adam, how are you running the `design_options` tests?
          Hide
          Volker Mische added a comment -

          This issue sounds similar to one we had at Couchbase.

          The Couchbase ticket: http://www.couchbase.com/issues/browse/MB-6957
          Which lead to Filipe's OTP patch: https://github.com/erlang/otp/commit/7c9ed2ed55e297dd78e7434851fa20063aa365f2
          And to a patch for the Couchbase-couchdb fork (which is still needed even when the OTP fix is in, see the ticket for more information): http://review.couchbase.org/22042

          Show
          Volker Mische added a comment - This issue sounds similar to one we had at Couchbase. The Couchbase ticket: http://www.couchbase.com/issues/browse/MB-6957 Which lead to Filipe's OTP patch: https://github.com/erlang/otp/commit/7c9ed2ed55e297dd78e7434851fa20063aa365f2 And to a patch for the Couchbase-couchdb fork (which is still needed even when the OTP fix is in, see the ticket for more information): http://review.couchbase.org/22042
          Hide
          Adam Kocoloski added a comment -

          Jan Lehnardt I was running it from Chrome installed locally on the Windows EC2 instance.

          Show
          Adam Kocoloski added a comment - Jan Lehnardt I was running it from Chrome installed locally on the Windows EC2 instance.
          Hide
          Dave Cottlehuber added a comment -

          What are we testing off?

          A fresh 1.3.x https://www.dropbox.com/sh/jeifcxpbtpo78ak/Jw-fm6LJrR/snapshots/20121202/setup-couchdb-1.3.0a-f1b2304-git_otp_R15B01.exe and a new .beam file with Adam's patch here https://www.dropbox.com/sh/jeifcxpbtpo78ak/L_6G_6HeeT/snapshots/20121202/couch_index_server.beam also that needs to be dropped into place.

          What are we seeing?

          The VM isn't actually completely hung, just blocking on IO and the console is part of that. This seems not to change based on +A x settings. The VM is waiting on input from couchjs processes, and when these are killed, couch springs back into life. Reducing [os_query_servers] os_process_limit = 1 still sees a hang, although as a bonus you can still talk to the VM now.

          Swapping in couchjs.exe from 1.2.0 shows no discernable difference, both hang.

          Show
          Dave Cottlehuber added a comment - What are we testing off? A fresh 1.3.x https://www.dropbox.com/sh/jeifcxpbtpo78ak/Jw-fm6LJrR/snapshots/20121202/setup-couchdb-1.3.0a-f1b2304-git_otp_R15B01.exe and a new .beam file with Adam's patch here https://www.dropbox.com/sh/jeifcxpbtpo78ak/L_6G_6HeeT/snapshots/20121202/couch_index_server.beam also that needs to be dropped into place. What are we seeing? The VM isn't actually completely hung, just blocking on IO and the console is part of that. This seems not to change based on +A x settings. The VM is waiting on input from couchjs processes, and when these are killed, couch springs back into life. Reducing [os_query_servers] os_process_limit = 1 still sees a hang, although as a bonus you can still talk to the VM now. Swapping in couchjs.exe from 1.2.0 shows no discernable difference, both hang.
          Hide
          Paul Joseph Davis added a comment -

          Reviewed Adam's patch and I'm +1 on applying it.

          @dch - The fact that you can kill couchjs to free up the VM is quite intriguing. Is it possible to get a stack trace on Windows like you can with gdb on linux? If you can, having a stack trace of a couchjs that you then kill to free the VM could possibly be the key to figuring that out.

          Show
          Paul Joseph Davis added a comment - Reviewed Adam's patch and I'm +1 on applying it. @dch - The fact that you can kill couchjs to free up the VM is quite intriguing. Is it possible to get a stack trace on Windows like you can with gdb on linux? If you can, having a stack trace of a couchjs that you then kill to free the VM could possibly be the key to figuring that out.
          Hide
          Paul Joseph Davis added a comment -

          @dch - And if you can do that, getting a backtrace for all threads in the Erlang VM before you kill couchjs as well might help.

          Show
          Paul Joseph Davis added a comment - @dch - And if you can do that, getting a backtrace for all threads in the Erlang VM before you kill couchjs as well might help.
          Show
          Jan Lehnardt added a comment - This may or may not be of service: http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows
          Hide
          Dave Cottlehuber added a comment -

          Generally, there are repeated hangs during basic things like fputs/freads.
          I see buffers in couchjs that appear to be missing a character such as
          '\nnteger', & from what I read couchjs is not returning from the blocking
          fputs/fgets OS call. I can't tell if the port is still open on the erlang
          side or not, I assume so.

          e.g. _ptr 0x74aa32a0 "[\"map_doc\",

          {\"_id\":\"99\",\"_rev\":\"1-463ef1de0f185aeab70ce3c346a96572\",\"integer\":99,\"string\":\"99\"}

          ]\n}]\nnteger);
          n })\"]\n" char *

          I think this is because we are doing open_port(... binary ...) in Erlang,
          but opening the stream as text mode only in Windows. Thus the next line
          will lose a character (explaining the '\nnteger' stuff I was seeing in
          the debugger), and eventually we'll hit a point where we waitin on an
          additional character that will never be sent.

          I tried changing fopen to "rb" but I think we only use this for files like main.js and not stdin:

          diff --git i/src/couchdb/priv/couch_js/util.c w/src/couchdb/priv/couch_js/util.c
          index 5c88402..e46f66f 100644
          — i/src/couchdb/priv/couch_js/util.c
          +++ w/src/couchdb/priv/couch_js/util.c
          @@ -33,7 +33,7 @@ slurp_file(const char* file, char** outbuf_p)
          if(strcmp(file, "-") == 0)

          { fp = stdin; }

          else {

          • fp = fopen(file, "r");
            + fp = fopen(file, "rb");
            if(fp == NULL) {
            fprintf(stderr, "Failed to read file: %s\n", file);
            exit(3);

          So next step is digging for that on erlang side instead and sending "\r\n"
          instead of just "\n". Maybe we can drop binary as an option for windows
          ports. I'm assuming this is in src/couchdb/couch_os_process.erl
          or src/couchdb/couch_os_daemons.erl and related to
          http://erldocs.com/R15B/erts/erlang.html?i=0&search=open_port#open_port/2
          which seems to be set here:

          git grep PORT_OPTIONS
          src/couchdb/couch_os_daemons.erl:-define(PORT_OPTIONS, [stream,

          {line, 1024}

          , binary, exit_status, hide]).
          src/couchdb/couch_os_daemons.erl: Port = open_port(

          {spawn, Spawnkiller ++ " " ++ Command}

          , ?PORT_OPTIONS),
          src/couchdb/couch_os_process.erl:-define(PORT_OPTIONS, [stream,

          {line, 4096}

          , binary, exit_status, hide]).
          src/couchdb/couch_os_process.erl: start_link(Command, Options, ?PORT_OPTIONS).

          Stack Traces:

              1. COUCHJS ###

          Callstack for Thread 1 (Thread Id: 1372 (0x55c)):
          Index Function
          --------------------------------------------------------------------------------
          1 ntdll.dll!_NtWriteFile@36()
          2 KernelBase.dll!_WriteFile@20()
          3 msvcr100.dll!__write_nolock()
          4 msvcr100.dll!__write()
          5 msvcr100.dll!__flush()
          6 msvcr100.dll!__fflush_nolock()
          7 msvcr100.dll!_fflush()
          *8 couchjs.exe!couch_print(JSContext * cx=0x025874f0, unsigned int argc=1, unsigned __int64 * argv=0x026401a0)
          9 couchjs.exe!print(JSContext * cx=0x025874f0, unsigned int argc=1, unsigned __int64 * vp=0x02640190)
          10 mozjs185-1.0.dll!6decf09d()
          11 [Frames below may be incorrect and/or missing, no symbols loaded for mozjs185-1.0.dll]
          12 ntdll.dll!_RtlpHeapFindListLookupEntry@20()
          13 ntdll.dll!_RtlpFindEntry@8()
          14 0260b5c8()
          15 ntdll.dll!_RtlpHeapFindListLookupEntry@20()
          16 ntdll.dll!_RtlpFindEntry@8()
          17 ntdll.dll!@RtlpFreeHeap@16()
          18 ntdll.dll!_RtlpHeapAddListEntry@24()
          19 ntdll.dll!@RtlpCreateSplitBlock@28()
          20 ntdll.dll!@RtlpAllocateHeap@24()
          21 ntdll.dll!_RtlAllocateHeap@12()

          Which is in util.c:

          couch_print(JSContext* cx, uintN argc, jsval* argv)
          {
          char *bytes = NULL;
          FILE *stream = stdout;

          if (argc) {
          if (argc > 1 && argv[1] == JSVAL_TRUE)

          { stream = stderr; }

          bytes = enc_string(cx, argv[0], NULL);
          if(!bytes) return;
          fprintf(stream, "%s", bytes);
          JS_free(cx, bytes);
          }

          fputc('\n', stream); // dch: we never return from this
          fflush(stream);
          }

              1. COUCHSPAWNKILLABLE ###

          couchspawnkillable looks clean to me, its just waiting for the couchjs process to terminate:

          // Wait for the process to terminate so we can reflect the exit code
          // back to couch.
          WaitForSingleObject(pi.hProcess, INFINITE);
          if (!GetExitCodeProcess(pi.hProcess, &exitcode))
          return 6;
          return exitcode;
          }

          Show
          Dave Cottlehuber added a comment - Generally, there are repeated hangs during basic things like fputs/freads. I see buffers in couchjs that appear to be missing a character such as '\nnteger', & from what I read couchjs is not returning from the blocking fputs/fgets OS call. I can't tell if the port is still open on the erlang side or not, I assume so. e.g. _ptr 0x74aa32a0 "[\"map_doc\", {\"_id\":\"99\",\"_rev\":\"1-463ef1de0f185aeab70ce3c346a96572\",\"integer\":99,\"string\":\"99\"} ]\n}]\nnteger); n })\"]\n" char * I think this is because we are doing open_port(... binary ...) in Erlang, but opening the stream as text mode only in Windows. Thus the next line will lose a character (explaining the '\nnteger' stuff I was seeing in the debugger), and eventually we'll hit a point where we waitin on an additional character that will never be sent. I tried changing fopen to "rb" but I think we only use this for files like main.js and not stdin: diff --git i/src/couchdb/priv/couch_js/util.c w/src/couchdb/priv/couch_js/util.c index 5c88402..e46f66f 100644 — i/src/couchdb/priv/couch_js/util.c +++ w/src/couchdb/priv/couch_js/util.c @@ -33,7 +33,7 @@ slurp_file(const char* file, char** outbuf_p) if(strcmp(file, "-") == 0) { fp = stdin; } else { fp = fopen(file, "r"); + fp = fopen(file, "rb"); if(fp == NULL) { fprintf(stderr, "Failed to read file: %s\n", file); exit(3); So next step is digging for that on erlang side instead and sending "\r\n" instead of just "\n". Maybe we can drop binary as an option for windows ports. I'm assuming this is in src/couchdb/couch_os_process.erl or src/couchdb/couch_os_daemons.erl and related to http://erldocs.com/R15B/erts/erlang.html?i=0&search=open_port#open_port/2 which seems to be set here: git grep PORT_OPTIONS src/couchdb/couch_os_daemons.erl:-define(PORT_OPTIONS, [stream, {line, 1024} , binary, exit_status, hide]). src/couchdb/couch_os_daemons.erl: Port = open_port( {spawn, Spawnkiller ++ " " ++ Command} , ?PORT_OPTIONS), src/couchdb/couch_os_process.erl:-define(PORT_OPTIONS, [stream, {line, 4096} , binary, exit_status, hide]). src/couchdb/couch_os_process.erl: start_link(Command, Options, ?PORT_OPTIONS). Stack Traces: COUCHJS ### Callstack for Thread 1 (Thread Id: 1372 (0x55c)): Index Function -------------------------------------------------------------------------------- 1 ntdll.dll!_NtWriteFile@36() 2 KernelBase.dll!_WriteFile@20() 3 msvcr100.dll!__write_nolock() 4 msvcr100.dll!__write() 5 msvcr100.dll!__flush() 6 msvcr100.dll!__fflush_nolock() 7 msvcr100.dll!_fflush() *8 couchjs.exe!couch_print(JSContext * cx=0x025874f0, unsigned int argc=1, unsigned __int64 * argv=0x026401a0) 9 couchjs.exe!print(JSContext * cx=0x025874f0, unsigned int argc=1, unsigned __int64 * vp=0x02640190) 10 mozjs185-1.0.dll!6decf09d() 11 [Frames below may be incorrect and/or missing, no symbols loaded for mozjs185-1.0.dll] 12 ntdll.dll!_RtlpHeapFindListLookupEntry@20() 13 ntdll.dll!_RtlpFindEntry@8() 14 0260b5c8() 15 ntdll.dll!_RtlpHeapFindListLookupEntry@20() 16 ntdll.dll!_RtlpFindEntry@8() 17 ntdll.dll!@RtlpFreeHeap@16() 18 ntdll.dll!_RtlpHeapAddListEntry@24() 19 ntdll.dll!@RtlpCreateSplitBlock@28() 20 ntdll.dll!@RtlpAllocateHeap@24() 21 ntdll.dll!_RtlAllocateHeap@12() Which is in util.c: couch_print(JSContext* cx, uintN argc, jsval* argv) { char *bytes = NULL; FILE *stream = stdout; if (argc) { if (argc > 1 && argv [1] == JSVAL_TRUE) { stream = stderr; } bytes = enc_string(cx, argv [0] , NULL); if(!bytes) return; fprintf(stream, "%s", bytes); JS_free(cx, bytes); } fputc('\n', stream); // dch: we never return from this fflush(stream); } COUCHSPAWNKILLABLE ### couchspawnkillable looks clean to me, its just waiting for the couchjs process to terminate: // Wait for the process to terminate so we can reflect the exit code // back to couch. WaitForSingleObject(pi.hProcess, INFINITE); if (!GetExitCodeProcess(pi.hProcess, &exitcode)) return 6; return exitcode; }
          Hide
          Jan Lehnardt added a comment -

          Dave, excellent findings. Fancy trying this patch?

          diff --git a/src/couchdb/couch_os_process.erl b/src/couchdb/couch_os_process.erl
          index 3a267be..13fa7f4 100644
          — a/src/couchdb/couch_os_process.erl
          +++ b/src/couchdb/couch_os_process.erl
          @@ -94,7 +94,7 @@ drop_port_messages(Port) ->
          % Utility functions for reading and writing
          % in custom functions
          writeline(OsProc, Data) when is_record(OsProc, os_proc) ->

          • port_command(OsProc#os_proc.port, [Data, $\n]).
            + port_command(OsProc#os_proc.port, [Data, "\r\n"]).
          Show
          Jan Lehnardt added a comment - Dave, excellent findings. Fancy trying this patch? diff --git a/src/couchdb/couch_os_process.erl b/src/couchdb/couch_os_process.erl index 3a267be..13fa7f4 100644 — a/src/couchdb/couch_os_process.erl +++ b/src/couchdb/couch_os_process.erl @@ -94,7 +94,7 @@ drop_port_messages(Port) -> % Utility functions for reading and writing % in custom functions writeline(OsProc, Data) when is_record(OsProc, os_proc) -> port_command(OsProc#os_proc.port, [Data, $\n] ). + port_command(OsProc#os_proc.port, [Data, "\r\n"] ).
          Hide
          Paul Joseph Davis added a comment -

          The code snippet you provided you marked fputc as the call we're blocked on but the traceback makes me think its the line after that with the fflush call.

          Still, it looks as though we're blocked trying to flush out that buffer from Windows and assuming Erlang is just sitting around waiting for a message then I could see this being a classic dead lock (depending on how Erlang does its line buffering). I'll try and find the Erlang internals that manage this bit of code to see what might be possible.

          Show
          Paul Joseph Davis added a comment - The code snippet you provided you marked fputc as the call we're blocked on but the traceback makes me think its the line after that with the fflush call. Still, it looks as though we're blocked trying to flush out that buffer from Windows and assuming Erlang is just sitting around waiting for a message then I could see this being a classic dead lock (depending on how Erlang does its line buffering). I'll try and find the Erlang internals that manage this bit of code to see what might be possible.
          Hide
          Paul Joseph Davis added a comment -

          Dave Cottlehuber While I read through code here, one of the things you might want to try is changing that

          {line, 4096}

          option to

          {line, 1024}

          or even something really small like

          {line, 64}

          to see if my theory on mismatched buffer sizes is anywhere near the mark.

          Show
          Paul Joseph Davis added a comment - Dave Cottlehuber While I read through code here, one of the things you might want to try is changing that {line, 4096} option to {line, 1024} or even something really small like {line, 64} to see if my theory on mismatched buffer sizes is anywhere near the mark.
          Hide
          Dave Cottlehuber added a comment -

          I've done a fair few runs now and we are blocking on return from fputc every time. Internally the NT API looks like it is unable to flush data to the pipe, and that's why we end up blocking. Not really sure how to peer inside this further.

          couch_config:set("query_server_config", "os_process_limit", "1").
          and this patch: https://www.friendpaste.com/5vB0pIKtXXRM0eij296ibX which includes sending \r\n, logging all port traffic as LOG_INFO, and reduced buffer size.

          Reporting back;

          https://www.friendpaste.com/2KfwQCQ8kFhqY3ni1hqVf2

          <lots_of_data_above>

          [info] [<0.4581.0>] OS Process #Port<0.103763> Output :: ["chunks",["\u000a<li>Key: 9 Value: 9 LineNo: 10</li>"]]
          [info] [<0.4581.0>] OS Process #Port<0.103763> Input :: ["list_end"]
          [info] [<0.4581.0>] OS Process #Port<0.103763> Output :: ["end",["</ul><p>FirstKey: 1 LastKey: 9</p>"]]
          [info] [<0.4581.0>] OS Process #Port<0.103763> Input :: ["reset",

          {"reduce_limit":true,"timeout":5000}

          ]
          [info] [<0.4581.0>] OS Process #Port<0.103763> Output :: true

          1>
          =INFO REPORT==== 7-Dec-2012::01:35:38 ===
          alarm_handler:

          {clear,system_memory_high_watermark}

          1> erlang:process_info(pid(0,4581,0)).
          [{current_function,{gen_server,loop,6}},
          {initial_call,{proc_lib,init_p,5}},

          {status,waiting}

          ,

          {message_queue_len,0}

          ,

          {messages,[]}

          ,

          {links,[#Port<0.103763>,<0.4548.0>]}

          ,
          {dictionary,[

          {'$ancestors',[couch_query_servers, couch_secondary_services,couch_server_sup,<0.4441.0>]}

          ,
          {'$initial_call',{couch_os_process,init,1}}]},

          {trap_exit,false}

          ,

          {error_handler,error_handler}

          ,

          {priority,normal}

          ,

          {group_leader,<0.4440.0>}

          ,

          {total_heap_size,17711}

          ,

          {heap_size,6765}

          ,

          {stack_size,9}

          ,

          {reductions,160580}

          ,
          {garbage_collection,[

          {min_bin_vheap_size,46368}

          ,

          {min_heap_size,233}

          ,

          {fullsweep_after,65535}

          ,

          {minor_gcs,73}

          ]},

          {suspending,[]}

          ]
          2>
          =INFO REPORT==== 7-Dec-2012::01:36:38 ===
          alarm_handler: {set,{system_memory_high_watermark,[]}}

              1. couchjs.c ###

          same place again:

          >~*k
          Callstack for Thread 1 (Thread Id: 716 (0x2cc)):
          Index Function
          --------------------------------------------------------------------------------
          1 ntdll.dll!_NtReadFile@36()
          2 KernelBase.dll!_ReadFile@20()
          3 msvcr100.dll!__read_nolock()
          4 msvcr100.dll!__read()
          5 msvcr100.dll!__filbuf()
          6 msvcr100.dll!_getc()
          *7 couchjs.exe!couch_readline(JSContext * cx=0x00b482c0, _iobuf * fp=0x73a33008)
          8 couchjs.exe!readline(JSContext * cx=0x00b482c0, unsigned int argc=0, unsigned __int64 * vp=0x02690118)
          9 mozjs185-1.0.dll!6fa5f09d()
          10 [Frames below may be incorrect and/or missing, no symbols loaded for mozjs185-1.0.dll]
          11 mozjs185-1.0.dll!JS_CompareValues() + 7967 bytes
          12 mozjs185-1.0.dll!JS_GetScopeChain() + 5110 bytes
          13 mozjs185-1.0.dll!JS_ExecuteScript() + 34 bytes
          14 couchjs.exe!main(int argc=2, const char * * argv=0x02681948)
          15 couchjs.exe!__tmainCRTStartup()
          16 kernel32.dll!@BaseThreadInitThunk@12()
          17 ntdll.dll!___RtlUserThreadStart@8()
          18 ntdll.dll!__RtlUserThreadStart@8()

          The Erlang Port is still up & running:

          2> erlang:ports().
          [#Port<0.98509>,#Port<0.98518>,#Port<0.103763>,
          #Port<0.103903>,#Port<0.103922>,#Port<0.103924>,
          #Port<0.103926>,#Port<0.100936>,#Port<0.102133>,
          #Port<0.102135>,#Port<0.100182>,#Port<0.101219>,
          #Port<0.98172>,#Port<0.98217>,#Port<0.101331>,
          #Port<0.102391>]
          3> [ ,, Port | _] = erlang:ports().
          [#Port<0.98509>,#Port<0.98518>,#Port<0.103763>,
          #Port<0.103903>,#Port<0.103922>,#Port<0.103924>,
          #Port<0.103926>,#Port<0.100936>,#Port<0.102133>,
          #Port<0.102135>,#Port<0.100182>,#Port<0.101219>,
          #Port<0.98172>,#Port<0.98217>,#Port<0.101331>,
          #Port<0.102391>]
          4> Port.
          #Port<0.103763>
          5> erlang:port
          port_call/2 port_call/3 port_close/1 port_command/2
          port_command/3 port_connect/2 port_control/3 port_get_data/1
          port_info/1 port_info/2 port_set_data/2 port_to_list/1
          ports/0
          5> erlang:port_info(Port).
          [

          {name,"c:/werl/OTP_SR~1/release/win32/lib/couch-1.3.0a-e64bbec-git/priv/couchspawnkillable ./couchjs.exe ../share/couchdb/server/main.js"}

          ,

          {links,[<0.4581.0>]}

          ,

          {id,103763}

          ,

          {connected,<0.4581.0>}

          ,

          {input,72311}

          ,

          {output,77607}

          ,

          {os_pid,2364}

          ]
          6>

          The data still in the buffer is 4106 bytes:

          y;\n send('
          n<li>Key: 'row.key\n +' Value: '+row.value\n +' LineNo: '+row_number'</li>');\n }\n return '</ul><p>FirstKey: '+ firstKey + ' LastKey: '+ prevKey+'</p>';\n })","acceptSwitch":"(function (head, req) {\n // respondWith takes care of setting the proper headers\n provides(\"html\", function() {\n send(\"HTML <ul>\");\n\n var row, num = 0;\n while (row = getRow())

          {\n num ++;\n send('\\n<li>Key: '\n +row.key+' Value: '+row.value\n +' LineNo: '+num+'</li>');\n }

          \n\n // tail\n return '</ul>';\n });\n\n provides(\"xml\", function() {\n send('<feed xmlns=\"http://www.w3.org/2005/Atom\">'\n +'<title>Test XML Feed</title>');\n\n while (row = getRow())

          {\n var entry = new XML('<entry/>');\n entry.id = row.id;\n entry.title = row.key;\n entry.content = row.value;\n send(entry);\n }

          \n return \"</feed>\";\n });\n })","qsParams":"(function (head, req)

          {\n return toJSON(req.query) + \"\\n\";\n }

          )","stopIter":"(function (req) {\n send(\"head\");\n var row, row_number = 0;\n while(row = getRow())

          {\n if(row_number > 2) break;\n send(\" \" + row_number);\n row_number += 1;\n }

          ;\n return \" tail\";\n })","stopIter2":"(function (head, req) {\n provides(\"html\", function() {\n send(\"head\");\n var row, row_number = 0;\n while(row = getRow())

          {\n if(row_number > 2) break;\n send(\" \" + row_number);\n row_number += 1;\n }

          ;\n return \" tail\";\n });\n })","tooManyGetRows":"(function () {\n send(\"head\");\n var row;\n while(row = getRow())

          {\n send(row.key);\n }

          ;\n getRow();\n getRow();\n getRow();\n row = getRow();\n return \"after row: \"+toJSON(row);\n })","emptyList":"(function ()

          {\n return \" \";\n }

          )","rowError":"(function (head, req)

          {\n send(\"head\");\n var row = getRow();\n send(fooBarBam); // intentional error\n return \"tail\";\n }

          )","docReference":"(function (head, req)

          {\n send(\"head\");\n var row = getRow();\n send(row.doc.integer);\n return \"tail\";\n }

          )","secObj":"(function (head, req)

          {\n return toJSON(req.secObj);\n }

          )","setHeaderAfterGotRow":"(function (head, req) {\n getRow();\ðñÿÿ

          The last few characters are weird; in hex that is the next 4k buffer beginning: C3 B0 C3 B1 C3 BF C3 BF 01 0A.

              1. tomorrow ###

          try and remove the couch* code from this and just make a minimal erlang + port wrapper around couchjs.

          ideas welcomed!

          Show
          Dave Cottlehuber added a comment - I've done a fair few runs now and we are blocking on return from fputc every time. Internally the NT API looks like it is unable to flush data to the pipe, and that's why we end up blocking. Not really sure how to peer inside this further. couch_config:set("query_server_config", "os_process_limit", "1"). and this patch: https://www.friendpaste.com/5vB0pIKtXXRM0eij296ibX which includes sending \r\n, logging all port traffic as LOG_INFO, and reduced buffer size. Reporting back; I switched buffer size to 1024 no difference. Note that 4095 was already set in 1.2.0 https://github.com/apache/couchdb/commit/2a2f488b002f379f23d9ec9f64ed4170178b7227 . ditto for reducing it down to 64 even. this time I can get the erlang shell running too, and the port & gen_server wrapper info follow: https://www.friendpaste.com/2KfwQCQ8kFhqY3ni1hqVf2 <lots_of_data_above> [info] [<0.4581.0>] OS Process #Port<0.103763> Output :: ["chunks", ["\u000a<li>Key: 9 Value: 9 LineNo: 10</li>"] ] [info] [<0.4581.0>] OS Process #Port<0.103763> Input :: ["list_end"] [info] [<0.4581.0>] OS Process #Port<0.103763> Output :: ["end", ["</ul><p>FirstKey: 1 LastKey: 9</p>"] ] [info] [<0.4581.0>] OS Process #Port<0.103763> Input :: ["reset", {"reduce_limit":true,"timeout":5000} ] [info] [<0.4581.0>] OS Process #Port<0.103763> Output :: true 1> =INFO REPORT==== 7-Dec-2012::01:35:38 === alarm_handler: {clear,system_memory_high_watermark} 1> erlang:process_info(pid(0,4581,0)). [{current_function,{gen_server,loop,6}}, {initial_call,{proc_lib,init_p,5}}, {status,waiting} , {message_queue_len,0} , {messages,[]} , {links,[#Port<0.103763>,<0.4548.0>]} , {dictionary,[ {'$ancestors',[couch_query_servers, couch_secondary_services,couch_server_sup,<0.4441.0>]} , {'$initial_call',{couch_os_process,init,1}}]}, {trap_exit,false} , {error_handler,error_handler} , {priority,normal} , {group_leader,<0.4440.0>} , {total_heap_size,17711} , {heap_size,6765} , {stack_size,9} , {reductions,160580} , {garbage_collection,[ {min_bin_vheap_size,46368} , {min_heap_size,233} , {fullsweep_after,65535} , {minor_gcs,73} ]}, {suspending,[]} ] 2> =INFO REPORT==== 7-Dec-2012::01:36:38 === alarm_handler: {set,{system_memory_high_watermark,[]}} couchjs.c ### same place again: >~*k Callstack for Thread 1 (Thread Id: 716 (0x2cc)): Index Function -------------------------------------------------------------------------------- 1 ntdll.dll!_NtReadFile@36() 2 KernelBase.dll!_ReadFile@20() 3 msvcr100.dll!__read_nolock() 4 msvcr100.dll!__read() 5 msvcr100.dll!__filbuf() 6 msvcr100.dll!_getc() *7 couchjs.exe!couch_readline(JSContext * cx=0x00b482c0, _iobuf * fp=0x73a33008) 8 couchjs.exe!readline(JSContext * cx=0x00b482c0, unsigned int argc=0, unsigned __int64 * vp=0x02690118) 9 mozjs185-1.0.dll!6fa5f09d() 10 [Frames below may be incorrect and/or missing, no symbols loaded for mozjs185-1.0.dll] 11 mozjs185-1.0.dll!JS_CompareValues() + 7967 bytes 12 mozjs185-1.0.dll!JS_GetScopeChain() + 5110 bytes 13 mozjs185-1.0.dll!JS_ExecuteScript() + 34 bytes 14 couchjs.exe!main(int argc=2, const char * * argv=0x02681948) 15 couchjs.exe!__tmainCRTStartup() 16 kernel32.dll!@BaseThreadInitThunk@12() 17 ntdll.dll!___RtlUserThreadStart@8() 18 ntdll.dll!__RtlUserThreadStart@8() The Erlang Port is still up & running: 2> erlang:ports(). [#Port<0.98509>,#Port<0.98518>,#Port<0.103763>, #Port<0.103903>,#Port<0.103922>,#Port<0.103924>, #Port<0.103926>,#Port<0.100936>,#Port<0.102133>, #Port<0.102135>,#Port<0.100182>,#Port<0.101219>, #Port<0.98172>,#Port<0.98217>,#Port<0.101331>, #Port<0.102391>] 3> [ , , Port | _] = erlang:ports(). [#Port<0.98509>,#Port<0.98518>,#Port<0.103763>, #Port<0.103903>,#Port<0.103922>,#Port<0.103924>, #Port<0.103926>,#Port<0.100936>,#Port<0.102133>, #Port<0.102135>,#Port<0.100182>,#Port<0.101219>, #Port<0.98172>,#Port<0.98217>,#Port<0.101331>, #Port<0.102391>] 4> Port. #Port<0.103763> 5> erlang:port port_call/2 port_call/3 port_close/1 port_command/2 port_command/3 port_connect/2 port_control/3 port_get_data/1 port_info/1 port_info/2 port_set_data/2 port_to_list/1 ports/0 5> erlang:port_info(Port). [ {name,"c:/werl/OTP_SR~1/release/win32/lib/couch-1.3.0a-e64bbec-git/priv/couchspawnkillable ./couchjs.exe ../share/couchdb/server/main.js"} , {links,[<0.4581.0>]} , {id,103763} , {connected,<0.4581.0>} , {input,72311} , {output,77607} , {os_pid,2364} ] 6> The data still in the buffer is 4106 bytes: y;\n send(' n<li>Key: ' row.key\n +' Value: '+row.value\n +' LineNo: '+row_number '</li>');\n }\n return '</ul><p>FirstKey: '+ firstKey + ' LastKey: '+ prevKey+'</p>';\n })","acceptSwitch":"(function (head, req) {\n // respondWith takes care of setting the proper headers\n provides(\"html\", function() {\n send(\"HTML <ul>\");\n\n var row, num = 0;\n while (row = getRow()) {\n num ++;\n send('\\n<li>Key: '\n +row.key+' Value: '+row.value\n +' LineNo: '+num+'</li>');\n } \n\n // tail\n return '</ul>';\n });\n\n provides(\"xml\", function() {\n send('<feed xmlns=\"http://www.w3.org/2005/Atom\">'\n +'<title>Test XML Feed</title>');\n\n while (row = getRow()) {\n var entry = new XML('<entry/>');\n entry.id = row.id;\n entry.title = row.key;\n entry.content = row.value;\n send(entry);\n } \n return \"</feed>\";\n });\n })","qsParams":"(function (head, req) {\n return toJSON(req.query) + \"\\n\";\n } )","stopIter":"(function (req) {\n send(\"head\");\n var row, row_number = 0;\n while(row = getRow()) {\n if(row_number > 2) break;\n send(\" \" + row_number);\n row_number += 1;\n } ;\n return \" tail\";\n })","stopIter2":"(function (head, req) {\n provides(\"html\", function() {\n send(\"head\");\n var row, row_number = 0;\n while(row = getRow()) {\n if(row_number > 2) break;\n send(\" \" + row_number);\n row_number += 1;\n } ;\n return \" tail\";\n });\n })","tooManyGetRows":"(function () {\n send(\"head\");\n var row;\n while(row = getRow()) {\n send(row.key);\n } ;\n getRow();\n getRow();\n getRow();\n row = getRow();\n return \"after row: \"+toJSON(row);\n })","emptyList":"(function () {\n return \" \";\n } )","rowError":"(function (head, req) {\n send(\"head\");\n var row = getRow();\n send(fooBarBam); // intentional error\n return \"tail\";\n } )","docReference":"(function (head, req) {\n send(\"head\");\n var row = getRow();\n send(row.doc.integer);\n return \"tail\";\n } )","secObj":"(function (head, req) {\n return toJSON(req.secObj);\n } )","setHeaderAfterGotRow":"(function (head, req) {\n getRow();\ðñÿÿ The last few characters are weird; in hex that is the next 4k buffer beginning: C3 B0 C3 B1 C3 BF C3 BF 01 0A. tomorrow ### try and remove the couch* code from this and just make a minimal erlang + port wrapper around couchjs. ideas welcomed!
          Show
          Dave Cottlehuber added a comment - notes for dch ### There are a lot of changes in emulator's sys.c https://github.com/erlang/otp/commit/3985e56bfa82083b23978e06809b8ab0cf9d52d8 https://github.com/erlang/otp/commit/e11bbb37273ebd2408d0c83c62770f9ef023879d https://github.com/erlang/otp/commit/fed20c731b91f1debb11a809cc5999d8b04dd293 https://github.com/erlang/otp/commit/2a400b88a44ea35a992b3b46acb84a632bd7d7e9 and some variations on how these are handled (via pierre) elsewhere, who has seen similar issues: https://github.com/ghc/packages-process/blob/master/cbits/runProcess.c https://github.com/php/php-src/commit/03b39896a6df71fb0ccf379a9148eba910f6eb84 https://github.com/php/php-src/commit/49666eece6b8d5de74f1bae2537e8cc3919203ee https://github.com/php/php-src/commit/1cd82c9db23749b692b35fccd2ba17ee9d1f60c2 https://bugs.php.net/bug.php?id=51800 https://bugs.php.net/bug.php?id=44942 and how couch does it: https://github.com/apache/couchdb/blob/1.3.x/src/couchdb/priv/spawnkillable/couchspawnkillable_win.c reverting some changes e.g. https://github.com/apache/couchdb/commit/a851c6e5 has no impact.
          Hide
          Dave Cottlehuber added a comment -

          http://support.microsoft.com/kb/315939 wat createprocess not threadsafe.

          Show
          Dave Cottlehuber added a comment - http://support.microsoft.com/kb/315939 wat createprocess not threadsafe.
          Hide
          Dave Cottlehuber added a comment -
              1. Notes on git bisection ###
          • the 1.3.x build tree expects static SSL and ICU 4.6.1.
          • cherry pick e3651 and 1054f respectively from master to cure versions that don't have these patches included.
          • start bisecting at 1054f (COUCHDB-1152 remove icu version dependency)
          Show
          Dave Cottlehuber added a comment - Notes on git bisection ### the 1.3.x build tree expects static SSL and ICU 4.6.1. cherry pick e3651 and 1054f respectively from master to cure versions that don't have these patches included. start bisecting at 1054f ( COUCHDB-1152 remove icu version dependency)
          Hide
          Jan Lehnardt added a comment -

          another one to cherry-pick:

          commit fb670f5712c08c90b5bd93028c7df39c18fb9656
          Author: Dave Cottlehuber <dch@apache.org>
          Date: Sat Dec 1 22:32:52 2012 +0100

          Make dependency on unistd.h specific to HAVE_CURL instead in the whole file

          • introduced in COUCHDB-1338
          • allows building cleanly on Windows without cURL
          Show
          Jan Lehnardt added a comment - another one to cherry-pick: commit fb670f5712c08c90b5bd93028c7df39c18fb9656 Author: Dave Cottlehuber <dch@apache.org> Date: Sat Dec 1 22:32:52 2012 +0100 Make dependency on unistd.h specific to HAVE_CURL instead in the whole file introduced in COUCHDB-1338 allows building cleanly on Windows without cURL
          Hide
          Dave Cottlehuber added a comment -

          more cherries: 'e3651ec5' resolves ln: accessing `/cygdrive/d/werl/otp_src_R14B04/release/win32/bin/couchjs': Too many levels of symbolic links

          Show
          Dave Cottlehuber added a comment - more cherries: 'e3651ec5' resolves ln: accessing `/cygdrive/d/werl/otp_src_R14B04/release/win32/bin/couchjs': Too many levels of symbolic links
          Hide
          Jan Lehnardt added a comment -

          There is always the possibility that I got something wrong, but three elaborate bisect sessions lead me to:

          drumroll

          a851c6e5150d14221ca018587d76214856c1555a is the first bad commit
          commit a851c6e5150d14221ca018587d76214856c1555a
          Author: Filipe David Borba Manana <fdmanana@apache.org>
          Date: Sun Nov 6 14:25:04 2011 +0000

          More efficient communication with the view server

          This change makes the communication between the Erlang VM and
          an external view server (couchjs for e.g.) more efficient by
          writing a series of commands into the port and reading all the
          responses from the external view server after doing all those
          writes. This minimizes the amount of time each endpoint spends
          blocked reading from the port.

          COUCHDB-1334

          :040000 040000 608c3712f9544111205f952d9c88456f41c1c9ed 13cbf3c017ab73fba15a4278ded73285ef0215aa M src

          Show
          Jan Lehnardt added a comment - There is always the possibility that I got something wrong, but three elaborate bisect sessions lead me to: drumroll a851c6e5150d14221ca018587d76214856c1555a is the first bad commit commit a851c6e5150d14221ca018587d76214856c1555a Author: Filipe David Borba Manana <fdmanana@apache.org> Date: Sun Nov 6 14:25:04 2011 +0000 More efficient communication with the view server This change makes the communication between the Erlang VM and an external view server (couchjs for e.g.) more efficient by writing a series of commands into the port and reading all the responses from the external view server after doing all those writes. This minimizes the amount of time each endpoint spends blocked reading from the port. COUCHDB-1334 :040000 040000 608c3712f9544111205f952d9c88456f41c1c9ed 13cbf3c017ab73fba15a4278ded73285ef0215aa M src
          Hide
          Jan Lehnardt added a comment -

          And as a point of confirmation, master with the patch reverted works as expected.

          Our options:

          A)

          • revert and release 1.3.0
          • understand the root cause and work around it if possible and re-land in 1.4 or later.

          B)

          • revert and reintroduce with a OS-switch that disables the code on Win32.
          • understand the root cause and work around it if possible and re-enable in Win32 in 1.4 or later.

          Given that we had zero reports of random lockups with 1.3.0 yet (would love to hear from Jason who has 1.3.x running on IrisCouch for a few weeks), I currently opt for B).

          People who should look at this:

          • Jason
          • Paul
          • Dave
          • Anyone else who is interested.
          Show
          Jan Lehnardt added a comment - And as a point of confirmation, master with the patch reverted works as expected. Our options: A) revert and release 1.3.0 understand the root cause and work around it if possible and re-land in 1.4 or later. B) revert and reintroduce with a OS-switch that disables the code on Win32. understand the root cause and work around it if possible and re-enable in Win32 in 1.4 or later. Given that we had zero reports of random lockups with 1.3.0 yet (would love to hear from Jason who has 1.3.x running on IrisCouch for a few weeks), I currently opt for B). People who should look at this: Jason Paul Dave Anyone else who is interested.
          Jan Lehnardt made changes -
          Skill Level Committers Level (Medium to Hard) [ 10446 ] Guru Level (Everyone buy this person a beer at the next conference!) [ 10461 ]
          Hide
          Dave Cottlehuber added a comment -

          After a long night of bisectus interruptus, I get the same result:

          erl@werl /relax/couchdb
          $ git bisect good
          a851c6e5150d14221ca018587d76214856c1555a is the first bad commit
          commit a851c6e5150d14221ca018587d76214856c1555a
          Author: Filipe David Borba Manana <fdmanana@apache.org>
          Date: Sun Nov 6 14:25:04 2011 +0000

          More efficient communication with the view server

          This change makes the communication between the Erlang VM and
          an external view server (couchjs for e.g.) more efficient by
          writing a series of commands into the port and reading all the
          responses from the external view server after doing all those
          writes. This minimizes the amount of time each endpoint spends
          blocked reading from the port.

          COUCHDB-1334

          & doing a clean build from 1.3.x just reverting this patch works on both my build boxes (this time :/).

          +0 on A in general given that many of us have been using 1.3.x actively for months on other platforms without issue.

          +1 for B.

          Show
          Dave Cottlehuber added a comment - After a long night of bisectus interruptus, I get the same result: erl@werl /relax/couchdb $ git bisect good a851c6e5150d14221ca018587d76214856c1555a is the first bad commit commit a851c6e5150d14221ca018587d76214856c1555a Author: Filipe David Borba Manana <fdmanana@apache.org> Date: Sun Nov 6 14:25:04 2011 +0000 More efficient communication with the view server This change makes the communication between the Erlang VM and an external view server (couchjs for e.g.) more efficient by writing a series of commands into the port and reading all the responses from the external view server after doing all those writes. This minimizes the amount of time each endpoint spends blocked reading from the port. COUCHDB-1334 & doing a clean build from 1.3.x just reverting this patch works on both my build boxes (this time :/). +0 on A in general given that many of us have been using 1.3.x actively for months on other platforms without issue. +1 for B.
          Hide
          Dave Cottlehuber added a comment -

          OK, closing this as we have resolved the nuke_dir situation now, and further discussion on the COUCHDB-1334 issue should take place on the re-opened ticket. Thanks everybody who has helped on this.

          Adam's patch was applied to master + 1.3.x for the underlying issue; COUCHDB-1334 remains in place on master but reverted for 1.3.x.

          Show
          Dave Cottlehuber added a comment - OK, closing this as we have resolved the nuke_dir situation now, and further discussion on the COUCHDB-1334 issue should take place on the re-opened ticket. Thanks everybody who has helped on this. Adam's patch was applied to master + 1.3.x for the underlying issue; COUCHDB-1334 remains in place on master but reverted for 1.3.x.
          Dave Cottlehuber made changes -
          Link This issue is broken by COUCHDB-1334 [ COUCHDB-1334 ]
          Dave Cottlehuber made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          ASF subversion and git services added a comment -

          Commit 6ffc52796182a8d3673f3648c59c7e9abddf4fa2 in branch refs/heads/1334-revert-feature-view-server-pipelining from Dave Cottlehuber
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=6ffc527 ]

          COUCHDB-1334 - revert "More efficient communication with the view server"

          This reverts commit a851c6e

          Show
          ASF subversion and git services added a comment - Commit 6ffc52796182a8d3673f3648c59c7e9abddf4fa2 in branch refs/heads/1334-revert-feature-view-server-pipelining from Dave Cottlehuber [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=6ffc527 ] COUCHDB-1334 - revert "More efficient communication with the view server" This reverts commit a851c6e COUCHDB-1334 breaks with Windows + couchjs in unexplained ways reducing to 1 concurrent query server is not sufficient Testing with open_port options overlapped_io was not in itself sufficient http://erlang.org/doc/man/erlang.html find overlapped_io Refer history in COUCHDB-1346
          Hide
          ASF subversion and git services added a comment -

          Commit 6ffc52796182a8d3673f3648c59c7e9abddf4fa2 in branch refs/heads/master from Dave Cottlehuber
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=6ffc527 ]

          COUCHDB-1334 - revert "More efficient communication with the view server"

          This reverts commit a851c6e

          Show
          ASF subversion and git services added a comment - Commit 6ffc52796182a8d3673f3648c59c7e9abddf4fa2 in branch refs/heads/master from Dave Cottlehuber [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=6ffc527 ] COUCHDB-1334 - revert "More efficient communication with the view server" This reverts commit a851c6e COUCHDB-1334 breaks with Windows + couchjs in unexplained ways reducing to 1 concurrent query server is not sufficient Testing with open_port options overlapped_io was not in itself sufficient http://erlang.org/doc/man/erlang.html find overlapped_io Refer history in COUCHDB-1346

            People

            • Assignee:
              Adam Kocoloski
              Reporter:
              Dave Cottlehuber
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development