
JS: dev/run timing out starting up nodes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Test Suite
    • Labels:
      None

      Description

      I have seen this several times recently on the Jenkins infrastructure. It may be a symptom of ASF Jenkins nodes being overloaded.

      make[1]: Entering directory `/usr/src/couchdb/apache-couchdb-2.0.0-28dd801'
      # This might help with emfile errors during `make javascript`: ulimit -n 10240
      Failed to start all the nodes. Check the dev/logs/*.log for errors.
      make[1]: *** [javascript] Error 1
      make[1]: Leaving directory `/usr/src/couchdb/apache-couchdb-2.0.0-28dd801'
      make: *** [check] Error 2
      

        Activity

        wohali Joan Touzet added a comment - - edited

        There have been many more instances of this, and I finally reproduced it locally. We have a crash in mem3:

        [info] 2017-04-30T03:16:19.884644Z node1@127.0.0.1 <0.210.0> -------- Apache CouchDB has started on http://127.0.0.1:15986/
        [info] 2017-04-30T03:16:19.884870Z node1@127.0.0.1 <0.7.0> -------- Application couch started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:16:19.885045Z node1@127.0.0.1 <0.7.0> -------- Application ets_lru started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:16:19.886104Z node1@127.0.0.1 <0.7.0> -------- Application rexi started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:16:19.887183Z node1@127.0.0.1 <0.216.0> -------- open_result error {not_found,no_db_file} for _nodes
        [info] 2017-04-30T03:16:20.034961Z node1@127.0.0.1 <0.216.0> -------- open_result error {not_found,no_db_file} for _dbs
        [error] 2017-04-30T03:16:20.036030Z node1@127.0.0.1 <0.283.0> -------- Supervisor mem3_sup had child mem3_shards started with mem3_shards:start_link() at undefined exit with reason no match of right hand value file_exists at mem3_shards:get_update_seq/0(line:318) <= mem3_shards:init/1(line:206) <= gen_server:init_it/6(line:306) <= proc_lib:init_p_do_apply/3(line:237) in context start_error
        [error] 2017-04-30T03:16:20.036596Z node1@127.0.0.1 <0.298.0> -------- CRASH REPORT Process  (<0.298.0>) with 0 neighbors exited with reason: no match of right hand value file_exists at mem3_shards:get_update_seq/0(line:318) <= mem3_shards:init/1(line:206) <= gen_server:init_it/6(line:306) <= proc_lib:init_p_do_apply/3(line:237) at gen_server:init_it/6(line:330) <= proc_lib:init_p_do_apply/3(line:237); initial_call: {mem3_shards,init,['Argument__1']}, ancestors: [mem3_sup,<0.282.0>], messages: [], links: [<0.283.0>], dictionary: [], trap_exit: false, status: running, heap_size: 610, stack_size: 27, reductions: 374
        {"init terminating in do_boot",{{error,{{shutdown,{failed_to_start_child,mem3_shards,{{badmatch,file_exists},[{mem3_shards,get_update_seq,0,[{file,"src/mem3_shards.erl"},{line,318}]},{mem3_shards,init,1,[{file,"src/mem3_shards.erl"},{line,206}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,306}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]}}},{mem3_app,start,[normal,[]]}}},[{boot_node,start_app,3,[{file,"dev/boot_node.erl"},{line,134}]},{lists,foldl,3,[{file,"lists.erl"},{line,1261}]},{boot_node,start_app,3,[{file,"dev/boot_node.erl"},{line,124}]},{lists,foldl,3,[{file,"lists.erl"},{line,1261}]},{boot_node,start_app,3,[{file,"dev/boot_node.erl"},{line,124}]},{lists,foldl,3,[{file,"lists.erl"},{line,1261}]},{boot_node,start_app,3,[{file,"dev/boot_node.erl"},{line,124}]},{lists,foldl,3,[{file,"lists.erl"},{line,1261}]}]}}
        [os_mon] memory supervisor port (memsup): Erlang has closed
        [os_mon] cpu supervisor port (cpu_sup): Erlang has closed

        Crash dump was written to: erl_crash.dump
        init terminating in do_boot ()
        

        At this point, the contents of dev/lib/node1/data are:

        $ ls -la dev/lib/node1/data/
        total 40
        drwxr-xr-x 3 joant joant 4096 Apr 29 23:27 ./
        drwxr-xr-x 4 joant joant 4096 Apr 29 23:27 ../
        -rw-r--r-- 1 joant joant  156 Apr 29 23:27 _dbs.couch
        drwxr-xr-x 2 joant joant 4096 Apr 29 23:27 .delete/
        -rw-r--r-- 1 joant joant 8370 Apr 29 23:27 _nodes.couch
        -rw-r--r-- 1 joant joant 8376 Apr 29 23:27 _users.couch
        

        Is this a race in startup? Nick Vatamaniuc, any idea?
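
The failing startup can be recognised by the `file_exists at mem3_shards:get_update_seq/0` text in the supervisor report above. A minimal sketch for finding affected node logs, assuming they live in `dev/logs/*.log` as the `make` output states; the function name `find_race_logs` is hypothetical and not part of CouchDB:

```shell
# find_race_logs [DIR]: print the names of *.log files under DIR (default:
# dev/logs) that contain the mem3 startup crash signature quoted in the log.
find_race_logs() {
    log_dir=${1:-dev/logs}
    # -l prints only matching file names; -F treats the pattern as a fixed
    # string, so the slash and colon need no escaping.
    grep -lF 'file_exists at mem3_shards:get_update_seq/0' "$log_dir"/*.log 2>/dev/null
}
```

Calling `find_race_logs` with no argument from the source tree root checks `dev/logs`; an empty result means no log contains the signature.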

        wohali Joan Touzet added a comment -

        Compare this with a successful startup on the same machine:

        [info] 2017-04-30T03:30:03.966197Z node1@127.0.0.1 <0.210.0> -------- Apache CouchDB has started on http://127.0.0.1:15986/
        [info] 2017-04-30T03:30:03.966474Z node1@127.0.0.1 <0.7.0> -------- Application couch started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:03.966574Z node1@127.0.0.1 <0.7.0> -------- Application ets_lru started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:03.968188Z node1@127.0.0.1 <0.216.0> -------- open_result error {not_found,no_db_file} for _nodes
        [info] 2017-04-30T03:30:03.968315Z node1@127.0.0.1 <0.7.0> -------- Application rexi started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:04.083650Z node1@127.0.0.1 <0.216.0> -------- open_result error {not_found,no_db_file} for _dbs
        [error] 2017-04-30T03:30:04.084361Z node1@127.0.0.1 emulator -------- Error in process <0.298.0> on node 'node1@127.0.0.1' with exit value: {{badmatch,file_exists},[{mem3_shards,fold,2,[{file,"src/mem3_shards.erl"},{line,159}]},{mem3_sync,initial_sync,1,[{file,"src/mem3_sync.erl"},{line,241}]}]}
        [info] 2017-04-30T03:30:04.143033Z node1@127.0.0.1 <0.7.0> -------- Application mem3 started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:04.143250Z node1@127.0.0.1 <0.7.0> -------- Application fabric started on node 'node1@127.0.0.1'
        [error] 2017-04-30T03:30:04.145509Z node1@127.0.0.1 emulator -------- Error in process <0.337.0> on node 'node1@127.0.0.1' with exit value: {database_does_not_exist,[{mem3_shards,load_shards_from_db,"_users",[{file,"src/mem3_shards.erl"},{line,397}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,372}]},{mem3_shards,load_shards_from_disk...
        [notice] 2017-04-30T03:30:04.145701Z node1@127.0.0.1 <0.336.0> -------- chttpd_auth_cache changes listener died database_does_not_exist at mem3_shards:load_shards_from_db/6(line:397) <= mem3_shards:load_shards_from_disk/1(line:372) <= mem3_shards:load_shards_from_disk/2(line:401) <= mem3_shards:for_docid/3(line:90) <= fabric_doc_open:go/3(line:38) <= chttpd_auth_cache:ensure_auth_ddoc_exists/2(line:187) <= chttpd_auth_cache:listen_for_changes/1(line:134)
        [info] 2017-04-30T03:30:04.152381Z node1@127.0.0.1 <0.7.0> -------- Application chttpd started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:04.154573Z node1@127.0.0.1 <0.7.0> -------- Application setup started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:04.154724Z node1@127.0.0.1 <0.7.0> -------- Application couch_peruser started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:04.157003Z node1@127.0.0.1 <0.216.0> -------- open_result error {not_found,no_db_file} for _replicator
        [notice] 2017-04-30T03:30:04.205537Z node1@127.0.0.1 <0.352.0> -------- creating replicator ddoc <<"_replicator">>
        [info] 2017-04-30T03:30:04.265330Z node1@127.0.0.1 <0.7.0> -------- Application couch_replicator started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:04.265686Z node1@127.0.0.1 <0.7.0> -------- Application bear started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:04.269805Z node1@127.0.0.1 <0.7.0> -------- Application global_changes started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:04.269898Z node1@127.0.0.1 <0.7.0> -------- Application couch_plugins started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:04.270440Z node1@127.0.0.1 <0.7.0> -------- Application runtime_tools started on node 'node1@127.0.0.1'
        [info] 2017-04-30T03:30:04.271069Z node1@127.0.0.1 <0.7.0> -------- Application ddoc_cache started on node 'node1@127.0.0.1'
        ...
        
        wohali Joan Touzet added a comment -

        With a little trickery, I managed to get stack traces showing where the collision happens.

        Path 1: gen_server:init_it/6 -> mem3_shards:init/1 -> mem3_shards:get_update_seq/0 -> couch_server:create/2
        Path 2: mem3_sync:initial_sync/1 -> mem3_shards:fold/2 -> couch_server:create/2

        It appears that Paths 1 and 2 start nearly simultaneously, and the bug occurs when Path 2's couch_server:create/2 completes before Path 1's. The file_exists error then bubbles back up to mem3_shards:get_update_seq/0, which does not expect it and fails with a badmatch.
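
The general remedy for this kind of race is for a creator to treat "already exists" as a normal outcome and fall back to using what the other path created. The following is only an analogy in shell, not the Erlang code in mem3: it assumes a plain file stands in for the `_dbs` database, and the function name `ensure_exists` is borrowed purely for illustration.

```shell
# ensure_exists PATH: create PATH if it is missing. If some other process
# created it first, report that instead of failing.
ensure_exists() {
    path=$1
    # With noclobber set, the redirection refuses to overwrite an existing
    # file, so only one of several racing callers can succeed in creating it.
    # The subshell keeps the option (and any redirection error) contained.
    if ( set -o noclobber; : > "$path" ) 2>/dev/null; then
        echo created
    elif [ -e "$path" ]; then
        # Lost the race: the file is there, so carry on and use it.
        echo existed
    else
        # Creation failed for some other reason (for example, no such directory).
        echo failed
        return 1
    fi
}
```

Both the winner and the loser of the race end up with a usable file; only a genuine failure is reported as an error.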

        jira-bot ASF subversion and git services added a comment -

        Commit 2689507fc0f4a4a3731df34d3634bb1bcd4afbc3 in couchdb's branch refs/heads/master from Joan Touzet
        [ https://gitbox.apache.org/repos/asf?p=couchdb.git;h=2689507 ]

        Fix error on race condition in mem3 startup

        During mem3 startup, 2 paths attempt to call `couch_server:create/2` on
        `_dbs`:

        ```
        gen_server:init_it/6
        -> mem3_shards:init/1
        -> mem3_shards:get_update_seq/0
        -> couch_server:create/2
        ```

        and

        ```
        mem3_sync:initial_sync/1
        -> mem3_shards:fold/2
        -> couch_server:create/2
        ```

        Normally, the first path completes before the second. If the second path
        finishes first, the first path fails because it does not expect a
        `file_exists` response.

        This patch simply retries mem3_util:ensure_exists/1 once if it gets back
        a `file_exists` response. Any failures past this point are not handled.

        Fixes COUCHDB-3402.

        wohali Joan Touzet added a comment -

        PR issued for this commit: https://github.com/apache/couchdb/pull/501

        jira-bot ASF subversion and git services added a comment -

        Commit 1aa48ef245a9f5d8d336d43070767363ef092ed1 in couchdb's branch refs/heads/3402-mem3-race from Joan Touzet
        [ https://gitbox.apache.org/repos/asf?p=couchdb.git;h=1aa48ef ]

        Fix error on race condition in mem3 startup

        During mem3 startup, 2 paths attempt to call `couch_server:create/2` on
        `_dbs`:

        ```
        gen_server:init_it/6
        -> mem3_shards:init/1
        -> mem3_shards:get_update_seq/0
        -> couch_server:create/2
        ```

        and

        ```
        mem3_sync:initial_sync/1
        -> mem3_shards:fold/2
        -> couch_server:create/2
        ```

        Normally, the first path completes before the second. If the second path
        finishes first, the first path fails because it does not expect a
        `file_exists` response.

        This patch makes `mem3_util:ensure_exists/1` more robust in the face of
        a race to create `_dbs`.

        Fixes COUCHDB-3402.

        Approved by @davisp and @iilyak

        jira-bot ASF subversion and git services added a comment -

        Commit 81ee7c5ac71e617a03e967b4fc5d0358f4ba9459 in couchdb's branch refs/heads/master from Joan Touzet
        [ https://gitbox.apache.org/repos/asf?p=couchdb.git;h=81ee7c5 ]

        Fix error on race condition in mem3 startup

        During mem3 startup, 2 paths attempt to call `couch_server:create/2` on
        `_dbs`:

        ```
        gen_server:init_it/6
        -> mem3_shards:init/1
        -> mem3_shards:get_update_seq/0
        -> couch_server:create/2
        ```

        and

        ```
        mem3_sync:initial_sync/1
        -> mem3_shards:fold/2
        -> couch_server:create/2
        ```

        Normally, the first path completes before the second. If the second path
        finishes first, the first path fails because it does not expect a
        `file_exists` response.

        This patch makes `mem3_util:ensure_exists/1` more robust in the face of
        a race to create `_dbs`.

        Fixes COUCHDB-3402.

        Approved by @davisp and @iilyak

        wohali Joan Touzet added a comment -

        After patching, I ran startup/shutdown in a loop 30x and was unable to reproduce the crash (it used to occur every 3-4 startups on my machine). Closing.
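
A repeated-run check of this kind can be sketched as a small shell helper. This is an illustration under assumptions, not the loop that was actually used: `count_failures` is a hypothetical name, and the command passed to it is assumed to start the nodes, shut them down again, and exit non-zero when startup fails.

```shell
# count_failures N CMD [ARGS...]: run CMD N times and print how many runs
# exited with a non-zero status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Output is discarded here; only the exit status is counted.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, `count_failures 30 ./start_and_stop.sh` would print the number of failed startups out of 30, where `start_and_stop.sh` is a hypothetical wrapper script around the dev cluster.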

        jira-bot ASF subversion and git services added a comment -

        Commit 81ee7c5ac71e617a03e967b4fc5d0358f4ba9459 in couchdb's branch refs/heads/COUCHDB-3298-optimize-writing-kv-nodes from Joan Touzet
        [ https://gitbox.apache.org/repos/asf?p=couchdb.git;h=81ee7c5 ]

        Fix error on race condition in mem3 startup

        During mem3 startup, 2 paths attempt to call `couch_server:create/2` on
        `_dbs`:

        ```
        gen_server:init_it/6
        -> mem3_shards:init/1
        -> mem3_shards:get_update_seq/0
        -> couch_server:create/2
        ```

        and

        ```
        mem3_sync:initial_sync/1
        -> mem3_shards:fold/2
        -> couch_server:create/2
        ```

        Normally, the first path completes before the second. If the second path
        finishes first, the first path fails because it does not expect a
        `file_exists` response.

        This patch makes `mem3_util:ensure_exists/1` more robust in the face of
        a race to create `_dbs`.

        Fixes COUCHDB-3402.

        Approved by @davisp and @iilyak

        jira-bot ASF subversion and git services added a comment -

        Commit 9875e8b30017300c6541dd2dfec4f2bd148bb968 in couchdb's branch refs/heads/2.1.x from Joan Touzet
        [ https://gitbox.apache.org/repos/asf?p=couchdb.git;h=9875e8b ]

        Fix error on race condition in mem3 startup

        During mem3 startup, 2 paths attempt to call `couch_server:create/2` on
        `_dbs`:

        ```
        gen_server:init_it/6
        -> mem3_shards:init/1
        -> mem3_shards:get_update_seq/0
        -> couch_server:create/2
        ```

        and

        ```
        mem3_sync:initial_sync/1
        -> mem3_shards:fold/2
        -> couch_server:create/2
        ```

        Normally, the first path completes before the second. If the second path
        finishes first, the first path fails because it does not expect a
        `file_exists` response.

        This patch makes `mem3_util:ensure_exists/1` more robust in the face of
        a race to create `_dbs`.

        Fixes COUCHDB-3402.

        Approved by @davisp and @iilyak


          People

          • Assignee:
            wohali Joan Touzet
            Reporter:
            wohali Joan Touzet