Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11
    • Fix Version/s: None
    • Component/s: Database Core
    • Labels:
      None
    • Skill Level:
      Regular Contributors Level (Easy to Medium)

      Description

      If I kick off 10 replication tasks in quick succession, occasionally one or two of the replication tasks will die and not be resumed. It seems that the stat tracking is a little buggy, and under stress can eventually cause a permanent failure of the supervised replication task;

[Fri, 11 Dec 2009 19:00:08 GMT] [error] [<0.80.0>] {error_report,<0.30.0>,
    {<0.80.0>,supervisor_report,
     [{supervisor,{local,couch_rep_sup}},
      {errorContext,shutdown_error},
      {reason,killed},
      {offender,
       [{pid,<0.6700.11>},
        {name,"fcbb13200a1618cf983b347f4d2c9835+create_target"},
        {mfa,
         {gen_server,start_link,
          [couch_rep,
           ["fcbb13200a1618cf983b347f4d2c9835",
            {[{<<"create_target">>,true},
              {<<"source">>,<<"http://node:5984/perf-p2">>},
              {<<"target">>,<<"perf-p2">>}]},
            {user_ctx,null,[<<"_admin">>]}],
           []]}},
        {restart_type,temporary},
        {shutdown,1},
        {child_type,worker}]}]}}

[Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process <0.6705.11> with exit value: {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}

      1. 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch
        4 kB
        Randall Leeds
      2. 0001-Cleanup-597-fixes.patch
        3 kB
        Randall Leeds
      3. 597_fixes.patch
        2 kB
        Randall Leeds
      4. couchdb_597.patch
        6 kB
        Randall Leeds

        Issue Links

          Activity

          Robert Newson added a comment -

          Replication tasks are failing even if executed serially as long as databases are large enough (1.3 GB in this case). The fourth replication task has crashed.

          Stack traces from the end of my log while a replication task is hung/crashed;

[Tue, 15 Dec 2009 07:08:44 GMT] [error] [<0.49.0>] ** Generic server couch_task_status terminating
** Last message in was {#Ref<0.0.1832.61391>,3}
** When Server state == nil
** Reason for termination ==
** {function_clause,
      [{couch_task_status,handle_info,[{#Ref<0.0.1832.61391>,3},nil]},
       {gen_server,handle_msg,5},
       {proc_lib,init_p_do_apply,3}]}

[Tue, 15 Dec 2009 07:08:44 GMT] [error] [<0.45.0>] {error_report,<0.23.0>,
    {<0.45.0>,supervisor_report,
     [{supervisor,{local,couch_primary_services}},
      {errorContext,child_terminated},
      {reason,
       {function_clause,
        [{couch_task_status,handle_info,[{#Ref<0.0.1832.61391>,3},nil]},
         {gen_server,handle_msg,5},
         {proc_lib,init_p_do_apply,3}]}},
      {offender,
       [{pid,<0.49.0>},
        {name,couch_task_status},
        {mfa,{couch_task_status,start_link,[]}},
        {restart_type,permanent},
        {shutdown,brutal_kill},
        {child_type,worker}]}]}}

[Tue, 15 Dec 2009 07:08:51 GMT] [error] [<0.2720.204>] {error_report,<0.23.0>,
    {<0.2720.204>,crash_report,
     [[{initial_call,{couch_task_status,init,['Argument__1']}},
       {pid,<0.2720.204>},
       {registered_name,couch_task_status},
       {error_info,{exit,{{badmatch,[]},
                          [{couch_task_status,handle_cast,2},
                           {gen_server,handle_msg,5},
                           {proc_lib,init_p_do_apply,3}]},
                    [{gen_server,terminate,6},
                     {proc_lib,init_p_do_apply,3}]}},
       {ancestors,[couch_primary_services,couch_server_sup,<0.1.0>]},
       {messages,[]},
       {links,[<0.45.0>]},
       {dictionary,[]},
       {trap_exit,false},
       {status,running},
       {heap_size,377},
       {stack_size,24},
       {reductions,127}],
      []]}}

[Tue, 15 Dec 2009 07:08:51 GMT] [error] [<0.45.0>] {error_report,<0.23.0>,
    {<0.45.0>,supervisor_report,
     [{supervisor,{local,couch_primary_services}},
      {errorContext,child_terminated},
      {reason,{{badmatch,[]},
               [{couch_task_status,handle_cast,2},
                {gen_server,handle_msg,5},
                {proc_lib,init_p_do_apply,3}]}},
      {offender,[{pid,<0.2720.204>},
                 {name,couch_task_status},
                 {mfa,{couch_task_status,start_link,[]}},
                 {restart_type,permanent},
                 {shutdown,brutal_kill},
                 {child_type,worker}]}]}}

[Tue, 15 Dec 2009 07:08:57 GMT] [error] [<0.4889.204>] ** Generic server couch_task_status terminating
** Last message in was {'$gen_cast',
                        {update_status,<0.9558.169>,
                         <<"Copied 146001 of 271595 changes (53%)">>}}
** When Server state == nil
** Reason for termination ==
** {{badmatch,[]},
    [{couch_task_status,handle_cast,2},
     {gen_server,handle_msg,5},
     {proc_lib,init_p_do_apply,3}]}

[Tue, 15 Dec 2009 07:08:57 GMT] [error] [<0.4889.204>] {error_report,<0.23.0>,
    {<0.4889.204>,crash_report,
     [[{initial_call,{couch_task_status,init,['Argument__1']}},
       {pid,<0.4889.204>},
       {registered_name,couch_task_status},
       {error_info,{exit,{{badmatch,[]},
                          [{couch_task_status,handle_cast,2},
                           {gen_server,handle_msg,5},
                           {proc_lib,init_p_do_apply,3}]},
                    [{gen_server,terminate,6},
                     {proc_lib,init_p_do_apply,3}]}},
       {ancestors,[couch_primary_services,couch_server_sup,<0.1.0>]},
       {messages,[]},
       {links,[<0.45.0>]},
       {dictionary,[]},
       {trap_exit,false},
       {status,running},
       {heap_size,377},
       {stack_size,24},
       {reductions,127}],
      []]}}

[Tue, 15 Dec 2009 07:08:57 GMT] [error] [<0.45.0>] {error_report,<0.23.0>,
    {<0.45.0>,supervisor_report,
     [{supervisor,{local,couch_primary_services}},
      {errorContext,child_terminated},
      {reason,{{badmatch,[]},
               [{couch_task_status,handle_cast,2},
                {gen_server,handle_msg,5},
                {proc_lib,init_p_do_apply,3}]}},
      {offender,[{pid,<0.4889.204>},
                 {name,couch_task_status},
                 {mfa,{couch_task_status,start_link,[]}},
                 {restart_type,permanent},
                 {shutdown,brutal_kill},
                 {child_type,worker}]}]}}

[Tue, 15 Dec 2009 07:09:02 GMT] [error] [<0.45.0>] {error_report,<0.23.0>,
    {<0.45.0>,supervisor_report,
     [{supervisor,{local,couch_primary_services}},
      {errorContext,shutdown},
      {reason,reached_max_restart_intensity},
      {offender,[{pid,<0.6117.204>},
                 {name,couch_task_status},
                 {mfa,{couch_task_status,start_link,[]}},
                 {restart_type,permanent},
                 {shutdown,brutal_kill},
                 {child_type,worker}]}]}}

[Tue, 15 Dec 2009 07:09:02 GMT] [error] [<0.60.0>] Exit on non-updater process: killed

[Tue, 15 Dec 2009 07:09:02 GMT] [error] [<0.60.0>] ** Generic server couch_view terminating
** Last message in was {'EXIT',<0.61.0>,killed}
** When Server state == {server,"/var/lib/couchdb/0.10.0"}
** Reason for termination ==
** killed

[Tue, 15 Dec 2009 07:09:02 GMT] [error] [<0.60.0>] {error_report,<0.23.0>,
    {<0.60.0>,crash_report,
     [[{initial_call,{couch_view,init,['Argument__1']}},
       {pid,<0.60.0>},
       {registered_name,couch_view},
       {error_info,{exit,killed,
                    [{gen_server,terminate,6},
                     {proc_lib,init_p_do_apply,3}]}},
       {ancestors,[couch_secondary_services,couch_server_sup,<0.1.0>]},
       {messages,[]},
       {links,[<0.52.0>]},
       {dictionary,[]},
       {trap_exit,true},
       {status,running},
       {heap_size,2584},
       {stack_size,24},
       {reductions,5320}],
      []]}}

[Tue, 15 Dec 2009 07:09:02 GMT] [error] [<0.52.0>] {error_report,<0.23.0>,
    {<0.52.0>,supervisor_report,
     [{supervisor,{local,couch_secondary_services}},
      {errorContext,child_terminated},
      {reason,killed},
      {offender,[{pid,<0.60.0>},
                 {name,view_manager},
                 {mfa,{couch_view,start_link,[]}},
                 {restart_type,permanent},
                 {shutdown,brutal_kill},
                 {child_type,worker}]}]}}

[Tue, 15 Dec 2009 07:08:44 GMT] [error] [<0.49.0>] {error_report,<0.23.0>,
    {<0.49.0>,crash_report,
     [[{initial_call,{couch_task_status,init,['Argument__1']}},
       {pid,<0.49.0>},
       {registered_name,couch_task_status},
       {error_info,
        {exit,
         {function_clause,
          [{couch_task_status,handle_info,[{#Ref<0.0.1832.61391>,3},nil]},
           {gen_server,handle_msg,5},
           {proc_lib,init_p_do_apply,3}]},
         [{gen_server,terminate,6},
          {proc_lib,init_p_do_apply,3}]}},
       {ancestors,[couch_primary_services,couch_server_sup,<0.1.0>]},
       {messages,[]},
       {links,[<0.45.0>]},
       {dictionary,[]},
       {trap_exit,false},
       {status,running},
       {heap_size,2584},
       {stack_size,24},
       {reductions,191624}],
      []]}}

          Adam Kocoloski added a comment -

          The origin of this {#Ref<0.0.1832.61391>,3} message is a total mystery to me, but I know slyphon on IRC also saw this. We could guard against it by adding a catchall clause to couch_task_status:handle_info
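
          For reference, the suggested guard might look something like the fragment below (an illustrative sketch only, not the committed change; it simply drops unexpected messages such as the {#Ref<...>, N} tuple instead of letting the gen_server die with a function_clause error):

          %% couch_task_status.erl -- hypothetical final handle_info clause
          handle_info(_UnexpectedMsg, Server) ->
              %% swallow stray messages like {#Ref<0.0.1832.61391>,3} rather
              %% than crashing the whole task-status server
              {noreply, Server}.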

          Paul Joseph Davis added a comment -

          I'm pretty sure that this is a message leaking from couch_ref_counter. The #Ref<...> variable is the same type of response from a call to erlang:monitor(process, Pid). And couch_ref_counter.erl stores refs like this with a tuple of {Ref, Count}, which fits the message that's killing couch_task_status. That said, I don't have any idea how such a message is leaking out of the ref counter. My only guess is that there's a piece of code somewhere that's trying to catch exceptions and strips a badmatch or other error, making this look like a message that was sent instead of an error propagation.
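
          As a quick illustration of why the shapes line up, erlang:monitor/2 returns exactly this kind of reference (erl shell; the reference value is arbitrary):

          1> Ref = erlang:monitor(process, self()).
          #Ref<0.0.0.74>
          2> {Ref, 3}.
          {#Ref<0.0.0.74>,3}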

          Robert Newson added a comment -

          I can reproduce at least one problem with replication with the script below. If executed serially, the tasks complete, but if issued in parallel, they do not;

          #!/bin/sh

          URL=http://localhost:5984

          dd if=/dev/zero of=att bs=60k count=1

          curl -X DELETE $URL/db1
          curl -X DELETE $URL/db2
          curl -X DELETE $URL/db3
          curl -X DELETE $URL/db4
          curl -X DELETE $URL/db5
          curl -X PUT $URL/db1

          for COUNT in {0..100000}
          do
            curl -s -X PUT -d '{"text":"some text goes here, oh yes."}' $URL/db1/doc$((COUNT * 2)) > /dev/null
            curl -s -H "Expect: " -X PUT --data-binary @att $URL/db1/doc$((COUNT * 2 + 1))/att > /dev/null
          done

          # At least one of these tasks will fail.
          curl http://localhost:5984/_replicate -d "{\"source\":\"$URL/db1\",\"target\":\"db2\",\"create_target\":true}" &
          curl http://localhost:5984/_replicate -d "{\"source\":\"$URL/db1\",\"target\":\"db3\",\"create_target\":true}" &
          curl http://localhost:5984/_replicate -d "{\"source\":\"$URL/db1\",\"target\":\"db4\",\"create_target\":true}" &
          curl http://localhost:5984/_replicate -d "{\"source\":\"$URL/db1\",\"target\":\"db5\",\"create_target\":true}" &
          wait

          Robert Newson added a comment -

          fyi: current trunk fails with;

          {"error":"error","reason":"req_timedout"} {"error":"error","reason":"req_timedout"} {"error":"error","reason":"req_timedout"} {"error":"error","reason":"req_timedout"}
          Paul Joseph Davis added a comment -

          Does this produce the {#Ref<...>, 3} error messages as well?

          Robert Newson added a comment -


          It doesn't seem to, no.

          Robert Newson added a comment -

          well, on my production class server, the same test causes this in the log output;

          [Fri, 18 Dec 2009 17:14:43 GMT] [error] [<0.30357.50>] ** Generic server <0.30357.50> terminating
          ** Last message in was {#Ref<0.0.370.222135>,3}
          ** When Server state == {state,<0.30968.50>,<0.31048.50>,<0.31067.50>,
                                         <0.31124.50>,

          So is "{#Ref<0.0.370.222135>,3}" evidence of the same phenomenon?

          Paul Joseph Davis added a comment -

          No idea. I'll try and reproduce when I get home in a bit.

          Adam Kocoloski added a comment -

          Hi Robert, I can reproduce the crashes locally and I've discovered why they happen independently of the {ref(), integer()} problem. The basic issue is that attachment downloads do not employ the same retry checks that we do for regular document GETs. For instance, the attachment receiver process associated with a replication would be waiting indefinitely for response headers, when in fact it had an error message in its mailbox informing it that the request had failed. Eventually the changes feed times out and the replication crashes.

          If I apply http://friendpaste.com/5IA5MlRx0OZhKmsLNPMeJe, crank up the changes feed timeout, and add the catchall handle_infos we've talked about before, I can successfully run the script you posted here. We have more work to do, though, namely:

          1) Reworking the changes feed timeout. Currently it will trigger if there is no activity for X milliseconds on the connection handling the changes feed. There are situations where this is actually normal, since the changes feed consumer is responsible for controlling the socket, and if the target is really slow (or the documents are huge) it's quite possible that the changes feed will not be consulted for a long time. I think the solution is to handle inactivity timeouts in couch_rep_changes_feed.erl instead of in the underlying ibrowse system.

          2a) Attachment retry logic that handles redirects and limits the number of retries. Basically, the same code as we have in couch_rep_httpc, but only applied until we receive the response headers (see the sketch after this list). My friendpaste above is a primitive form of what I'd ultimately like to see here.

          2b) When an attachment body download has started and then fails, we can't simply retry it. We need to do a Range request or find another way to skip the first N bytes of the retry. Currently we just give up on the entire replication if an attachment request ever fails mid-download.
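
          A rough sketch of the 2a idea, waiting for the ibrowse response headers with a bounded retry instead of blocking forever (illustrative only, not a patch: retry_request/1 is a hypothetical helper that re-issues the request and returns the new request id, and the retry count and timeout are made-up values):

          %% Wait for attachment response headers; retry a bounded number of
          %% times, surfacing a real error once the retries are exhausted.
          wait_for_headers(_ReqId, 0) ->
              throw({attachment_request_failed, retries_exhausted});
          wait_for_headers(ReqId, RetriesLeft) ->
              receive
                  {ibrowse_async_headers, ReqId, _Code, _Headers} ->
                      ok;
                  {ibrowse_async_response, ReqId, {error, _Reason}} ->
                      %% the request already failed; back off and re-issue it
                      timer:sleep(1000),
                      wait_for_headers(retry_request(ReqId), RetriesLeft - 1)
              after 10000 ->
                  %% no headers at all within 10 seconds; assume it stalled
                  wait_for_headers(retry_request(ReqId), RetriesLeft - 1)
              end.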

          Adam Kocoloski added a comment -

          Also, I don't really understand why the "after" clause is necessary in that paste. I tried adding a connect_timeout to ibrowse but didn't get any conn_failed messages. It really does seem like a connection is made but then the request just stalls. I suppose it's possible that a connection took 9 seconds (e.g. 3 consecutive TCP retransmits), and then CouchDB took more than 1 second to respond with the headers. Seems unlikely, though. It makes me think we need to add this "after" clause to couch_rep_httpc too.

          Randall Leeds added a comment -

          Seems totally possible to me that it would take CouchDB more than a second to respond with headers. I've seen it take CouchDB more than a second to respond to / with the welcome message when load is high enough.

          If attachments are downloaded over the same connection as documents I don't really see how we can get any better than cranking the timeout and praying it's enough for most cases. If attachments had their own connection then the primary changes feed could heartbeat continuous replications, but I don't like this solution much.

          Is there any reason a receiver needs a timeout at all for this? If the source process dies or CouchDB dies entirely on the other end we should get an error reading from the socket. If we don't, who are we to assume that the source isn't just slow? Doesn't seem like there should be a timeout at all to me.

          Randall Leeds added a comment -

          Thinking some more I remember that the internet is difficult.

          +1 on keeping the timeout. I've never actually failed to replicate something due to the timeout if I retry enough times, but sealing off the #Ref leak or catching unexpected messages seems like a Really Important Thing to do. My 2c.

          Randall Leeds added a comment -

          I'm also seeing a similar function_clause with couch_view:handle_info receiving a #Ref.

          However, in this case it doesn't crash a replication or the server, but seems to throw the server into a state where it is running but cannot respond to connections at all. Maybe something else is crashing as well. In any case, this isn't cool.

          Randall Leeds added a comment -

          I believe this patch fixes most of the problems we're seeing here.

          The solution, as discussed, is to remove the inactivity_timeout from options passed to ibrowse and handle timeouts manually (here using the timer module).

          In my testing, I could mostly reproduce timeouts caused by not reading data from ibrowse fast enough. In other words, replicating from a remote database was terminating because processing the changes was taking a long time to complete and the socket would be inactive while couch_rep_changes_feed had a full queue of rows. Therefore, a timeout is not set unless the missing revs server is waiting for more changes.

          Timeouts should still occur if the socket is idle and the local queue of received changes is empty. Errors should be caught appropriately such that real problems still bubble.
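
          A rough sketch of that arm-only-while-waiting idea (illustrative only, not the actual patch; the #state fields, message name, and 30-second value are assumptions):

          %% Arm an inactivity timer only while the missing-revs server is
          %% waiting for more changes; cancel it as soon as rows arrive.
          arm_timeout(State) ->
              {ok, TRef} = timer:send_after(30000, changes_timeout),
              State#state{timeout_ref = TRef}.

          disarm_timeout(#state{timeout_ref = undefined} = State) ->
              State;
          disarm_timeout(#state{timeout_ref = TRef} = State) ->
              timer:cancel(TRef),
              State#state{timeout_ref = undefined}.

          handle_info(changes_timeout, #state{waiting = true} = State) ->
              %% the consumer is blocked and the socket is genuinely idle: stop
              {stop, changes_timeout, State};
          handle_info(changes_timeout, State) ->
              %% rows are queued locally, so an idle socket is expected: ignore
              {noreply, State}.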

          I implemented retry logic for attachments in a manner similar to couch_rep_httpc. I had to add some after statements now that the inactivity_timeout is not set.

          The patch applies cleanly to trunk and 0.11.x, so please review!!! I think this would be a very good patch to get into 0.11 so long as Noah hasn't built the artifacts yet.

          Randall Leeds added a comment -

          I went back and made this patch SUPER simple and straightforward.
          Applies to the very most current trunk.
          This should take no more than a minute to review; it's super simple now.

          Randall Leeds added a comment -

          Forgot to check the inclusion box.

          Jan Lehnardt added a comment -

          Applied in trunk in r916868 and 0.11.x in r916869.

          Randall Leeds added a comment -

          Corrects problems with continuous replication timeouts introduced by r916518 and r916868.

          Randall Leeds added a comment -

          I just wanted to tie up some loose ends here.

          I had added the after clauses, but they're unnecessary. The original fix takes out the inactivity timeout on the #http_db in couch_rep_changes_feed but couch_rep_att still inherits the default configuration from when it's created by couch_rep (default 30 seconds in couch_db.hrl).

          I forgot to do exponential backoff, so that's added here.

          Retry for attachments should not be only for timeout errors. Safer to do it for whatever error.
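
          A toy illustration of the backoff shape being described (illustrative names only, not the patch): double the pause after each failed attempt, e.g. 16, 32, 64, 128 seconds.

          %% Retry Fun up to RetriesLeft times, doubling the pause (in ms)
          %% between attempts; give up only when the retries are exhausted.
          retry_with_backoff(_Fun, _Pause, 0) ->
              {error, retries_exhausted};
          retry_with_backoff(Fun, Pause, RetriesLeft) ->
              case Fun() of
                  {ok, Result} ->
                      {ok, Result};
                  {error, _Reason} ->
                      timer:sleep(Pause),
                      retry_with_backoff(Fun, Pause * 2, RetriesLeft - 1)
              end.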

          Randall Leeds added a comment -

          Re-opening. Still happening on 0.11, but for different reasons.

          Germain reports this from his log on the user@ list:

          [Fri, 26 Mar 2010 09:55:01 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 16.0 seconds due to {error, req_timedout}
          [Fri, 26 Mar 2010 09:56:13 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 32.0 seconds due to {error, req_timedout}
          [Fri, 26 Mar 2010 09:57:42 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 64.0 seconds due to {error, req_timedout}
          [Fri, 26 Mar 2010 09:59:40 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 128.0 seconds due to {error, req_timedout}

          In my experience with this in production, I've seen PUT requests get stalled here writing the checkpoint document. I'm guessing the log above is the _ensure_full_commit failing (since that's the only POST in replication, I think). In my logs I see 409 conflicts writing the remote checkpoint document but only timeouts on the receiving side of those conflict PUTs. I'm not sure why the conflict doesn't bubble up to couch_rep. My first guess is that maybe we're not asking ibrowse to stream the next chunk in some code path and the remote side has sent a response that we never retrieve.

          Randall Leeds added a comment -

          Looks like I can't re-open? Can I? Would someone with commit access mind changing the status of this bug?

          Adam Kocoloski added a comment -

          Couldn't it also be the case that a _bulk_docs POST would take longer than 30 seconds?

          Randall Leeds added a comment -

          In my case, it's a put request and pull replication, which doesn't use bulk docs.
          Maybe in Germain's case it is a push replication, so it could be bulk docs.

          If it's taking longer than 30 seconds for bulk docs to complete, how should we handle this?
          We could increase the timeout the same way we increase the pause time in the backoff code, but I'm not sure this is the right way to handle it.

          Filipe Manana added a comment -

          Regarding _bulk_docs response timeout,

          perhaps adding a heartbeat option like we do for _changes?

          Sounds simpler than supplying explicit timeout values for the ibrowse connections.

          cheers

          Randall Leeds added a comment -

          I'd have to look at how _bulk_docs works, but if it streams the results back out that should be enough.
          ibrowse won't time out unless there's no activity at all for a contiguous period of <timeout> seconds, so if couch can respond with one doc at a time in a chunked response it'd be ok. Whitespace between inner json objects for each doc should be fine, too, so a heartbeat parameter could work, but it might require some refactoring. I seem to remember couch generating the entire bulk docs response object before sending it back, and that would have to change.

          I think it'd be a "really good idea" to do this though.

          Filipe Manana added a comment -

          It doesn't stream the results.

          Theoretically it should be very simple: just create a process that sends a newline every X ms, and after the call to couch_db:update_docs/4 stop it (by sending it a special message, for example).

          Good luck
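
          A minimal sketch of that heartbeat helper (illustrative only; send_chunk/2 stands in for CouchDB's chunked-response writer and the surrounding names are made up):

          %% Spawn a helper that writes a newline to the chunked response
          %% every Interval milliseconds until it receives 'stop'.
          spawn_heartbeat(Resp, Interval) ->
              spawn_link(fun() -> heartbeat_loop(Resp, Interval) end).

          heartbeat_loop(Resp, Interval) ->
              receive
                  stop ->
                      ok
              after Interval ->
                  send_chunk(Resp, <<"\n">>),
                  heartbeat_loop(Resp, Interval)
              end.

          %% Usage around the blocking call:
          %%   HB = spawn_heartbeat(Resp, 10000),
          %%   {ok, Results} = couch_db:update_docs(Db, Docs, Options, Type),
          %%   HB ! stop.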

          Jan Lehnardt added a comment -

          Bump to 1.3.x

          Alexander Shorin added a comment -

          Fixed in 64572be by Jan Lehnardt (:


            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Newson
            • Votes: 2
            • Watchers: 3

              Dates

              • Created:
                Updated:
                Resolved:

                Development