Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.11
    • Fix Version/s: None
    • Component/s: Replication
    • Labels:
      None
    • Environment:

      Arch Linux, CouchDB 0.11

    • Skill Level:
      Regular Contributors Level (Easy to Medium)

      Description

      CouchDB 0.11.0 replication tasks fail with the error below after working for anywhere from a few minutes to an hour. The replication is of the type

      {"source":"http://127.0.0.1:5984/node-metrics", "target":"http://1.2.3.4:5984/node-metrics", "continuous":true}

      and the node-metrics database exists on both machines.

      The database is periodically compacted, which (and I'm speculating here) could be a contributing factor to the crash.

      Kind regards,
      Fredrik Widlund

      =CRASH REPORT==== 1-Apr-2010::14:25:26 ===
      crasher:
      initial call: couch_rep:init/1
      pid: <0.274.0>
      registered_name: []
      exception exit: {{badmatch,
                        {stop,
                         {db_not_found,<<"http://127.0.0.1:5984/node-metrics/">>}}},
                       [{couch_rep,do_checkpoint,1},
                        {couch_rep,handle_cast,2},
                        {gen_server,handle_msg,5},
                        {proc_lib,init_p_do_apply,3}]}
        in function gen_server:terminate/6
      ancestors: [couch_rep_sup,couch_primary_services,couch_server_sup,<0.32.0>]
      messages: [{'EXIT',<0.21084.1>,normal}]
      links: [<0.81.0>]
      dictionary: [{task_status_update,{{1270,124726,124009},0}}]
      trap_exit: true
      status: running
      heap_size: 10946
      stack_size: 24
      reductions: 29173458
      neighbours:

      [error] [<0.81.0>] {error_report,<0.31.0>,
          {<0.81.0>,supervisor_report,
           [{supervisor,{local,couch_rep_sup}},
            {errorContext,child_terminated},
            {reason,
             {{badmatch,
               {stop,
                {db_not_found,<<"http://127.0.0.1:5984/node-metrics/">>}}},
              [{couch_rep,do_checkpoint,1},
               {couch_rep,handle_cast,2},
               {gen_server,handle_msg,5},
               {proc_lib,init_p_do_apply,3}]}},
            {offender,
             [{pid,<0.274.0>},
              {name,"f3e3081db5a215dbaf9b2984f0552090+continuous"},
              {mfa,
               {gen_server,start_link,
                [couch_rep,
                 ["f3e3081db5a215dbaf9b2984f0552090",
                  {[{<<"target">>,<<"http://1.2.3.4:5984/node-metrics">>},
                    {<<"source">>,<<"http://127.0.0.1:5984/node-metrics">>},
                    {<<"continuous">>,true}]},
                  {user_ctx,null,
                   [<<"_admin">>],
                   <<"{couch_httpd_auth, default_authentication_handler}">>}],
                 []]}},
              {restart_type,temporary},
              {shutdown,1},
              {child_type,worker}]}]}}

      =SUPERVISOR REPORT==== 1-Apr-2010::14:25:26 ===
           Supervisor: {local,couch_rep_sup}
           Context:    child_terminated
           Reason:     {{badmatch,
                         {stop,
                          {db_not_found,<<"http://127.0.0.1:5984/node-metrics/">>}}},
                        [{couch_rep,do_checkpoint,1},
                         {couch_rep,handle_cast,2},
                         {gen_server,handle_msg,5},
                         {proc_lib,init_p_do_apply,3}]}
           Offender:   [{pid,<0.274.0>},
                        {name,"f3e3081db5a215dbaf9b2984f0552090+continuous"},
                        {mfa,
                         {gen_server,start_link,
                          [couch_rep,
                           ["f3e3081db5a215dbaf9b2984f0552090",
                            {[{<<"target">>,<<"http://1.2.3.4:5984/node-metrics">>},
                              {<<"source">>,<<"http://127.0.0.1:5984/node-metrics">>},
                              {<<"continuous">>,true}]},
                            {user_ctx,null,
                             [<<"_admin">>],
                             <<"{couch_httpd_auth, default_authentication_handler}">>}],
                           []]}},
                        {restart_type,temporary},
                        {shutdown,1},
                        {child_type,worker}]

        Issue Links

          Activity

          Jan Lehnardt added a comment -

          Please reopen if this still happens with 1.1.x and/or master

          Adam Kocoloski added a comment -

          Created COUCHDB-744 for the instance_start_time thing.

          Adam Kocoloski added a comment -

          I'm not sure why the instance_start_time needs to change. I think we could keep the old instance_start_time on the compaction switchover and thus avoid the reboot altogether.

          Adam Kocoloski added a comment -

          So, the db_not_found messages are unfortunately masking the real error in this case. If we look at couch_rep_httpc:db_exists, we see that it's catching every class of error, logging it at debug level, and then throwing the db_not_found exception.

          Randall, I'm guessing your fix for COUCHDB-730 will take care of this ticket. Consider the following sequence of events:

          1) user initiates replication, replicator gets an instance_start_time
          2) database compaction completes, instance_start_time changes
          3) replicator calls _ensure_full_commit, discovers that start time has changed, tries to reboot
          4) ibrowse call in db_exists fails with {error, connection_closed} b/c of the bug in COUCHDB-730.
          5) couch_rep_httpc throws db_not_found

          Fredrik, you might try

          a) checking the actual reason for the db_not_found error, either by logging at debug level or by editing the code and changing the message in couch_rep_httpc:db_exists/3 to a ?LOG_INFO or ?LOG_ERROR

          b) applying the patch from COUCHDB-730. It's already on the 0.11.x branch if you want to simply build from there.
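
          For reference, here is a rough sketch of the pattern being described (this is not the actual couch_rep_httpc source; the function shape, argument names, and log message are assumptions): a catch-all clause logs the failure at debug level only and re-throws db_not_found, which hides the underlying reason such as {error, connection_closed}.

              %% Hedged sketch only, not the real couch_rep_httpc:db_exists/3.
              db_exists(Req, CanonicalUrl, Url) ->
                  case catch ibrowse:send_req(Url, [], head) of
                      {ok, "200", _Headers, _Body} ->
                          Req;
                      Error ->
                          %% Bumping this to ?LOG_INFO or ?LOG_ERROR, as suggested
                          %% in (a) above, would surface the real failure reason.
                          ?LOG_DEBUG("DB at ~s could not be found: ~p", [Url, Error]),
                          throw({db_not_found, list_to_binary(CanonicalUrl)})
                  end.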

          Randall Leeds added a comment -

          This patch does not solve the issue of replication crashing, or really make any attempt to.

          The patch causes replication to restart by calling couch_rep:do_init/1 instead of couch_rep:init/1. The couch_rep:init/1 call traps the db_not_found exception so it can return a stop tuple and abort the gen_server. Instead, letting the exception bubble up to couch_httpd_misc_handlers should generate an appropriate error response to the client and avoid the incomprehensible badmatch nonsense.

          *This patch should be applied but does NOT close the issue*

          I'm continuing to investigate.
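
          As a simplified illustration of the difference described above (a sketch under assumptions, not the actual couch_rep code): init/1 traps the throw and turns it into a stop tuple, so a caller on the reboot path that expects {ok, State} fails with the badmatch seen in the reports, whereas calling do_init/1 directly lets db_not_found propagate to the HTTP layer.

              %% Hedged sketch only, not the real couch_rep module.
              init(InitArgs) ->
                  try
                      do_init(InitArgs)   %% returns {ok, State} on success
                  catch
                      throw:{db_not_found, DbUrl} ->
                          %% A caller that matches {ok, State} against this
                          %% return value crashes with a badmatch.
                          {stop, {db_not_found, DbUrl}}
                  end.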

          Fredrik Widlund added a comment -

          A grep collection of crashes, if it's helpful.

          [root@db3 scripts]# grep -B2 -A2 -E "[error] .*terminating" couchdb.stdout
          [info] [<0.6318.0>] 127.0.0.1 - - 'POST' /node-metrics/_ensure_full_commit?seq=69308 201
          [info] [<0.291.0>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known replication\
          checkpoint
          [error] [<0.291.0>] ** Generic server <0.291.0> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.6763.0>,<0.6767.0>,<0.6770.0>,<0.6772.0>,
          --
          [info] [<0.31361.1>] 127.0.0.1 - - 'POST' /service-metrics/_ensure_full_commit?seq=98608 201
          [info] [<0.273.0>] rebooting http://127.0.0.1:5984/service-metrics/ -> http://1.2.3.5:5984/service-metrics/ from last known repli\
          cation checkpoint
          [error] [<0.273.0>] ** Generic server <0.273.0> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.31620.1>,<0.31625.1>,<0.31627.1>,
          --
          [info] [<0.24154.5>] 127.0.0.1 - - 'POST' /node-metrics/_ensure_full_commit?seq=230868 201
          [info] [<0.15120.5>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known replicati\
          on checkpoint
          [error] [<0.15120.5>] ** Generic server <0.15120.5> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.24125.5>,<0.24129.5>,<0.24132.5>,
          --
          [info] [<0.4380.7>] 127.0.0.1 - - 'POST' /node-metrics/_ensure_full_commit?seq=248027 201
          [info] [<0.3606.7>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known replicatio\
          n checkpoint
          [error] [<0.3606.7>] ** Generic server <0.3606.7> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.4317.7>,<0.4322.7>,<0.4324.7>,<0.4326.7>,
          --
          [info] [<0.15414.7>] 127.0.0.1 - - 'POST' /service-metrics/_ensure_full_commit?seq=231731 201
          [info] [<0.15142.5>] rebooting http://127.0.0.1:5984/service-metrics/ -> http://1.2.3.5:5984/service-metrics/ from last known rep\
          lication checkpoint
          [error] [<0.15142.5>] ** Generic server <0.15142.5> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.15516.7>,<0.15521.7>,<0.15523.7>,
          --
          [info] [<0.26905.7>] 127.0.0.1 - - 'POST' /node-metrics/_ensure_full_commit?seq=255490 201
          [info] [<0.16250.7>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known replicati\
          on checkpoint
          [error] [<0.16250.7>] ** Generic server <0.16250.7> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.27125.7>,<0.27129.7>,<0.27132.7>,
          --
          [info] [<0.8487.8>] 127.0.0.1 - - 'POST' /service-metrics/_ensure_full_commit?seq=240461 201
          [info] [<0.16228.7>] rebooting http://127.0.0.1:5984/service-metrics/ -> http://1.2.3.5:5984/service-metrics/ from last known rep\
          lication checkpoint
          [error] [<0.16228.7>] ** Generic server <0.16228.7> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.8531.8>,<0.8536.8>,<0.8538.8>,<0.8540.8>,
          --
          [info] [<0.15483.8>] 127.0.0.1 - - 'POST' /service-metrics/_ensure_full_commit?seq=247246 201
          [info] [<0.15504.8>] rebooting http://127.0.0.1:5984/service-metrics/ -> http://1.2.3.5:5984/service-metrics/ from last known rep\
          lication checkpoint
          [error] [<0.15504.8>] ** Generic server <0.15504.8> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.15557.8>,<0.15563.8>,<0.15567.8>,
          --
          [info] [<0.15481.8>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known replicati\
          on checkpoint
          [info] [<0.16982.8>] 1.2.3.5 - - 'POST' /node-metrics/_ensure_full_commit 201
          [error] [<0.15481.8>] ** Generic server <0.15481.8> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.16926.8>,<0.16930.8>,<0.16933.8>,
          --
          [info] [<0.20255.8>] 127.0.0.1 - - 'POST' /node-metrics/_ensure_full_commit?seq=269770 201
          [info] [<0.18127.8>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known replicati\
          on checkpoint
          [error] [<0.18127.8>] ** Generic server <0.18127.8> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.20451.8>,<0.20455.8>,<0.20458.8>,
          --
          [info] [<0.30782.8>] 127.0.0.1 - - 'POST' /node-metrics/_ensure_full_commit?seq=272628 201
          [info] [<0.22327.8>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known replicati\
          on checkpoint
          [error] [<0.22327.8>] ** Generic server <0.22327.8> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.30991.8>,<0.30995.8>,<0.30998.8>,
          --
          [info] [<0.20666.1>] 127.0.0.1 - - 'POST' /node-metrics/_ensure_full_commit?seq=288432 201
          [info] [<0.274.0>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known replication\
          checkpoint
          [error] [<0.274.0>] ** Generic server <0.274.0> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.20538.1>,<0.20542.1>,<0.20545.1>,
          --
          [info] [<0.28001.2>] 127.0.0.1 - - 'POST' /node-metrics/_ensure_full_commit?seq=295892 201
          [info] [<0.21122.1>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known replicati\
          on checkpoint
          [error] [<0.21122.1>] ** Generic server <0.21122.1> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.27919.2>,<0.27923.2>,<0.27926.2>,
          --
          [info] [<0.16437.3>] 1.2.3.4 - - 'GET' /service-metrics/_design/views/_view/allmetrics 200
          [error] [<0.256.0>] couch_rep_httpc request failed after 10 retries: http://1.2.3.5:5984/service-metrics/
          [error] [<0.256.0>] ** Generic server <0.256.0> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.15441.3>,<0.15446.3>,<0.15448.3>,
          --
          reductions: 4021
          neighbours:
          [error] [<0.15448.3>] ** Generic server <0.15448.3> terminating
          ** Last message in was {'EXIT',<0.15449.3>,
          {{http_request_failed,
          --
          {child_type,worker}]
          [error] [<0.15441.3>] ** Generic server <0.15441.3> terminating
          ** Last message in was {'EXIT',<0.256.0>,
          {http_request_failed,
          --
          neighbours:
          [error] [<0.9022.3>] couch_rep_httpc request failed after 10 retries: http://1.2.3.5:5984/node-metrics/
          [error] [<0.9022.3>] ** Generic server <0.9022.3> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.15170.3>,<0.15174.3>,<0.15177.3>,
          --
          {child_type,worker}]
          [error] [<0.15174.3>] ** Generic server <0.15174.3> terminating
          ** Last message in was {'EXIT',<0.9022.3>,
          {http_request_failed,
          --
          reductions: 7005
          neighbours:
          [error] [<0.15177.3>] ** Generic server <0.15177.3> terminating
          ** Last message in was {'EXIT',<0.9022.3>,
          {http_request_failed,
          --
          {stack_size,15},
          {reductions,5200}]
          [error] [<0.15170.3>] ** Generic server <0.15170.3> terminating
          ** Last message in was {'EXIT',<0.9022.3>,
          {http_request_failed,
          --
          [info] [<0.31343.3>] 127.0.0.1 - - 'POST' /service-metrics/_ensure_full_commit?seq=292618 201
          [info] [<0.18230.3>] rebooting http://127.0.0.1:5984/service-metrics/ -> http://1.2.3.5:5984/service-metrics/ from last known rep\
          lication checkpoint
          [error] [<0.18230.3>] ** Generic server <0.18230.3> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.31889.3>,<0.31894.3>,<0.31896.3>,

          Fredrik Widlund added a comment -

          The service-metrics database is also replicated, to the same target. The CouchDB instances communicate directly with each other, without any proxy, rewriting, or address translation.

          I'm afraid the entries from the last mail were probably from a crash on the opposite instance. The log below should be from the same crash as the first one. This crash actually didn't have the completion of the service-metrics compaction directly before it.

          [info] [<0.20666.1>] 127.0.0.1 - - 'POST' /node-metrics/_ensure_full_commit?seq=288432 201
          [info] [<0.274.0>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known repl\
          ication checkpoint
          [error] [<0.274.0>] ** Generic server <0.274.0> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.20538.1>,<0.20542.1>,<0.20545.1>,
                  <0.20547.1>,
                  {http_db,"http://127.0.0.1:5984/node-metrics/",
                   [],[],
                   [{"User-Agent","CouchDB/0.11.0"},
                    {"Accept","application/json"},
                    {"Accept-Encoding","gzip"}],
                   [],get,nil,
                   [{response_format,binary},
                    {inactivity_timeout,30000}],
                   10,500,nil},
                  {http_db,"http://1.2.3.5:5984/node-metrics/",
                   [],[],
                   [{"User-Agent","CouchDB/0.11.0"},
                    {"Accept","application/json"},
                    {"Accept-Encoding","gzip"}],
                   [],get,nil,
                   [{response_format,binary},
                    {inactivity_timeout,30000}],
                   10,500,nil},
                  true,false,
                  ["f3e3081db5a215dbaf9b2984f0552090",
                   {[{<<"target">>,<<"http://1.2.3.5:5984/node-metrics">>},
                     {<<"source">>,<<"http://127.0.0.1:5984/node-metrics">>},
                     {<<"continuous">>,true}]},
                   {user_ctx,null,
                    [<<"_admin">>],
                    <<"{couch_httpd_auth, default_authentication_handler}">>}],
                  {1270124726131655,#Ref<0.0.11.78165>},
                  288246,
                  [...many, many session id entries]
                  [],false,288432,1163577,nil}
          ** Reason for termination ==
          ** {{badmatch,{stop,{db_not_found,<<"http://127.0.0.1:5984/node-metrics/">>}}},
             [{couch_rep,do_checkpoint,1},
              {couch_rep,handle_cast,2},
              {gen_server,handle_msg,5},
              {proc_lib,init_p_do_apply,3}]}

          =ERROR REPORT==== 1-Apr-2010::14:25:26 ===
          ** Generic server <0.274.0> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.20538.1>,<0.20542.1>,<0.20545.1>,
                  <0.20547.1>,
                  Unknown macro: {http_db,"http} ,
                  true,false,
                  ["f3e3081db5a215dbaf9b2984f0552090",
                   {[{<<"target">>,<<"http://1.2.3.5:5984/node-metrics">>},
                     {<<"source">>,<<"http://127.0.0.1:5984/node-metrics">>},
                     {<<"continuous">>,true}]},
                   {user_ctx,null,
                    [<<"_admin">>],
                    <<"{couch_httpd_auth, default_authentication_handler}">>}],
                  {1270124726131655,#Ref<0.0.11.78165>},
                  288246,
                  [...many, many session id entries]
                  [],false,288432,1163577,nil}
          ** Reason for termination ==
          ** {{badmatch,{stop,{db_not_found,<<"http://127.0.0.1:5984/node-metrics/">>}}},
             [{couch_rep,do_checkpoint,1},
              {couch_rep,handle_cast,2},
              {gen_server,handle_msg,5},
              {proc_lib,init_p_do_apply,3}]}

          [error] [<0.274.0>] {error_report,<0.31.0>,
              {<0.274.0>,crash_report,
               [[{initial_call,{couch_rep,init,['Argument__1']}},
                 {pid,<0.274.0>},
                 {registered_name,[]},
                 {error_info,
                  {exit,
                   {{badmatch,
                     {stop,
                      {db_not_found,<<"http://127.0.0.1:5984/node-metrics/">>}}},
                    [{couch_rep,do_checkpoint,1},
                     {couch_rep,handle_cast,2},
                     {gen_server,handle_msg,5},
                     {proc_lib,init_p_do_apply,3}]},
                   [{gen_server,terminate,6},{proc_lib,init_p_do_apply,3}]}},
                 {ancestors,[couch_rep_sup,couch_primary_services,couch_server_sup,<0.32.0>]},
                 {messages,[{'EXIT',<0.21084.1>,normal}]},
                 {links,[<0.81.0>]},
                 {dictionary,[{task_status_update,{{1270,124726,124009},0}}]},
                 {trap_exit,true},
                 {status,running},
                 {heap_size,10946},
                 {stack_size,24},
                 {reductions,29173458}],
                []]}}

          =CRASH REPORT==== 1-Apr-2010::14:25:26 ===
          [...follows below...]

          Randall Leeds added a comment -

          I'm rather confused.

          The compaction seems to be on the service-metrics database, but the replication is between databases named node-metrics.
          However, there's a POST to /service-metrics/_missing_revs on the target database right around the time compaction completes. Replication performs this operation. Are you using vhosts or some kind of proxy layer that's rewriting any of your requests? Could you include a little bit more context at the end where you put the ...? In particular I want to know if the replication was using the service-metrics database at all.

          Fredrik Widlund added a comment -

          Hi,

          Probably a more informative log:

          [info] [<0.26977.2>] 1.2.3.4 - - 'POST' /service-metrics/_compact 202
          [info] [<0.146.0>] Starting compaction for db "service-metrics"
          [info] [<0.26627.2>] 127.0.0.1 - - 'GET' /service-metrics/Mon0.n6-www0.n101?open_revs=["56844-2393e6afa315d62d6f98996a5402f0f7"]&\
          revs=true&latest=true 200
          [info] [<0.26704.2>] 1.2.3.5 - - 'POST' /node-metrics/_missing_revs 200
          [info] [<0.26755.2>] 1.2.3.5 - - 'POST' /service-metrics/_missing_revs 200
          [info] [<0.26977.2>] 1.2.3.4 - - 'PUT' /service-metrics/Mon1.n7-www0.n102 201
          [info] [<0.26627.2>] 127.0.0.1 - - 'GET' /service-metrics/Mon1.n7-www0.n102?open_revs=["23834-9d230c7449a9321e51e9d5983ef00d47"]&\
          revs=true&latest=true 200
          [info] [<0.26704.2>] 1.2.3.5 - - 'POST' /service-metrics/_missing_revs 200
          [info] [<0.26977.2>] 1.2.3.4 - - 'PUT' /node-metrics/Mon1.n7-n302 201
          [info] [<0.26988.2>] 1.2.3.4 - - 'PUT' /service-metrics/Mon1.n7-www0.n301 201
          [info] [<0.26627.2>] 127.0.0.1 - - 'GET' /node-metrics/Mon1.n7-n302?open_revs=["21751-f857ed9d519bff3f054abcd990a8182c"]&revs=tru\
          e&latest=true 200
          [info] [<0.26682.2>] 127.0.0.1 - - 'GET' /service-metrics/Mon1.n7-www0.n301?open_revs=["26450-c16e040281883c61c62ef7d2c4f2a7ef"]&\
          revs=true&latest=true 200
          [info] [<0.26755.2>] 1.2.3.5 - - 'POST' /service-metrics/_missing_revs 200
          [info] [<0.26704.2>] 1.2.3.5 - - 'POST' /service-metrics/_bulk_docs 201
          [info] [<0.26988.2>] 1.2.3.4 - - 'PUT' /service-metrics/Mon1.n7-fl1.ds18 201
          [info] [<0.26755.2>] 1.2.3.5 - - 'POST' /service-metrics/_missing_revs 200
          [info] [<0.26704.2>] 1.2.3.5 - - 'POST' /node-metrics/_missing_revs 200
          [info] [<0.26627.2>] 127.0.0.1 - - 'GET' /service-metrics/Mon1.n7-fl1.ds18?open_revs=["21147-d965415af5ac43e96f94b9df5cdf7b2f"]&r\
          evs=true&latest=true 200
          [info] [<0.146.0>] Compaction file still behind main file (update seq=359295. compact update seq=359291). Retrying.
          [info] [<0.26704.2>] 1.2.3.5 - - 'POST' /service-metrics/_missing_revs 200
          [info] [<0.26988.2>] 1.2.3.4 - - 'PUT' /service-metrics/Mon1.n7-www0.n101 201
          [info] [<0.146.0>] Compaction file still behind main file (update seq=359296. compact update seq=359295). Retrying.
          [info] [<0.26627.2>] 127.0.0.1 - - 'GET' /service-metrics/Mon1.n7-www0.n101?open_revs=["21916-c83856ef70eb8dfcd8f3449406fb4a02"]&\
          revs=true&latest=true 200
          [info] [<0.146.0>] Compaction for db "service-metrics" completed.
          [info] [<0.26704.2>] 1.2.3.5 - - 'POST' /service-metrics/_missing_revs 200
          [info] [<0.26627.2>] 127.0.0.1 - - 'POST' /node-metrics/_ensure_full_commit?seq=266207 201
          [info] [<0.13563.1>] rebooting http://127.0.0.1:5984/node-metrics/ -> http://1.2.3.5:5984/node-metrics/ from last known replicati\
          on checkpoint
          [error] [<0.13563.1>] ** Generic server <0.13563.1> terminating
          ** Last message in was {'$gen_cast',do_checkpoint}
          ** When Server state == {state,<0.26586.2>,<0.26590.2>,<0.26593.2>,
                  <0.26595.2>,
                  Unknown macro: {http_db,"http} ,
                  {http_db,
                  [...]

          Randall Leeds added a comment -

          It would be helpful to know if this happens only when compaction completes.

          The replicator has retry logic for transient failures, but that does not include a 404 response from the source. IMO that's a bug in the compaction code.

          I'll take a closer look, though.
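
          In other words (a hedged sketch of the behaviour just described, not the actual retry code; the function and argument names are made up for illustration): network-level failures are retried, but a 404 from the source is converted straight into db_not_found and never retried.

              %% Hedged sketch only, illustrating "retry transient failures, but not a 404".
              handle_reply({error, Reason}, Req, Url) when Reason =:= connection_closed;
                                                           Reason =:= req_timedout ->
                  retry(Req, Url);                              %% transient, retried
              handle_reply({ok, "404", _Headers, _Body}, _Req, Url) ->
                  throw({db_not_found, list_to_binary(Url)});   %% fatal, not retried
              handle_reply({ok, _Code, _Headers, Body}, _Req, _Url) ->
                  {ok, Body}.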


            People

            • Assignee:
              Unassigned
            • Reporter:
              Fredrik Widlund
            • Votes:
              1
            • Watchers:
              3

            Dates

            • Created:
            • Updated:
            • Resolved: