CouchDB - COUCHDB-1364

Replication hanging/failing on docs with lots of revisions

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.3, 1.1.1
    • Fix Version/s: None
    • Component/s: Replication
    • Environment:

      Description

      We have a setup where replication from a 1.1.1 couch is hanging - this is WAN replication which previously worked 1.0.3 <-> 1.0.3.

      Replicating from the 1.1.1 -> 1.0.3 showed an error very similar to COUCHDB-1340 - which I presumed meant the URL was too long. So I upgraded the 1.0.3 couch to our 1.1.1 build, which had this patched.

      However, the replication between the two 1.1.1 couches hangs at a certain point when doing continuous pull replication - it doesn't checkpoint and just stays on "starting". When cancelled and restarted, though, it does pick up the latest documents (so doc counts are equal). The last calls I see to the source db when it hangs are multiple long GETs for a document with 2051 open revisions on the source and 498 on the target.

      When doing a push replication the _replicate call just gives a 500 error (at about the same seq id as the pull replication hangs at) saying:

      [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died with reason
          {noproc, {gen_server, call, [<0.6382.115>, {pread_iolist, 79043596434}, infinity]}}

      when the last call in the target of the push replication is:

      [Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST' /master_db/_missing_revs 200

      with no stack trace.

      Comparing the open_revs=all count on the documents with many open revs shows differing numbers on each side of the replication WAN and between different couches in the same datacentre. Some of these documents have not been updated for months. Is it possible that 1.0.3 just skipped over this issue and carried on replicating, but 1.1.1 does not?

      I know I can hack the replication to work by updating the checkpoint seq past this point in the _local document, but I think there is a real bug here somewhere.
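
      A minimal sketch of that workaround (hostnames follow the later comments; the replication ID here is a made-up placeholder - the real one is the suffix of the _local checkpoint document the replicator writes on both databases):

      # Placeholder replication ID and hosts - substitute the real ones.
      REPL_ID=0a81b645497e6270611ec3419767a584
      SRC=http://host17:5984/master_db
      TGT=http://host32:5984/master_db

      # 1. Fetch the checkpoint document from both ends.
      curl -s "$SRC/_local/$REPL_ID" > checkpoint_src.json
      curl -s "$TGT/_local/$REPL_ID" > checkpoint_tgt.json

      # 2. Edit source_last_seq in each file to a seq just past the choke point,
      #    leave the other fields (including _rev) as returned, then PUT them back.
      curl -X PUT "$SRC/_local/$REPL_ID" -H "Content-Type: application/json" -d @checkpoint_src.json
      curl -X PUT "$TGT/_local/$REPL_ID" -H "Content-Type: application/json" -d @checkpoint_tgt.json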

      If Wireshark/debug data is required, please say so.

      Attachments

      1. replication error changes_loop died redacted.txt
        8 kB
        Alex Markham
      2. COUCHDB-1364-11x.patch
        0.8 kB
        Filipe Manana
      3. do_checkpoint error push.txt
        33 kB
        Alex Markham
      4. couchlog source host17.log
        47 kB
        Alex Markham
      5. checkpoint hang seq changes.txt
        179 kB
        Alex Markham
      6. couchlog target host32.txt
        0.6 kB
        Alex Markham

        Activity

        Alex Markham added a comment -

        We have sidestepped this issue with a replication filter for 3 troublesome documents with many open_revs. Everything seems smooth again now.
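
        A rough sketch of such a filter, with the design document name, filter name and document IDs as made-up placeholders:

        # Hypothetical design doc whose filter skips three problem documents.
        curl -X PUT http://host17:5984/master_db/_design/repl \
             -H "Content-Type: application/json" \
             -d '{"filters": {"skip_bad_docs": "function(doc, req) { return [\"id1\", \"id2\", \"id3\"].indexOf(doc._id) === -1; }"}}'

        # Pull replication through that filter.
        curl -X POST http://host32:5984/_replicate \
             -H "Content-Type: application/json" \
             -d '{"source": "http://host17:5984/master_db", "target": "master_db", "continuous": true, "filter": "repl/skip_bad_docs"}'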

        Alex Markham made changes -
        Attachment couchlog source host17.log [ 12507671 ]
        Attachment checkpoint hang seq changes.txt [ 12507672 ]
        Attachment couchlog target host32.txt [ 12507673 ]
        Alex Markham added a comment -

        OK, perhaps that stack trace was a separate issue. With it patched, push replication just times out with the same error as in do_checkpoint error push.txt.

        I have done some more analysis and found 2 seq ids where replication from host 17 -> 32 hangs. After this point there are no more checkpoints. I managed to get continuous replication working again past it by manually updating the replication _local document on both sources to a source_last_seq id one past this point.

        I attach 3 logs:
        1) checkpoint hang seq changes.txt - the changes feed from the source at the 2 points where replication hangs; both are docs with a very high number of open revisions.
        2) couchlog source host17.log - the log of the couch where you can see the GET requests for those documents being made, and then replication starting again.
        3) couchlog target host32.txt - the log of the couch doing the pull replication (the target); there were no errors here.

        I doubt the network is the issue here; all other databases replicate fine, and this replication stream works once manually tweaked to get past those choke points.

        Is it possible that the database is trying to commit all these hundreds of documents at once and it takes a crazy long time? I have set up a parallel replication to the same host using a different hostname (so that the replication id is different), without the "cancel:true" that I send every minute, and it has hung on that checkpoint (2745054) for the last few hours.
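
        For context, the calls involved look roughly like this (hostnames and options are illustrative, not the exact script):

        # Continuous pull replication into host32; pointing at an alternate hostname
        # for the source changes the replication ID, so this runs in parallel.
        curl -X POST http://host32:5984/_replicate \
             -H "Content-Type: application/json" \
             -d '{"source": "http://host17-alt:5984/master_db", "target": "master_db", "continuous": true}'

        # The original script also cancels its replication periodically before restarting it.
        curl -X POST http://host32:5984/_replicate \
             -H "Content-Type: application/json" \
             -d '{"source": "http://host17:5984/master_db", "target": "master_db", "continuous": true, "cancel": true}'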

        Filipe Manana added a comment -

        Alex, the patch is for the side doing the push replication. It's only meant to fix the following stack trace you pasted:

        [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died with reason
            {noproc, {gen_server, call, [<0.6382.115>, {pread_iolist, 79043596434}, infinity]}}

        Your error with _ensure_full_commit seems to be because that HTTP request failed 10 times; at that point the replication process crashes.
        Before the first retry it waits 0.5 seconds, before the 2nd retry 1 second, and so on (the delay always doubles), so it takes about 8.5 minutes before it crashes.
        Maybe your network is too busy.
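
        A quick check of that figure, treating it as 10 delays that start at 0.5 seconds and double each time (an assumption based on the description above):

        # Sum 10 retry delays starting at 500 ms and doubling.
        total=0; delay=500
        for i in $(seq 1 10); do
          total=$((total + delay))
          delay=$((delay * 2))
        done
        echo "${total} ms"   # 511500 ms, i.e. roughly 8.5 minutes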

        Alex Markham added a comment -

        I tried the patch on both host 25 and 28 and tried pushing again, but it didn't work.
        Is there a possibility that I have 2 distinct issues here?

        More info for 25/28 - I think this is a doc it is hanging on (counting its open revisions on each host):
        for i in 25 28; do ssh au$i hostname; curl "http://host$i:5984/master_db/304bc614b3028de0352ec813088bde2f?open_revs=all" -s | grep _rev | wc -l; done
        host25
        17
        host28
        308

        Alex Markham made changes -
        Attachment do_checkpoint error push.txt [ 12507528 ]
        Alex Markham added a comment - edited

        Hi Filipe - which couch (on which end of the replication) needs to be updated?

        I looked at Wireshark captures of the pull and push replication, from host28 -> host25.
        For the pull, the replication seems to start, fetch the changes list from seq 3905565 and then POST an _ensure_full_commit. There doesn't seem to be a reply to this, so it just ends up hanging. My replication script cancels the ongoing replication and then restarts it every 5 mins, and this seems to take much longer than that.

        POST /master_db/_ensure_full_commit?seq=3914198 HTTP/1.1
        User-Agent: CouchDB/1.1.1
        Accept: application/json
        Accept-Encoding: gzip
        Content-Type: application/json
        Content-Length: 0
        Host: host28:5984

        I also have a different stack trace for what I think is the same problem - "do_checkpoint error push.txt" - where the last Wireshark activity appeared to be a /_ensure_full_commit at 12:12:04, and at 12:12:34 the timeout error appeared and replication failed:

        POST /master_db/_ensure_full_commit HTTP/1.1
        User-Agent: CouchDB/1.1.1
        Accept: application/json
        Accept-Encoding: gzip
        Content-Type: application/json
        Content-Length: 0
        Host: host25:5984

        Edit: corrected the seq number that _changes?since= calls.
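
        One way to see whether the target answers a full-commit request at all outside of the replicator is to issue it by hand and time it (host and database as in the traces above); a healthy node normally replies almost immediately with an {"ok":true} body:

        curl -X POST "http://host25:5984/master_db/_ensure_full_commit" \
             -H "Content-Type: application/json" \
             -w '\nHTTP %{http_code} in %{time_total}s\n'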

        Filipe Manana made changes -
        Attachment COUCHDB-1364-11x.patch [ 12507511 ]
        Filipe Manana added a comment -

        Alex, you might want to try this patch and see if it fixes the push replication case.
        Thanks.
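
        Applying the attached patch to a 1.1.x source tree looks roughly like this; the -p strip level depends on how the diff was generated and the rebuild assumes an already-configured tree, so treat it as a sketch:

        cd apache-couchdb-1.1.1
        patch -p0 --dry-run < COUCHDB-1364-11x.patch   # check it applies cleanly first
        patch -p0 < COUCHDB-1364-11x.patch
        make && make install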

        Alex Markham added a comment -

        Yes - I compacted both sides in the hour preceding this.

        Filipe Manana added a comment -

        Hi Alex.
        For the push replication case, right before the error, was the local source database compacted?

        Alex Markham made changes -
        Attachment replication error changes_loop died redacted.txt [ 12507502 ]
        Alex Markham added a comment -

        Attached a longer stack trace of the /_replicate 500 error.

        Alex Markham created issue -

          People

          • Assignee: Unassigned
          • Reporter: Alex Markham
          • Votes: 0
          • Watchers: 0