CouchDB / COUCHDB-1364

Replication hanging/failing on docs with lots of revisions

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.3, 1.1.1
    • Fix Version/s: None
    • Component/s: Replication
    • Environment:

      Description

      We have a setup where replication from a 1.1.1 couch is hanging. This is WAN replication, which previously worked between two 1.0.3 couches (1.0.3 <-> 1.0.3).

      Replicating from 1.1.1 -> 1.0.3 showed an error very similar to COUCHDB-1340, which I presumed meant the URL was too long. So I upgraded the 1.0.3 couch to our 1.1.1 build, which has this patched.

      However, replication between the two 1.1.1 couches hangs at a certain point when doing continuous pull replication: it doesn't checkpoint and just stays on "starting". When cancelled and restarted, it does get the latest documents (so doc counts are equal). The last calls I see to the source db when it hangs are multiple long GETs for a document with 2051 open revisions on the source and 498 on the target.

      When doing a push replication the _replicate call just gives a 500 error (at about the same seq id as the pull replication hangs at) saying:

      [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died with reason {noproc,
          {gen_server,call,
              [<0.6382.115>,
               {pread_iolist,79043596434},
               infinity]}}

      when the last call in the target of the push replication is:

      [Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST' /master_db/_missing_revs 200

      with no stack trace.
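
      To replay the last request seen on the target by hand, the _missing_revs call can be issued outside the replicator. The following is only a sketch: the database URL, document ids and revision ids passed in are placeholders, and the helper names are hypothetical. It assumes the endpoint accepts a JSON object mapping document ids to lists of revision ids and answers with a "missing_revs" object.

```python
import json
import urllib.request


def build_missing_revs_body(candidates):
    """Serialize {doc_id: [rev, ...]} into a _missing_revs request body."""
    body = {doc_id: list(revs) for doc_id, revs in candidates.items()}
    return json.dumps(body).encode("utf-8")


def parse_missing_revs(response_text):
    """Return {doc_id: [rev, ...]} for revisions the target reports as missing."""
    return json.loads(response_text).get("missing_revs", {})


def fetch_missing_revs(target_db_url, candidates, timeout=30):
    """POST the candidate revisions to <target_db_url>/_missing_revs.

    target_db_url is a placeholder such as "http://host:5984/master_db".
    """
    request = urllib.request.Request(
        target_db_url.rstrip("/") + "/_missing_revs",
        data=build_missing_revs_body(candidates),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_missing_revs(response.read().decode("utf-8"))
```

      Calling fetch_missing_revs with the id and leaf revisions of one of the many-revision documents would show whether the target answers this request on its own or only fails inside the replication.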

      Comparing the open_revs=all count on the documents with many open revs shows differing numbers on each side of the replication WAN and between different couches in the same datacentre. Some of these documents have not been updated for months. Is it possible that 1.0.3 just skipped over this issue and carried on replicating, but 1.1.1 does not?
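
      The open_revs=all comparison can be scripted so the counts from each couch are collected the same way. This is a minimal sketch, assuming the response is requested with "Accept: application/json" and comes back as a JSON array whose entries are either {"ok": ...} or {"missing": ...}. The function names and the URLs passed in are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def count_open_revs(open_revs_payload):
    """Count entries in an open_revs=all JSON array.

    Returns a dict with the number of leaf revisions returned ("ok"),
    how many of those are deleted leaves, and how many are "missing".
    """
    entries = json.loads(open_revs_payload)
    ok_docs = [entry["ok"] for entry in entries if "ok" in entry]
    return {
        "ok": len(ok_docs),
        "deleted": sum(1 for doc in ok_docs if doc.get("_deleted")),
        "missing": sum(1 for entry in entries if "missing" in entry),
    }


def fetch_open_revs_count(db_url, doc_id, timeout=60):
    """GET <db_url>/<doc_id>?open_revs=all and count the result.

    db_url is a placeholder such as "http://host:5984/master_db".
    """
    url = "{}/{}?open_revs=all".format(
        db_url.rstrip("/"), urllib.parse.quote(doc_id, safe="")
    )
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return count_open_revs(response.read().decode("utf-8"))
```

      Running fetch_open_revs_count against the source and the target for the same document id gives the two numbers to compare, such as the 2051 versus 498 seen here.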

      I know I can hack the replication to work by updating the checkpoint seq past this point in the _local document, but I think there is a real bug here somewhere.
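
      For reference, the workaround could look roughly like the sketch below. It is an assumption, not a description of the replicator's internals: it assumes the checkpoint _local document stores the last replicated sequence in a "source_last_seq" field holding an integer, which should be checked against the actual document before anything is written back. The function name is hypothetical, and the sketch only builds the modified document; saving it is left out.

```python
import copy


def advance_checkpoint(checkpoint_doc, new_seq):
    """Return a copy of a checkpoint document with its sequence moved forward.

    Assumes (unverified) that the checkpoint keeps the last replicated
    sequence as an integer in "source_last_seq". Refuses to move the
    checkpoint backwards, since that would only re-replicate old changes.
    """
    current_seq = checkpoint_doc.get("source_last_seq", 0)
    if new_seq <= current_seq:
        raise ValueError(
            "new_seq {} is not past current checkpoint {}".format(new_seq, current_seq)
        )
    updated = copy.deepcopy(checkpoint_doc)
    updated["source_last_seq"] = new_seq
    return updated
```

      Any change skipped this way is never replicated, so this hides the problem document rather than fixing it.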

      If Wireshark or debug data is required, please say so.

      Attachments

      1. replication error changes_loop died redacted.txt (8 kB, Alex Markham)
      2. do_checkpoint error push.txt (33 kB, Alex Markham)
      3. couchlog target host32.txt (0.6 kB, Alex Markham)
      4. couchlog source host17.log (47 kB, Alex Markham)
      5. COUCHDB-1364-11x.patch (0.8 kB, Filipe Manana)
      6. checkpoint hang seq changes.txt (179 kB, Alex Markham)

        Activity

        Alex Markham made changes:
          Attachment added: couchlog source host17.log [ 12507671 ]
          Attachment added: checkpoint hang seq changes.txt [ 12507672 ]
          Attachment added: couchlog target host32.txt [ 12507673 ]
        Alex Markham made changes:
          Attachment added: do_checkpoint error push.txt [ 12507528 ]
        Filipe Manana made changes:
          Attachment added: COUCHDB-1364-11x.patch [ 12507511 ]
        Alex Markham made changes:
          Attachment added: replication error changes_loop died redacted.txt [ 12507502 ]
        Alex Markham created issue

          People

          • Assignee: Unassigned
          • Reporter: Alex Markham
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
              Updated: