Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: Replication
    • Labels: None

      Description

      As recently posted to the development mailing list:

      http://mail-archives.apache.org/mod_mbox/couchdb-dev/201101.mbox/%3CAANLkTi=NOsVDRUdYHiUGxQCjDXTNGNMOhPm9Bx=rS5ny@mail.gmail.com%3E

      ( shortened URL http://s.apache.org/KsY )

      There's a new replicator implementation. I won't repeat all the details here, since the mail sent to the development list already covers them.

      @Adam, do you think you can give it a review?

      The full diff is at: https://github.com/fdmanana/couchdb/compare/trunk_new_replicator

      I'd be equally happy if others are able and willing to review and test as well.

      Also take note that some pull replications of databases with attachments created by prior CouchDB releases (including 1.0.1, but excluding the soon to be released 1.0.2) may hang forever. This is fixed by applying the patch for COUCHDB-1022 to the source server.

        Activity

        Filipe Manana added a comment -

        Applied to trunk

        Filipe Manana added a comment -

        Randall,

        You're right about that last one. It uses the default headers as part of the input to calculate the replication ID. The User-Agent header, which describes the CouchDB version, would imply replication ID incompatibility between different versions.

        Fixed, https://github.com/fdmanana/couchdb/commit/0685d4100b21406ee54ca8fccfc93e796fd9dd6d

        Randall Leeds added a comment -

        I think it is incompatible, but not for the reason I said.

        get_rep_endpoint converts from #httpdb{} to {remote, Url, Headers} or {remote, Url, Headers, OAuthList}.

        But the latest code you just linked adds default headers:

        https://github.com/fdmanana/couchdb/blob/trunk_new_replicator/src/couchdb/couch_replicator_utils.erl#L206

        which means we get

        {remote, Url, [{"Accept", "application/json"},
                       {"User-Agent", "CouchDB/" ++ couch_server:get_version()}]}

        compare to

        https://github.com/apache/couchdb/blob/1.0.x/src/couchdb/couch_rep.erl#L524

        {remote, Url, []}

        Also: if we use default headers there that include a version-specific User-Agent, it means we guarantee that we break checkpoints on every version upgrade. No?
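The upgrade concern above can be sketched in a few lines. This is a Python stand-in, not CouchDB's actual code: the real replicator hashes Erlang terms, so the `rep_id` helper and the example version strings here are hypothetical, but the effect is the same — any version-bearing header in the hashed endpoint term yields a different replication ID per release.

```python
import hashlib

def rep_id(endpoint_term):
    # Hypothetical stand-in: CouchDB hashes the Erlang endpoint term;
    # repr() of a Python tuple plays that role in this sketch.
    return hashlib.md5(repr(endpoint_term).encode()).hexdigest()

url = "http://example.org/db"

# Old code (couch_rep): no default headers in the endpoint term.
id_old = rep_id(("remote", url, []))

# New code: default headers, including a version-bearing User-Agent.
id_11 = rep_id(("remote", url, [("Accept", "application/json"),
                                ("User-Agent", "CouchDB/1.1.0")]))
id_12 = rep_id(("remote", url, [("Accept", "application/json"),
                                ("User-Agent", "CouchDB/1.2.0")]))

# All three IDs differ, so each upgrade would abandon the previous
# checkpoints and force replications to start from scratch.
assert len({id_old, id_11, id_12}) == 3
```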

        Filipe Manana added a comment -

        First of all, thanks everyone for the reviews and +1 votes. This is really important and a major change to add.

        Randall, I don't think there is any incompatibility. The new replicator still uses tuples like {remote, Url, []} for calculating the replication IDs. Look at:

        https://github.com/fdmanana/couchdb/blob/trunk_new_replicator/src/couchdb/couch_replicator_utils.erl#L122

        and

        https://github.com/fdmanana/couchdb/blob/trunk_new_replicator/src/couchdb/couch_replicator_utils.erl#L168

        get_rep_endpoint converts from #httpdb{} to {remote, Url, OAuthList}.
        Paul Joseph Davis added a comment -

        Do it! http://www.youtube.com/watch?v=lmUZGdi7Ty4
        Randall Leeds added a comment -

        I'm looking at the new_replicator branch.
        I see a problem in couch_replicator_utils.erl that breaks backwards compatibility with existing replication checkpoints.

        parse_rep_db/2 converts a remote endpoint to #httpdb{}.
        In the old code, a bare URL in source or target would result in {remote, Url, []} from get_rep_endpoint/1.
        In the new code, a bare URL becomes #httpdb{url = Url, headers = [], oauth = nil}, and get_rep_endpoint/1 in this case gives {remote, Url, [], nil}.
        This term is part of the md5 sum that identifies the replication and the checkpoint document.

        We can convert everything to the new format, but you should use the code I wrote for couch_rep for finding and migrating old checkpoints.

        This is also a good time to suggest that we extract the meaningful information from UserCtx in the local case instead of using the whole record. Otherwise, changes to the #user_ctx record break checkpoints in a way that is not obvious. I just ran into this problem during an upgrade from 0.10.2 to 1.0.2.
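The compatibility break and the suggested migration can be sketched as follows. Again a Python stand-in with a hypothetical `rep_id` helper and an invented `find_checkpoint` function (CouchDB's real lookup lives in couch_rep's Erlang code); the point is only that the extra oauth slot changes the hash, so the old-format ID must be tried as a fallback.

```python
import hashlib

def rep_id(endpoint_term):
    # Hypothetical stand-in for CouchDB's hashing of the endpoint term.
    return hashlib.md5(repr(endpoint_term).encode()).hexdigest()

url = "http://example.org/db"

old_term = ("remote", url, [])         # old get_rep_endpoint/1 output
new_term = ("remote", url, [], None)   # new output: extra oauth slot (nil)

# The extra element changes the hash, hence the broken checkpoints.
assert rep_id(old_term) != rep_id(new_term)

def find_checkpoint(checkpoints, url):
    # Migration sketch: look up the checkpoint under the new-format ID
    # first, then fall back to the old-format ID, as couch_rep did when
    # its own ID scheme changed.
    for term in (("remote", url, [], None), ("remote", url, [])):
        doc = checkpoints.get(rep_id(term))
        if doc is not None:
            return doc
    return None
```

A real migration would additionally rewrite a checkpoint found under the old ID so that subsequent runs find it under the new one.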

        Rachel Willmer added a comment -

        I'd like to discuss/review the test framework for replication, specifically for master-master scenarios. It might be useful to compare what you currently do w.r.t. test scenarios with our real-world setups, to see whether we can add some extra stress to the existing scenarios or find more edge cases worth testing.

        How about an offline discussion about this? Then we can report back.

        Adam Kocoloski added a comment -

        Hi Filipe, I've done a detailed review of the modifications to existing modules and a somewhat more cursory review of the new replicator modules. We've discussed a couple of small changes and you've implemented them. I think it's time for this code to land in trunk.

        One future improvement we talked briefly about is using immediate commits in the doc copier. This would remove the need to explicitly call _ensure_full_commit on the target and clean up the code a bit. The replicator's design should allow it to automatically overcome any additional per-request latency through concurrency and batch loading.

        Awesome work!

        Adam Kocoloski added a comment -

        Thanks Filipe, will review. Very happy to see this work landing!


          People

          • Assignee:
            Filipe Manana
            Reporter:
            Filipe Manana
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development