CouchDB
COUCHDB-639

Make replication benefit from attachment compression and improve push replication for large attachments

    Details

      Description

      At the moment, for compressed attachments, replication uncompresses and then re-compresses each attachment, wasting CPU time.

      Push replication is also unreliable for very large attachments (500 MB or more, for example). Currently it sends attachments inlined in the corresponding JSON doc. Not only does this require too much RAM, it also wastes CPU time base64-encoding the attachment (plus a decompression step if the attachment is compressed).
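      The base64 overhead mentioned above is easy to quantify. A small Python illustration (not CouchDB code) of the expansion factor:

```python
import base64

# Illustration (not CouchDB code) of the inline-attachment overhead:
# base64 turns every 3 input bytes into 4 output bytes, so an inlined
# attachment occupies roughly 4/3 of its raw size in the JSON body,
# on top of the raw bytes held in memory while encoding.
attachment = b"\x00" * 3000           # stand-in for 3000 bytes of attachment data
encoded = base64.b64encode(attachment)
print(len(attachment), len(encoded))  # 3000 4000 -> a ~33% size increase
```

      By that ratio, a 500 MB attachment becomes roughly 667 MB of base64 text that must be buffered in RAM before the request is even sent.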

      The following patch (rep-att-comp-and-multipart-trunk*.patch) addresses both issues. Docs containing attachments are now streamed to the target remote DB using the multipart doc streaming feature provided by couch_doc.erl, and compressed attachments are not uncompressed and re-compressed during replication.

      JavaScript tests included.

      Previously, replicating a DB containing 2 docs with attachments of 100 MB and 500 MB caused the Erlang VM on my system to consume nearly 1.2 GB of RAM. With the patch applied, it uses about 130 MB.


          Activity

          Filipe Manana added a comment -

          Just eliminated a useless line that was adding "Accept-Encoding: gzip" to the attachment streaming request. This header is set by default in the definition of #http_db in couch_db.hrl.

          Filipe Manana added a comment -

          Anyone looking into this?

          This also fixes the issue of COUCHDB-163, as far as I understand.

          cheers

          Chris Anderson added a comment -

          This patch applies cleanly and the tests are passing. I'm also +1 on the feature (and I sure wouldn't mind committing this before 0.11 is tarballed, as the code changes are extensive enough that they might make backporting fixes to 0.11 a pain later on.)

          However, I'm not 100% sure about _bulk_docs_rep.

          I'm concerned about having a separate endpoint designed for replication (gives the wrong idea to people – that replication is special. Replication is just another HTTP client.)

          I'm also concerned about the implementation (does this copy only new attachments, or does it copy all attachments?). I'd like it if Adam or someone else familiar with the replicator could review this patch (and apply it if you think it is right).

          Filipe Manana added a comment -

          Hi Chris,

          That is in fact the part I don't like: exposing _bulk_doc_rep. I did it because, when using the doc multipart streamer, we can't include other docs in the same HTTP body (at least not as far as I know). So _bulk_docs would no longer be _bulk_docs but _bulk_doc (singular).

          The alternative I see is to add a case clause in _bulk_docs, like:

          case HttpHeaderContentType of
              "multipart/related" ->
                  % do the stuff of _bulk_doc_rep (new_edits is false,
                  % call update_docs with "replicated_changes")
              _Else ->
                  % ....
          end

          This probably looks better?

          It should copy new and old attachments (it doesn't matter whether they're compressed or not). Hmmm, what's suspicious about that?

          cheers

          Filipe Manana added a comment -

          Ok, I'm no longer using a replication-specific URI API. It was definitely a bad idea.

          For multipart docs, I now simply use PUT /somedb/docId?new_edits=false. This is an API that already exists. Dunno why, but previously I was associating new_edits with _bulk_docs only.

          So, for docs without attachments, I upload them to the remote target DB using _bulk_docs, exactly like before. For docs with attachments, I upload them using PUT /somedb/docId?new_edits=false, sending the doc as a multipart stream.

          Simple enough, imho.
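          For readers unfamiliar with the multipart approach, here is a rough Python sketch of the multipart/related body shape such a PUT carries. It is a hand-rolled illustration: the helper name, boundary, and part headers are simplified assumptions, not the exact bytes couch_doc.erl emits, and the real replicator streams attachment data instead of buffering it as done here.

```python
import json

def build_multipart_doc(doc, attachments, boundary="abc123"):
    # Hypothetical helper illustrating the multipart/related layout:
    # the first part is the JSON doc, whose _attachments entries carry
    # "follows": true, followed by each attachment's raw bytes in its
    # own part, and a closing boundary marker.
    doc = dict(doc)
    doc["_attachments"] = {
        name: {"follows": True, "length": len(data)}
        for name, data in attachments.items()
    }
    parts = [b"--" + boundary.encode() + b"\r\n"
             b"Content-Type: application/json\r\n\r\n" + json.dumps(doc).encode()]
    for data in attachments.values():
        parts.append(b"\r\n--" + boundary.encode() + b"\r\n\r\n" + data)
    parts.append(b"\r\n--" + boundary.encode() + b"--")
    return b"".join(parts)

body = build_multipart_doc({"_id": "doc1"}, {"readme.txt": b"hello"})
print(b'"follows": true' in body and b"hello" in body)  # True
```

          The point of the layout is that attachment bytes travel as raw parts after the JSON, so nothing has to be base64-encoded or held inline in the doc body.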

          Filipe Manana added a comment -

          Just a minor update for the case where errors are found when replicating docs.

          Filipe Manana added a comment -

          Just added an Etap test and removed the JS tests section related to this ticket.

          With the Etap test it's more reliable to detect whether the replicated attachments were in fact transferred in compressed form, since with Firefox we can't control the value of the "Accept-Encoding" header.

          @Chris Is it all ok now with this patch?

          @Adam Any feedback?

          Or feedback from anyone else.

          cheers

          sulantha sanjeewa added a comment -

          Can anyone please tell me how to apply this patch? ASAP please. I have version 0.11; it says it can't find the file.

          Filipe Manana added a comment -

          Can't find which file?

          You should do:

          $ cd your_git_repo_path
          $ git apply rep-att-comp-and-multipart-trunk-4.patch

          Filipe Manana added a comment -

          Just tested it with the latest trunk rev and found no problems:

          fdmanana@core2duo:~/git/couchdb$ git log -1
          commit 4cefde131f1992c70f66c527435a715290174423
          Author: Mark Hammond <mhammond@apache.org>
          Date: Fri Feb 26 01:32:21 2010 +0000

          generate .sha file for windows binary; ensure md5/sha use rel paths

          git-svn-id: https://svn.apache.org/repos/asf/couchdb/trunk@916528 13f79535-47bb-0310-9956-ffa450edef68
          fdmanana@core2duo:~/git/couchdb$

          fdmanana@core2duo:~/git/couchdb$ git apply --index --reject ../rep-att-comp-and-multipart-trunk-4.patch
          Checking patch src/couchdb/couch_db.erl...
          Checking patch src/couchdb/couch_doc.erl...
          Checking patch src/couchdb/couch_httpd_db.erl...
          Checking patch src/couchdb/couch_rep_att.erl...
          Checking patch src/couchdb/couch_rep_reader.erl...
          Checking patch src/couchdb/couch_rep_writer.erl...
          Checking patch test/etap/170-replication-attachment-comp.t...
          Checking patch test/etap/Makefile.am...
          Applied patch src/couchdb/couch_db.erl cleanly.
          Applied patch src/couchdb/couch_doc.erl cleanly.
          Applied patch src/couchdb/couch_httpd_db.erl cleanly.
          Applied patch src/couchdb/couch_rep_att.erl cleanly.
          Applied patch src/couchdb/couch_rep_reader.erl cleanly.
          Applied patch src/couchdb/couch_rep_writer.erl cleanly.
          Applied patch test/etap/170-replication-attachment-comp.t cleanly.
          Applied patch test/etap/Makefile.am cleanly.
          fdmanana@core2duo:~/git/couchdb$

          All tests are passing also.

          Filipe Manana added a comment -

          A one-line change: added a missing call to couch_util:url_encode/1 with a doc id as the parameter.

          Filipe Manana added a comment -

          Just keeping the patch up to date with r917608.

          Filipe Manana added a comment -

          Updated the patch according to Adam's review (via IRC):

          1) use lists:partition instead of lists:foldl in couch_rep_writer
          2) rename test case file from 170-* to 113-*

          cheers

          Adam Kocoloski added a comment -

          Great patch, Filipe. Thanks!


            People

            • Assignee: Unassigned
            • Reporter: Filipe Manana