CouchDB / COUCHDB-687

Add md5 hash to _attachments properties for documents

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1
    • Component/s: None
    • Labels:
      None
    • Environment:

      CouchDB

    • Skill Level:
      Regular Contributors Level (Easy to Medium)

      Description

      The current attachment information looks like this:

      GET /dbname/docid

      "_attachments": {
      "jquery-1.4.1.min.js":

      { "content_type": "text/javascript" "revpos": 138 "length": 70844 "stub": true }

      }

      If a client wanted to sync local files as attachments with a document, it would not currently be able to do so without keeping a local store of the revpos. If this information included an md5 hash of the attachment, clients could compare it against a hash of the local file to see if they match.

      -Mikeal
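
      A minimal Node.js sketch of the client-side check this would enable. It assumes the stub exposes the attachment's md5 as a hex string under a "digest" field (the field name follows the naming discussed in the comments and is illustrative, not the API as it existed when this ticket was filed); md5hex and needsUpload are hypothetical helpers.

      // Decide whether a local file needs to be re-uploaded as an attachment.
      var crypto = require('crypto');
      var fs = require('fs');

      function md5hex(buffer) {
        // hex md5 of the local file's bytes
        return crypto.createHash('md5').update(buffer).digest('hex');
      }

      // stub is one entry of doc._attachments, e.g.
      // {"content_type": "text/javascript", "revpos": 138, "length": 70844,
      //  "stub": true, "digest": "..."}
      function needsUpload(stub, localPath) {
        var local = md5hex(fs.readFileSync(localPath));
        return stub.digest !== local; // re-upload only when the hashes differ
      }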

      1. md5.patch
        1 kB
        Juuso Väänänen
      2. couchdb-md5-in-attachment-COUCHDB-687.patch
        13 kB
        Juuso Väänänen
      3. couchdb-md5-in-attachment-COUCHDB-687-v2.patch
        16 kB
        Juuso Väänänen
      4. couchdb-md5-in-attachment-COUCHDB-687-v3.patch
        17 kB
        Juuso Väänänen


          Activity

          Robert Newson added a comment -

          This happened in COUCHDB-1173

          Robert Newson added a comment -

          +1 on reporting the md5 we've calculated for this ticket. Showing the other md5, if it's different, is a new ticket (and one I'm not that keen on).

          This strikes me as very wrong:

          "CouchDB needs to be transparent in how it's creating revision identifiers."

          I prefer:

          "CouchDB needs to be completely opaque in how it's creating revision identifiers."

          Filipe Manana added a comment -

          Randall, that seems like a whole new problem that was not in the original scope of this ticket. I would suggest moving it to another ticket. Also, there's an ?att_encoding_info=true parameter which gives information about attachment encoding.

          Let's just stick to the original goal of exposing the md5 digest of an attachment in its identity form.

          Randall Leeds added a comment -

          +1 as well for having the digest_type in the metadata.

          I want to suggest that we also include the encoding/compression in the metadata about an attachment. This information has implications for deterministic revision generation since the md5 of attachments on disk is used in the generation of revisions.

          For purity of API and transparency of revision generation I think it's unacceptable for CouchDB to compute a revision ID based on the digest of an attachment which was compressed server-side. A client needs to expect that an attachment uploaded to identical documents in two different couches using the same encoding should result in the same revision. If a client wants to pre-compress an attachment to upload, and inform couch of the compression using headers, it should be fine for couch to calculate a digest using the compressed version (and use that in the revision generation) as long as the compression/encoding format on which the digest is based is exposed as well.

          tl;dr – CouchDB needs to be transparent in how it's creating revision identifiers. It should NEVER use a digest generated after server-side compression to calculate a revision hash. It MUST calculate the revision from the data as provided by the client. These are the considerations I have approaching this patch.

          Paul Joseph Davis added a comment -

          @Jason

          We need to transfer knowledge of what hash is used, but having it in the name leaves you with stuff like this:

          http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.15

          I'm +1 for Filipe's digest and digest_type fields.

          Filipe Manana added a comment -

          Thanks for bringing this up, Jason.
          It has been on my list to apply for too long, really.
          Probably just changing the name of the new attachment field from "md5" to "digest" and adding a "digest_type": "md5" will be OK.

          Juuso, do you want to update the patch?
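
          For illustration, the stub from the description would then look something like this (the digest value is made up, and the final field layout may differ from this proposal):

          "_attachments": {
              "jquery-1.4.1.min.js": {
                  "content_type": "text/javascript",
                  "revpos": 138,
                  "length": 70844,
                  "stub": true,
                  "digest_type": "md5",
                  "digest": "0123456789abcdef0123456789abcdef"
              }
          }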

          Jason Smith added a comment -

          Recently, I needed to move/rename dozens of attachments in a document. I had hoped to shuffle the _attachments stubs around but that did not work. So I'd like to see lightweight attachment modifications using the stubs. (Currently the stubs can only create and delete.)

          Anyway, having checksums visible in the stubs seems like it would at least make attachment moving possible.

          And Paul, from COUCHDB-1170, it seems like if you reserve the right to change the checksum algorithm, then that's when you should indicate the algorithm used.

          Juuso Väänänen added a comment -
          • Formatting, line length, and unnecessary changes fixed (mostly).
            Some lines may still be 81 chars or so, for better readability. How should the JavaScript be indented, 2 spaces?
          • Added one test for md5 after compaction to compact.js.
          • Changed hex_to_bin/1 to also accept a list as input.
          Paul Joseph Davis added a comment -

          I'm saying that our case clauses are not indented consistently. The question is whether the term that's pattern-matched should be indented a level or not.

          Also, emacs is for weenies.

          Randall Leeds added a comment -

          Paul:
          Do you mean the current formatting is a bit weird? I fight with erlang-mode constantly to get the case style we use.

          Filipe Manana added a comment -

          Yes, I think Jira took the 4 spaces indentation in the example.

          Paul Joseph Davis added a comment -

          For point 4, is it just me or is the formatting a bit weird?

          Also, I wouldn't say that having a case clause on a single line is necessarily bad, as long as it's consistent across all clauses in that case statement.

          Filipe Manana added a comment -

          Once again thanks Juuso.

          My comments:

          1) Except maybe for Windows Notepad, most text editors allow you to define an 80-character right margin - this helps you identify which lines exceed the limit. I still see several in the patch. I know there's existing code which exceeds the 80-character rule, but don't take it as an example to follow;

          2) Don't forget to test that the md5 attribute remains the same after database compaction;

          3) Avoid unnecessary changes to lines, like the following, e.g.:

          - ] ++
          + ]++

          Or the removal of blank lines, like in line 316 of the diff;

          4) Try to make the 'case' indentation style the same as in the surrounding code. E.g.:

          + IdentityMd5 = case HexIdentityMd5 of
          + undefined -> undefined;
          + _ -> couch_util:hex_to_bin(HexIdentityMd5)
          + end,

          Should be:

          + IdentityMd5 = case HexIdentityMd5 of
          + undefined ->
          +     undefined;
          + _ ->
          +     couch_util:hex_to_bin(HexIdentityMd5)
          + end,

          5) As for the hex_to_bin function, it's better to allow the parameter to be a list as well, instead of assuming it's always a binary:

          +hex_to_bin(B) when is_binary(B) ->
          + hex_to_bin(?b2l(B));
          +hex_to_bin(B) when is_list(B) ->
          + hex_to_bin(B, []).
          +
          +hex_to_bin([], Acc) ->
          + ?l2b(lists:reverse(Acc));
          +hex_to_bin([X,Y|T], Acc) ->
          + {ok, [V], []} = io_lib:fread("~16u", [X,Y]),
          + hex_to_bin(T, [V | Acc]).

          Also note that it's not following the 4 spaces indentation convention we follow in CouchDB.

          6) Here, for example:

          + OutIdentityMd5 = case IdentityMd5 of
          + undefined -> couch_util:md5(Bin);
          + _ -> IdentityMd5
          + end,

          I see an inconsistent mix of spaces and tabs. Most editors allow you to automatically replace tabs with N spaces.

          7) Here:

          + OutIdentityMd5 = case {InIdentityMd5, Enc, NewEnc} of
          + {undefined, gzip, _} ->
          + % gzipped attachment without incoming value should not use anything
          + undefined;
          + {_, identity, _} ->
          + % new attachment should use calculated md5, unless attachment is
          + % compressed, updated attachment should use calculated md5 instead of
          + % existing one
          + IdentityMd5;
          + {_, gzip, _} ->
          + % compressed attachment sent through replication api should use
          + % incoming md5 when it's defined
          + InIdentityMd5
          + end,

          Why the 3-element tuple? In all the case clauses you don't care about the third element, so just drop it.

          Everything else seems fine.
          regards

          Juuso Väänänen added a comment -

          Thanks for the comments, here is a patch with some cleanups and improvements.

          Major changes:

          • Simple JavaScript tests added; the etap tests also cover this to some extent.
          • Md5 is calculated from the attachment data when base64 is used to upload the data.
          • md5 is omitted from the JSON output when the value is undefined.
          • Tried to keep line length at 80 chars.
          • Upgrade code not fully tested (no tests included), but updated against git master.

          Some questions regarding functionality:

          • Should md5 be calculated when a gzipped attachment is uploaded without an identity md5?
          Filipe Manana added a comment -

          Hi Juuso,

          Again, thanks for your efforts.

          I haven't tried the patch; however, it looks good in general.
          My comments below.

          1) To accept it, I need tests. Test the md5 property for compressible and non-compressible attachments. Test that after a compaction the md5 attribute is still there and has the same value as before the compaction. Test that after every kind of replication (local-local, local-remote, remote-remote, remote-local), the md5 attribute has the same value on the target. Etc. I would add the tests in attachments.js and possibly replication.js (a sketch of the compaction check follows this list).

          2) I think it's a bad idea to give it the string value "undefined" when we don't have the attachment's identity md5 (why "undefined" btw?). Simply omitting it from the JSON is the way to go (or at the very least assign it the null value).

          3) Here:

          +to_md5_hex(Binary) ->
          + ?l2b(lists:flatten(couch_util:to_hex(Binary))).

          Why the lists:flatten call? Afaik, couch_util:to_hex/1 always returns a string. I would also avoid declaring this function, since it's only used in one place. Simply replace this function call with a direct call to ?l2b(couch_util:to_hex(Bin)).

          4) The hex_to_bin/1 function should probably go to couch_util.

          5) Here:

          + try [{<<"md5">>, to_md5_hex(Att#att.identity_md5)}]
          + catch error:_ -> [{<<"md5">>, <<"undefined">>}] end++

          Why the try catch expression? Afaik couch_util:to_hex/1 never throws an exception, even if the argument is an empty binary.

          6) I asked in your previous submission (other ticket) to keep lines up to 80 characters wide. See http://wiki.apache.org/couchdb/Coding_Standards
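
          A rough sketch of the compaction check asked for in point 1, in the style of the existing JavaScript test suite (CouchDB, db.save, db.open, db.info, CouchDB.request and T are the usual couch.js test helpers; the "digest" field name is illustrative and should match whatever the patch finally exposes):

          // The md5/digest of an attachment stub should survive compaction.
          var db = new CouchDB("test_suite_db", {"X-Couch-Full-Commit": "false"});
          db.deleteDb();
          db.createDb();

          db.save({
            _id: "att_doc",
            _attachments: {
              "foo.txt": {
                content_type: "text/plain",
                data: "aGVsbG8gd29ybGQ="  // base64 for "hello world"
              }
            }
          });

          var before = db.open("att_doc")._attachments["foo.txt"].digest;
          T(before !== undefined);

          CouchDB.request("POST", "/" + db.name + "/_compact",
            {headers: {"Content-Type": "application/json"}});
          while (db.info().compact_running) { /* wait for compaction to finish */ }

          var after = db.open("att_doc")._attachments["foo.txt"].digest;
          T(after === before);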

          Juuso Väänänen added a comment -

          Second try: store the md5 calculated from the attachment when the attachment is added to the file. Replication uses the provided identity md5 if one exists. Compressed attachments are not decompressed to find out what the actual md5 would be. If the md5 is not known for some reason, the text value "unknown" is used in the JSON.

          The upgrade path from older disk formats is most likely not complete, and this patch is at the moment only a request for comments. (Granted for inclusion anyway.)

          Filipe Manana added a comment -

          Hi Juuso,

          thanks for the patch.

          Comments:

          1) Why not use couch_util:to_hex/1?

          2) The md5 field of the #att record is not always what you want. This is because for attachments that were stored in compressed form, the #att.md5 field corresponds to the md5 hash of the attachment after compression. The identity md5 hash is given by couch_stream:close/1, but it's not stored in the #att record (nor anywhere else). It's only used for attachment integrity checking (if the upload request has the "Content-MD5" header).

          Juuso Väänänen added a comment -

          Expose the md5 of an attachment as a hex string in the document's JSON.

          Filipe Manana added a comment -

          That seems useful to me as well. Therefore I vote +1 on it.

          If nobody is already working on a patch for this, I volunteer to do it. Most of the code that needs to be touched can be found in the patch for ticket CouchDB-583.

          cheers


            People

            • Assignee: Filipe Manana
            • Reporter: mikeal
            • Votes: 3
            • Watchers: 2
