CouchDB / COUCHDB-558

Validate Content-MD5 request headers on uploads

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Labels: None

      Description

      We could detect in-flight data corruption if a client sends a Content-MD5 header along with the data and Couch validates the MD5 on arrival.

      RFC1864 - The Content-MD5 Header Field
      http://www.faqs.org/rfcs/rfc1864.html
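      Per RFC 1864, the header value is the base64 encoding of the raw 128-bit MD5 digest of the payload (not the hex string). A minimal Python sketch of the computation a client would perform:

```python
import base64
import hashlib

def content_md5(body: bytes) -> str:
    """Compute a Content-MD5 value per RFC 1864:
    base64 of the raw 16-byte MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

# The client sends this value in the Content-MD5 header alongside the body.
print(content_md5(b"hello world"))  # XrY7u+Ae7tCTyyK7j1rNww==
```

      The server then recomputes the digest over the bytes it actually received and rejects the request on mismatch.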

      1. jira-couchdb-558-10th-try-trunk.patch
        14 kB
        Filipe Manana
      2. jira-couchdb-558-9th-try-trunk.patch
        14 kB
        Filipe Manana
      3. jira-couchdb-558-8th-try-trunk.patch
        15 kB
        Filipe Manana
      4. jira-couchdb-558-7th-try-trunk.patch
        14 kB
        Filipe Manana
      5. jira-couchdb-558-6th-try-trunk.patch
        17 kB
        Filipe Manana
      6. jira-couchdb-558-5th-try.patch
        17 kB
        Filipe Manana
      7. jira-couchdb-558-4th-try.patch
        15 kB
        Filipe Manana
      8. run.tpl.patch
        0.3 kB
        Filipe Manana
      9. jira-couchdb-558-for-trunk-3rd-try.patch
        10 kB
        Filipe Manana
      10. jira-couchdb-558-for-trunk-2nd-try.patch
        10 kB
        Filipe Manana
      11. jira-couchdb-558-for-trunk.patch
        10 kB
        Filipe Manana

        Activity

        Robert Newson added a comment -

        This would be very useful, but I think there will be pushback for choosing MD5 (yes, it's not broken for this usage, but try educating PHBs). A database-creation-time configuration option for other digest algorithms would be cool, though I appreciate Erlang is limited in its choice (the only other choice is sha1, iirc?)

        Adam Kocoloski added a comment -

        Bah, what a shame. I think we ought to go ahead and implement Content-MD5 anyway – we can always add other algorithms in addition to it. And I believe you're right, sha1 is the only other choice in the crypto module.

        Robert Newson added a comment -

        MD5 is fine for guarding against transmission errors and checking against corruption. That is, the current uses are just fine, imo. I do expect to have to convince my PHB that using MD5 is fine (we currently use SHA256 for essentially the same thing), but that's my sorrow, not yours.

        IIRC, S3 uses this header (and therefore MD5) for the same purpose.

        Damien Katz added a comment -

        Robert is correct, MD5 is fine for validating integrity, and as far as I know, it's the only hash function that's standardized in HTTP.

        For fully secure, unspoofable transmission, SSL is the way to go anyway.

        Filipe Manana added a comment -

        Hello,

        I would like to have some feedback about this patch. I am learning Erlang, as well as CouchDB's implementation, so this is a good exercise for me and, hopefully, a good contribution to CouchDB.

        JavaScript test case is included.

        A suggestion:

        Don't you agree it would also be interesting to have a config option for CouchDB to include MD5 digests in GET replies? This way the client side could validate the integrity of the messages received from CouchDB.
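        A sketch of that suggestion from the client's side (Python; the function name and call site are illustrative, not CouchDB API):

```python
import base64
import hashlib

def verify_content_md5(body: bytes, header_value: str) -> bool:
    """Return True if the received body matches the Content-MD5
    header value (base64 of the raw MD5 digest, per RFC 1864)."""
    expected = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
    return expected == header_value.strip()

# Hypothetical usage after a GET:
#   if not verify_content_md5(resp_body, resp_headers["Content-MD5"]):
#       raise IOError("response corrupted in transit")
```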

        best regards,
        Filipe Manana

        Filipe Manana added a comment -

        Removed some debug log statements from the first patch as well as minor garbage used for my own debugging.

        cheers

        Paul Joseph Davis added a comment -

        Filipe,

        Overall it looks pretty good, but a couple points:

        • We should probably be checking the Content-MD5 before we process the request.
        • check_integrity should probably throw an error or return the body
        • You should still be recording stats even when validation fails
        • There are a lot of variable assignments that aren't necessary.
        • Keep lines under 80 characters.
        • The Perl in the JS test is ungood. Either paste it here, or better, use an Erlang test to check.
        • really_long_function_names_are_hard_to_read - The functions for trailers could be made more generic.
        • The check for Content-MD5 appears to be case-sensitive.
        • get_has_from_trailer is weird... oh, it's an rstrip. There's probably a better way to do that. Try referencing how Mochiweb parses its headers.

        I think the idea is pretty solid. The only thing I'm a bit concerned about is the trailer parsing. The current bits are a bit awkward. In a perfect world I'd prefer to see that as a patch to mochiweb, but having it in CouchDB is fine if they rejected that patch, or during the time it takes to get into upstream.

        Also, if you take the approach of having your hash matching function just throw an error that will get caught by the try/catch around the HandleReq() call, it should simplify a whole lot of this.

        If none of that makes any sense let me know and I'll refactor the patch locally to try and explain things more concretely.

        Filipe Manana added a comment -

        Hi Paul,

        I do agree.

        Regarding the first bullet point, "We should probably be checking the Content-MD5 before we process the request":

        Isn't this the case? I mean, HandlerFun(Req) is invoked only if check_integrity() returns {ok, _}. Also, if the check is successful, check_integrity returns a new #httpd record with the req_body field set, therefore any subsequent calls to body() will use that new #httpd record.

        I'll change my code to follow your suggestions.
        I'll submit another patch in a few days.

        Thanks for the feedback

        Filipe Manana added a comment -

        Hi again Paul,

        I already improved on some of the points you made:

        • check_integrity should probably throw an error or return the body -> DONE
        • You should still be recording stats even when validation fails -> DONE
        • There are a lot of variable assignments that aren't necessary -> DONE
        • keep lines less than 80 characters -> DONE
        • really_long_function_names_are_hard_to_read - The functions for trailers could be made more generic. -> DONE
        • The check for Content-MD5 appears to be case sensitive -> DONE
        • ...having your hash matching function just throw an error that will get caught by the try/catch around the HandleReq() call -> DONE
        • "the only thing I'm a bit concerned about is the trailer parsing. The current bits are a bit awkward. In a perfect world I'd prefer to see that as a patch to mochiweb, but having it in CouchDB is fine if they rejected that patch or during the time it takes to get into upstream."

        Well, the read_length(0) function from mochiweb_request.erl is awkward, as it gives us the trailer as a list of binaries. In mochiweb_headers.erl, we can create a Mochiweb headers structure only from [{key(), value()}] lists.
        Therefore, in this patch, I added this little function to couch_httpd.erl:

        %% @spec to_mochiweb_headers([binary()]) -> headers()
        %%
        %% Transforms the given binary list into a Mochiweb
        %% headers structure. Each binary is a raw HTTP header
        %% line (e.g. <<"Content-Length: 345\r\n">>).
        %%
        to_mochiweb_headers(BinaryList) ->
            {ok, R} = re:compile("^(.*?):\s+(.*?)\r\n$"),
            F = fun(Bin, Acc) ->
                {match, [_, H, V]} = re:run(Bin, R, [{capture, all, list}]),
                [{H, V} | Acc]
            end,
            mochiweb_headers:make(lists:foldr(F, [], BinaryList)).

        Then I can get values from it like a normal header: mochiweb_headers:get_value("Content-MD5", Trailer).
        This lookup is case insensitive.
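        For comparison, the same transformation sketched in Python (an illustrative helper, not part of any library): split each raw header line once on the colon and lower-case the name so lookups are case insensitive:

```python
def parse_trailer_lines(lines):
    """Turn raw header lines such as 'Content-MD5: abc' (with a
    trailing CRLF) into a dict keyed by lower-cased header name."""
    headers = {}
    for line in lines:
        name, _, value = line.rstrip("\r\n").partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers

trailer = parse_trailer_lines(["Content-MD5: XrY7u+Ae7tCTyyK7j1rNww==\r\n"])
assert trailer["content-md5"] == "XrY7u+Ae7tCTyyK7j1rNww=="
```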

        I would like to know your point of view on:

        1) I think I'll submit a patch to Mochiweb, adding that little function to mochiweb_util.erl or mochiweb_headers.erl.

        2) Look at the self-explanatory comment of the update_req/2 function that I added. What do you think?

        I will write the Erlang test suite soon.

        thanks

        Best regards,
        Filipe Manana

        Paul Joseph Davis added a comment -

        Filipe,

        This is looking much better. I dunno what I was thinking about the HandleReq earlier.

        I definitely like the new approach to handling the trailers and so on. Though, this feels like it should be part of mochiweb_request.erl, and this code would just call a single function that would automatically do the trailer detection and parsing. Ideally it'd be mostly transparent to something like Req:body() as well.

        The to_mochiweb_headers should probably use an existing mochiweb function for header parsing so that we can blame someone when it's not right. I'm pretty sure that regexp would break with whitespace left of the colon, and it leaves whitespace after the value.

        The update_req function looks pretty good at first pass. One of the benefits of pushing this upstream is that they'll review it as well to make sure we're not missing anything.

        Hmmm. I just thought, if an attachment request comes in with a content-md5 header, this patch is going to try and buffer that in RAM. That's obviously not good. To overcome that you'll need to push almost all of that logic into mochiweb_request.erl to do this properly. This means delaying md5 on chunked transfers until the trailers are read. Then as soon as you can check the md5, just throw that same exception.
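        The streaming concern can be sketched as follows: hash each chunk as it is consumed and compare against the trailer's digest only at the end, so the body is never buffered in RAM (a Python sketch with illustrative names, not the CouchDB/Mochiweb API):

```python
import base64
import hashlib

def stream_with_md5(chunks, sink, trailer_md5):
    """Pass each chunk to `sink` (e.g. a disk write) while hashing
    incrementally; raise once the stream is exhausted if the
    RFC 1864 digest from the trailer does not match."""
    h = hashlib.md5()
    for chunk in chunks:
        h.update(chunk)
        sink(chunk)
    if base64.b64encode(h.digest()).decode("ascii") != trailer_md5:
        raise ValueError("content_md5_mismatch")
```

        On mismatch, the raised exception plays the same role as the thrown Erlang error that the try/catch around the HandleReq() call would turn into an HTTP error response.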

        Paul Joseph Davis added a comment -

        Oh my bad, I haven't had my required intake of chemicals to think clearly, but I don't think this is going to be completely doable directly in mochiweb_request.erl. Which means that this might have to spider into the individual handler functions.

        Basically, I can't think of a good place where we differentiate between attachment requests that are streamed to disk and requests that are buffered. Preferably this patch would just work regardless, but I'm having a hard time coming up with a method that wouldn't require putting guards on each handler function, which is obviously less than ideal.

        Filipe Manana added a comment -

        Thanks again for your feedback and point of view Paul,

        regarding the remark:

        "The to_mochiweb_headers should probably use an existing mochiweb function for header parsing so that we can blame someone when it's not right. I'm pretty sure that regexp would break with whitespace left of the colon, and it leaves whitespace after the value."

        Well, Mochiweb obtains a header's name and value directly from a call to gen_tcp:recv() on a socket configured with inet:setopts(Socket, [{packet, http}]) (mochiweb_http.erl).

        Example in mochiweb_http.erl:

        case gen_tcp:recv(Socket, 0, ?IDLE_TIMEOUT) of
            {ok, {http_header, _, Name, _, Value}} ->
                headers(Socket, Request, [{Name, Value} | Headers], Body,
                        1 + HeaderCount);

        I couldn't find any function in mochiweb for parsing a header's name and value from a raw HTTP header line string/binary.

        Meanwhile, I'm improving my regex to something like: "^\s*(.*?)\s*:\s*(.*?)\s*\r\n$"
        I haven't looked yet into the source code of gen_tcp:recv(). Maybe there's a regex there I can take.

        Yes, this RAM buffering might be problematic. With the statement "This means delaying md5 on chunked transfers until the trailers are read", what do you mean exactly? That instead of buffering in couch_httpd, we do the buffering in Mochiweb's code? If so, is it only for design reasons, or do you have something else in mind?

        Also, while I am looking into how to write the Erlang tests with ETAP, I found a small issue with test/etap/run.tpl.
        The attached patch fixes the problem.

        cheers

        Paul Joseph Davis added a comment -

        Filipe,

        I see what you mean about the gen_tcp:recv() call. Though there are functions in mochiweb_multipart.erl that'll parse the headers. They might need a bit of generalization, though. The most important part is the call to mochiweb_util:parse_header(Value). Although, looking at it, I don't see how they handle multi-line values, which is most odd. I'd have to go back to the MIME RFCs to figure out what's what on that end.

        The comment on delaying the check was aimed at the attachment API, which streams data from the socket to disk. This patch will have to figure out how to make that streaming check the MD5 in a trailer without buffering requests. I'm not sure if there's a good way to sneak this into the mochiweb_request code, or if there's a way to sneak it into the CouchDB code for reading bodies, or what. Ideally, it wouldn't require changing the attachment handling code, because that'd mean we're probably doing it wrong. Ideally it should be part of the request interface, so that we don't have to sprinkle MD5 validation throughout all request handlers.

        That ETAP error is because the script isn't marked as executable. I'm pretty sure that won't pass through an SVN diff, so I'll make that change. The -v flag is generally why you want to run a single test at a time.

        Filipe Manana added a comment -

        Hum,

        The mochiweb_multipart:parse_headers function calls mochiweb_util:parse_header, which expects header values of the form "X; Y=Z", as far as I understood from the source and the test method:

        test_parse_header() ->
            {"multipart/form-data", [{"boundary", "AaB03x"}]} =
                parse_header("multipart/form-data; boundary=AaB03x"),
            ok.

        I've just discovered this now: http://www.erlang.org/doc/man/erlang.html#decode_packet-3
        Maybe if we pass the full trailer binary, it will be able to decode it as an http header. To be tested.

        Regarding the integrity checks of chunked requests, I just had an idea (though complicated, with poor performance, and incomplete):

        1) In the ChunksFun, as soon as the amount of data (sum of the length of the chunks received so far) reaches a certain value X, we start putting the chunks in a temporary file. The name of the file is put in the current state #httpd{} record.

        2) After receiving the whole request, compute the MD5 digest and compare it to the given digest. If they do not match, remove the tmp file and the file name entry in #httpd{}.

        3) The update_req/2 function will no longer replace the chunked http request with a non-chunked http request.

        4) Modify the recv_chunked function (couch_httpd.erl) to check if the given #httpd record has a tmp file name in it. If so, it will read each chunk from the file and pass it to the given ChunkFun callback.

        Major obvious problems:

        1) too complicated solution
        2) poor disk performance if we have large requests and many in parallel
        3) after crashes, we risk having useless tmp files lying around on disk
        4) to compute the MD5 digest, we still need to read the whole content ("unchunked") into memory. Do you know of any "incremental" MD5 digest implementation? I've never heard of one.

        Have you come up with any idea of this sort?

        It might be interesting to check how HTTP servers like Apache deal with this situation (whether they handle it, or just buffer all the chunks in memory).

        cheers

        Paul Joseph Davis added a comment -

        Filipe,

        decode_packet looks like it's what we want. But I'd have to play with it some to see for certain.

        Also, sorry not to mention this earlier, but attachments already calculate their MD5 as they're streamed to disk. Every MD5 implementation I've ever seen has an incremental interface with a one-call shortcut. Hashing algorithms like these are designed precisely to avoid holding an entire copy of the dataset in RAM.

        The temp file solution is far too complex, but we already have access to a computed MD5 so it shouldn't be necessary. All that should be necessary is adding the code to parse out footers when they're reached.
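        Paul's point is easy to confirm: incremental hashing over chunks yields exactly the same digest as hashing the whole payload at once (in Erlang this is the erlang:md5_init/md5_update/md5_final family; shown below in Python for brevity):

```python
import hashlib

chunks = [b"chunk-1", b"chunk-2", b"chunk-3"]

# One-shot digest over the whole payload...
one_shot = hashlib.md5(b"".join(chunks)).digest()

# ...equals the digest accumulated chunk by chunk,
# so no full copy of the body is ever needed in RAM.
h = hashlib.md5()
for chunk in chunks:
    h.update(chunk)

assert h.digest() == one_shot
```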

        Filipe Manana added a comment -

        Definitely, erlang:decode_packet/3 works fine, so now the trailer function is:

        %% @spec to_mochiweb_headers([binary()]) -> headers()
        %%
        %% Transforms the given binary list into a Mochiweb
        %% headers structure. Each binary is a raw HTTP header
        %% line (e.g. <<"Content-Length: 345\r\n">>).
        %%
        to_mochiweb_headers(BinaryList) ->
            F = fun(Bin, Acc) ->
                {ok, {http_header, _, H, _, V}, _} =
                    erlang:decode_packet(httph, list_to_binary([Bin, "\r\n"]), []),
                [{H, V} | Acc]
            end,
            mochiweb_headers:make(lists:foldr(F, [], BinaryList)).

        Works fine as long as mochiweb returns the trailer as a list of binaries having the format <<"Header: value\r\n">>.

        Hum, I was not aware that the attachment handling code already calculates MD5 digests, nor that it was possible to calculate MD5 digests incrementally.

        I am looking into couch_db.erl, the flush_att() function, where the trailer (named footer there) is ignored. I'll take some time to get a good understanding of how doc attachments work. This is completely unfamiliar territory for me.

        I see that whenever a chunk is received, it is immediately written to an output stream and an incremental MD5 is calculated. Is this MD5 currently used for anything?
        When receiving the trailer, if it contains a Content-MD5 header and the value doesn't match the one calculated by CouchDB, what would be the action to take? Just report an error? Or something more radical, like deleting the attachment or marking it as corrupted?

        Thanks for the tip.

        best regards

        Paul Joseph Davis added a comment -

        That header parsing makes me feel a lot better. If decode_packet can take a whole binary and return a list of headers, we might want to avoid the header function in mochiweb_multipart.erl, as I don't think it was accounting for multi-line headers, which I'm pretty sure are valid.

        As to attachment handling: basically, if you get to the trailer and the MD5s don't match, just throw an error and let the normal error handling code take over. Assuming that code works out all right, I think we could end up just patching mochiweb_request.erl to check MD5s when reading the body, to protect all other handlers.
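For reference, a chunked upload carries Content-MD5 as a trailer header after the terminating zero-length chunk. The Python sketch below (hypothetical helper names, not CouchDB or mochiweb code) builds such a wire body and verifies the trailer digest the way the patch intends: compare the declared digest against one computed over the received data, per RFC 1864 (base64 of the binary MD5 digest).

```python
import base64
import hashlib

def encode_chunked_with_trailer(payload: bytes) -> bytes:
    """Build a chunked transfer-encoded body whose trailer carries
    Content-MD5 (RFC 1864: base64 of the binary MD5 digest)."""
    digest = base64.b64encode(hashlib.md5(payload).digest())
    return (
        b"%x\r\n" % len(payload) + payload + b"\r\n"  # one data chunk
        + b"0\r\n"                                     # last-chunk marker
        + b"Content-MD5: " + digest + b"\r\n"          # trailer header
        + b"\r\n"
    )

def verify_trailer_md5(payload: bytes, trailer_value: bytes) -> bool:
    """Compare the received trailer digest against one computed locally;
    a mismatch is the 'throw an error' case described above."""
    return base64.b64encode(hashlib.md5(payload).digest()) == trailer_value

body = b"attachment data"
wire = encode_chunked_with_trailer(body)
trailer = wire.split(b"Content-MD5: ")[1].split(b"\r\n")[0]
assert verify_trailer_md5(body, trailer)
```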

        Filipe Manana added a comment -

        I even prefer:

        %% @spec to_mochiweb_headers(binary()) -> headers()
        %%
        %% Transforms the given binary, which represents a full raw
        %% HTTP header (ending with double CRLF), into a mochiweb headers
        %% structure.
        %%
        to_mochiweb_headers(RawHttpHeader) ->
            to_mochiweb_headers(RawHttpHeader, []).

        to_mochiweb_headers(RawHttpHeader, Acc) ->
            {ok, {http_header, _, H, _, V}, Rest} =
                erlang:decode_packet(httph, RawHttpHeader, []),
            case Rest of
            <<"\r\n">> ->
                mochiweb_headers:make([{H, V} | Acc]);
            R ->
                to_mochiweb_headers(R, [{H, V} | Acc])
            end.

        This way we just pass an entire raw header, e.g. to_mochiweb_headers( ?l2b( [TrailerFromMochiweb, "\r\n"] ) ).
        In this case, it should go to mochiweb_header.erl or mochiweb_util.erl.

        Everything seems clearer now.

        Does CouchDB have custom code in the mochiweb lib found in its source repository? Or is it just the plain, standard mochiweb lib duplicated from Google Code?

        Filipe Manana added a comment -

        I've just submitted a patch for mochiweb: http://mochiweb.googlecode.com/issues/attachment?aid=-3623435624803196521&name=useful_functions.patch

        cheers
        Filipe Manana added a comment -

        Hello,

        Paul, I took some time to go further from couch_httpd/couch_httpd_db and explore couch_db and couch_stream as well.
        This patch takes a completely different approach from the previous ones: it checks only the integrity of incoming attachment upload requests, reading the Content-MD5 field either from the HTTP header or from the trailer when the request uses chunked transfer encoding.

        I also added an Etap test case.

        Can you provide some feedback on it?

        cheers

        Filipe Manana added a comment -

        This 5th patch no longer breaks the 050-stream.t test case. The break was caused by base64:decode/1 calls on non-base64-encoded data (resulting in an exception).

        cheers

        Filipe Manana added a comment -

        Relative to the 4th/5th patch:

        + simplified a few statements
        + formatted case statements to have the same indentation style as the rest of the CouchDB code

        Filipe Manana added a comment -

        I was making an error in the previous patches: doing base64 decoding of the received chunks in couch_stream.

        As RFC 2616 (http://tools.ietf.org/html/rfc2616#section-14.15) says, the MD5 digest is calculated after any content encoding is applied to the body. In Couch's case, the content encoding is base64 for attachments, which is implicit when the "Content-Encoding: base64" header is missing from the attachment upload request.
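To make the RFC 2616 / RFC 1864 rule concrete: Content-MD5 is the base64 encoding of the binary MD5 digest of the entity body as it actually goes on the wire, i.e. computed after any content coding has been applied. An illustrative Python sketch (not CouchDB code):

```python
import base64
import hashlib

raw = b"attachment bytes"
# If a content coding such as base64 is applied, Content-MD5 must be
# computed over the *encoded* body that is actually transmitted,
# not over the original raw bytes.
encoded_body = base64.b64encode(raw)

content_md5 = base64.b64encode(hashlib.md5(encoded_body).digest()).decode()
```

Note that hashing the raw bytes instead would yield a different header value, which is exactly the mistake described above.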

        Therefore, this patch no longer modifies couch_stream.erl and is more minimalistic now.

        Feedback is welcome.

        cheers

        Paul Joseph Davis added a comment -

        Filipe,

        Apologies for being quiet the last couple days. I was on a trip down to NYC.

        The patch is definitely starting to look much better, but there are a couple things. When using the standalone attachment API like you are in the tests, the data isn't supposed to be base64 encoded. The base64 encoding only applies when including attachment data directly in the JSON documents.

        I think this is getting much closer to being good to go. Can you check and see if there's more to the diff though? I'm not seeing a definition of the check_md5 function. Once I have that I'll apply and check it out locally, then it should be pretty quick to get in.

        Thanks for all your work on this ticket.

        Filipe Manana added a comment -

        Hello Paul,

        no problem.

        I thought attachments were always base64 encoded. Never mind, the test case is now corrected. I have to start reading the online book.

        The check_md5/2 function was not written by me. It's in couch_db.erl, lines 685-687. It never threw an exception before, since the md5 field of an #att{} record was always the empty binary <<>>. I suppose this field and function were added with MD5 integrity checking in mind for the future.
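A minimal sketch of what a check_md5-style comparison might look like, written here as a Python analogue (the actual Erlang function lives in couch_db.erl, as noted above; the behaviour of accepting an empty declared digest mirrors the <<>> default described there):

```python
import hashlib

def check_md5(sig: bytes, computed: bytes) -> None:
    """Raise if a declared MD5 digest does not match the computed one.
    An empty declared digest (no Content-MD5 sent) is accepted, mirroring
    the described default where the att record's md5 field was <<>>."""
    if sig and sig != computed:
        raise ValueError("data_corruption: MD5 mismatch")

body = b"some data"
check_md5(b"", hashlib.md5(body).digest())            # no digest sent: ok
check_md5(hashlib.md5(body).digest(),
          hashlib.md5(body).digest())                 # digests match: ok
```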

        I also simplified the mochiweb_headers.erl header parsing function and added more tests to its test/0 function. I submitted a patch with this addition to the mochiweb project, but so far I have had no answer ( http://code.google.com/p/mochiweb/issues/detail?id=48 ).

        I do have a question. In couch_util I noticed the declaration of some custom base64 encoding/decoding functions. Why do they exist? Is there any problem with Erlang's base64 module?

        cheers

        Filipe Manana added a comment -

        In the previous patch I forgot to remove the unnecessary call to MochiReq:get_header_value/1; I noticed afterwards that we have couch_httpd:header_value/2, which avoids changing one line.

        Filipe Manana added a comment -

        Any news on this?

        cheers

        Paul Joseph Davis added a comment -

        Filipe,

        I've put it on the todo list for this weekend to apply and test and probably commit if everything passes.

        Paul

        Filipe Manana added a comment -

        Nothing new in this new patch.

        The only modification is the use of couch_util:decode_json/1 instead of mochijson2:decode/1 in the test case.

        Just did a git svn rebase today and was surprised to find that ?JSON_DECODE is now a shortcut for the new couch_util:decode_json/1 and no longer a shortcut for mochijson2:decode/1, which returns a slightly differently shaped tuple. This was causing the test case to fail.

        cheers

        Adam Kocoloski added a comment -

        Thanks for keeping this patch up-to-date, Filipe. I changed ?JSON_DECODE because some folks prefer to use CouchDB with their own versions of MochiWeb instead of the one bundled with our source. We had been using a forked MochiWeb with a different Erlang JSON decoding, but now those differences are bundled into couch_util and the MochiWeb we ship is functionally identical to the stock one from their repository.

        Filipe Manana added a comment -

        Hi Adam,

        I see. Thanks for the info.
        cheers

        Filipe Manana added a comment -

        Any news on this?

        cheers

        Paul Joseph Davis added a comment -

        Applied as of 891077

        Paul Joseph Davis added a comment -

        Just reminding myself that the mochiweb patch that this introduced has been submitted upstream:

        http://code.google.com/p/mochiweb/issues/detail?id=48

        Filipe Manana added a comment -

        The mochiweb patch was merged on 2 January 2010, in revision 114, without any difference from Couch's version:

        http://code.google.com/p/mochiweb/source/diff?spec=svn114&r=114&format=side&path=/trunk/src/mochiweb_headers.erl

        Paul Joseph Davis added a comment -

        Awesome. Thanks for following up on that.


          People

          • Assignee:
            Paul Joseph Davis
            Reporter:
            Adam Kocoloski