Uploaded image for project: 'CouchDB'
  1. CouchDB
  2. COUCHDB-1953

Speed up parsing of multipart/related requests

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0, 1.7.0
    • Component/s: HTTP Interface
    • Labels:
      None

      Description

      Parsing of multipart/related requests searches for the MIME boundary string using the couch_httpd:find_in_binary/2 function, which can be made more efficient.

      When the boundary string is not found in its entirety in the search data, the function should then look to see if the data ends with a prefix of the string, but it currently looks for any prefix of the string almost anywhere in the search data.

      A pull request to fix this will be submitted shortly.

        Issue Links

          Activity

          Hide
          kxepal Alexander Shorin added a comment -

          Any benchmarks to measure current/new values?

          Show
          kxepal Alexander Shorin added a comment - Any benchmarks to measure current/new values?
          Hide
          nicknorth Nick North added a comment -

          I got a factor of four improvement in parsing in simple tests on a Windows installation. It depends on the attachment contents: if your attachment has a lot of "--" instances, then the current algorithm could be dramatically slower as it will keep matching them and repeatedly rescanning parts of the input.

          Show
          nicknorth Nick North added a comment - I got a factor of four improvement in parsing in simple tests on a Windows installation. It depends on the attachment contents: if your attachment has a lot of "--" instances, then the current algorithm could be dramatically slower as it will keep matching them and repeatedly rescanning parts of the input.
          Hide
          nicknorth Nick North added a comment -

          Forgot to link the pull request: it's https://github.com/apache/couchdb/pull/115.

          Further to Alexander's question on how much faster the new code is: for 4KB blocks of data and a 40-character MIME boundary, the existing code will scan almost the whole of each block that does not contain the boundary 40 times, while the new code scans the block once, and then scans the last 40 characters once (plus further potential scans of those last 40 if the first character of the boundary appears there). Furthermore the existing code will record a partial match if any prefix of the boundary occurs almost anywhere in the block, which will cause the scanning process to start up again with the remainder of the block, plus a new 4KB block tacked onto the end - this would become very slow if "-" occurs frequently in an attachment.

          Show
          nicknorth Nick North added a comment - Forgot to link the pull request: it's https://github.com/apache/couchdb/pull/115 . Further to Alexander's question on how much faster the new code is: for 4KB blocks of data and a 40-character MIME boundary, the existing code will scan almost the whole of each block that does not contain the boundary 40 times, while the new code scans the block once, and then scans the last 40 characters once (plus further potential scans of those last 40 if the first character of the boundary appears there). Furthermore the existing code will record a partial match if any prefix of the boundary occurs almost anywhere in the block, which will cause the scanning process to start up again with the remainder of the block, plus a new 4KB block tacked onto the end - this would become very slow if "-" occurs frequently in an attachment.
          Hide
          kxepal Alexander Shorin added a comment -

          Sounds great! So, the very-very worst case for multipart/related parser is an abstract big attachment full of "-" chars? Will try to make some dummy test on linux. Not blocker, just curious(: Thanks!

          Show
          kxepal Alexander Shorin added a comment - Sounds great! So, the very-very worst case for multipart/related parser is an abstract big attachment full of "-" chars? Will try to make some dummy test on linux. Not blocker, just curious(: Thanks!
          Hide
          nicknorth Nick North added a comment -

          Oh hold on - I missed a bit of parse_multipart_request. The boundary string actually begins with "\r\n--" so the worst case attachment is one full of "\r" characters. Such as any Windows text file...

          Show
          nicknorth Nick North added a comment - Oh hold on - I missed a bit of parse_multipart_request. The boundary string actually begins with "\r\n--" so the worst case attachment is one full of "\r" characters. Such as any Windows text file...
          Hide
          nicknorth Nick North added a comment -

          To help with reviewing, I've attached some tests in the file tests1953.erl. The only code change is to replace couch_httpd:find_in_binary/2 so that is what the tests concentrate on. The file is standalone: just compile it and run tests1953:tests() and it should return a list of true atoms (it's a very simple test harness).

          The tests check that finding a pattern in a binary finds the pattern when it should, does not find it when it should not, and gives the correct partial match when the binary ends with a prefix of the pattern (plus all sorts of variants with contained prefixes and the pattern occurring at different points in the binary). The file also contains cases where behaviour differs from the current implementation - in general they will be different when the pattern is not in the binary, but some prefix of the pattern occurs before the end of the binary - the existing code will match the prefix while the new code will not.

          Show
          nicknorth Nick North added a comment - To help with reviewing, I've attached some tests in the file tests1953.erl. The only code change is to replace couch_httpd:find_in_binary/2 so that is what the tests concentrate on. The file is standalone: just compile it and run tests1953:tests() and it should return a list of true atoms (it's a very simple test harness). The tests check that finding a pattern in a binary finds the pattern when it should, does not find it when it should not, and gives the correct partial match when the binary ends with a prefix of the pattern (plus all sorts of variants with contained prefixes and the pattern occurring at different points in the binary). The file also contains cases where behaviour differs from the current implementation - in general they will be different when the pattern is not in the binary, but some prefix of the pattern occurs before the end of the binary - the existing code will match the prefix while the new code will not.
          Hide
          nicknorth Nick North added a comment -

          Continuing with Alexander's benchmark question: I tried a "worst case" attachment - a 750KB Windows text files with lots of "\r\n" content. The new code is about 10 times faster in this case.

          Show
          nicknorth Nick North added a comment - Continuing with Alexander's benchmark question: I tried a "worst case" attachment - a 750KB Windows text files with lots of "\r\n" content. The new code is about 10 times faster in this case.
          Hide
          dch Dave Cottlehuber added a comment -

          +1 LGTM, also I see a +1 from Benoît in the PR.

          I pushed a branch 1953-fix-mime-multipart-parsing to ASF, based on ~ 2 weeks ago so it should merge clean to both master and 1.6.x if required.

          The tests1953.erl should go in, and some doc / release note updates too would be fantastic.

          Show
          dch Dave Cottlehuber added a comment - +1 LGTM, also I see a +1 from Benoît in the PR. I pushed a branch 1953-fix-mime-multipart-parsing to ASF, based on ~ 2 weeks ago so it should merge clean to both master and 1.6.x if required. The tests1953.erl should go in, and some doc / release note updates too would be fantastic.
          Hide
          nicknorth Nick North added a comment -

          Thanks Dave - I'll update release notes and merge to master over the weekend unless there are any objections. I'll leave any decision on 1.6 to others.

          Not quite sure what to go with tests1953.erl - it's standalone module that tests a copy of the code that's changed, as it's not otherwise accessible. That was the best way for me to demonstrate that find_in_binary does what it should, but it doesn't really fit with the testing process for CouchDb.

          Show
          nicknorth Nick North added a comment - Thanks Dave - I'll update release notes and merge to master over the weekend unless there are any objections. I'll leave any decision on 1.6 to others. Not quite sure what to go with tests1953.erl - it's standalone module that tests a copy of the code that's changed, as it's not otherwise accessible. That was the best way for me to demonstrate that find_in_binary does what it should, but it doesn't really fit with the testing process for CouchDb.
          Hide
          paul.joseph.davis Paul Joseph Davis added a comment -

          To include the tests from the attached tests1953.erl I'd move find_in_binary to couch_util.erl and export it. Then just create a file test/etap/040-util-find-in-binary.t and copy in the tests there.

          Show
          paul.joseph.davis Paul Joseph Davis added a comment - To include the tests from the attached tests1953.erl I'd move find_in_binary to couch_util.erl and export it. Then just create a file test/etap/040-util-find-in-binary.t and copy in the tests there.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit a4b1aa37867e83abda88b05d2475f84b3de8df09 in branch refs/heads/1953-fix-mime-multipart-parsing from Nick North
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=a4b1aa3 ]

          Speed up couch_httpd:find_in_binary.

          See https://issues.apache.org/jira/browse/COUCHDB-1953

          Show
          jira-bot ASF subversion and git services added a comment - Commit a4b1aa37867e83abda88b05d2475f84b3de8df09 in branch refs/heads/1953-fix-mime-multipart-parsing from Nick North [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=a4b1aa3 ] Speed up couch_httpd:find_in_binary. See https://issues.apache.org/jira/browse/COUCHDB-1953
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 25cceb0c51f62f032e2933a6445391ebbdd9068b in branch refs/heads/1953-fix-mime-multipart-parsing from Paul Joseph Davis
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=25cceb0 ]

          Move find_in_binary from couch_httpd to couch_util

          Simple move so that we can test it from etap. Seems a bit awkward to
          export from couch_httpd so I moved it to the util module.

          Fixes COUCHDB-1953

          Show
          jira-bot ASF subversion and git services added a comment - Commit 25cceb0c51f62f032e2933a6445391ebbdd9068b in branch refs/heads/1953-fix-mime-multipart-parsing from Paul Joseph Davis [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=25cceb0 ] Move find_in_binary from couch_httpd to couch_util Simple move so that we can test it from etap. Seems a bit awkward to export from couch_httpd so I moved it to the util module. Fixes COUCHDB-1953
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 94b93f761db121e40758898f59c82dc2312ed7f1 in branch refs/heads/1953-fix-mime-multipart-parsing from Paul Joseph Davis
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=94b93f7 ]

          Add tests for the couch_util:find_in_binary/2

          Based off the tests that Nick North wrote for COUCHDB-1953.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 94b93f761db121e40758898f59c82dc2312ed7f1 in branch refs/heads/1953-fix-mime-multipart-parsing from Paul Joseph Davis [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=94b93f7 ] Add tests for the couch_util:find_in_binary/2 Based off the tests that Nick North wrote for COUCHDB-1953 .
          Hide
          paul.joseph.davis Paul Joseph Davis added a comment -

          So when I was debugging the slow replication script in the recent replication thread on user@ I got to a point where I decided to apply this. Ended up with a 20x improvement on the test which brought that slow replication back in line with the non-HTTP versions.

          That motivated me to massage the commit from Nick North so that we have the supplied tests included and everything. I ended up pushing this work to the branch that Dave created:

          https://git-wip-us.apache.org/repos/asf?p=couchdb.git;a=log;h=refs/heads/1953-fix-mime-multipart-parsing

          Definitely +1 to get this in. This is a super awesome find by Nick.

          Show
          paul.joseph.davis Paul Joseph Davis added a comment - So when I was debugging the slow replication script in the recent replication thread on user@ I got to a point where I decided to apply this. Ended up with a 20x improvement on the test which brought that slow replication back in line with the non-HTTP versions. That motivated me to massage the commit from Nick North so that we have the supplied tests included and everything. I ended up pushing this work to the branch that Dave created: https://git-wip-us.apache.org/repos/asf?p=couchdb.git;a=log;h=refs/heads/1953-fix-mime-multipart-parsing Definitely +1 to get this in. This is a super awesome find by Nick.
          Hide
          benoitc Benoit Chesneau added a comment -

          Tested the fix branch , +1 for me. (adding it to my previous +1 on the patch 2 months ago.)

          Show
          benoitc Benoit Chesneau added a comment - Tested the fix branch , +1 for me. (adding it to my previous +1 on the patch 2 months ago.)
          Hide
          nicknorth Nick North added a comment -

          Many thanks everyone. I'll polish off anything else that needs to be done and push to master later today (if no-one has beaten me to it).

          Is it worth including this in 1.6, as it turns out to fix a reported problem? I don't want to disrupt the release process so am happy not to, if it's not appropriate.

          Show
          nicknorth Nick North added a comment - Many thanks everyone. I'll polish off anything else that needs to be done and push to master later today (if no-one has beaten me to it). Is it worth including this in 1.6, as it turns out to fix a reported problem? I don't want to disrupt the release process so am happy not to, if it's not appropriate.
          Hide
          benoitc Benoit Chesneau added a comment -

          I'm +1 to have it included in 1.6.

          If the relation with #COUCHDB-1986 is confirmed It's a must have anyway.

          Show
          benoitc Benoit Chesneau added a comment - I'm +1 to have it included in 1.6. If the relation with # COUCHDB-1986 is confirmed It's a must have anyway.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 0bf823c6e5d6a7054e7cb1979eabc7695c2707e2 in branch refs/heads/master from Nick North
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=0bf823c ]

          Speed up and move couch_httpd:find_in_binary.

          See https://issues.apache.org/jira/browse/COUCHDB-1953

          Show
          jira-bot ASF subversion and git services added a comment - Commit 0bf823c6e5d6a7054e7cb1979eabc7695c2707e2 in branch refs/heads/master from Nick North [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=0bf823c ] Speed up and move couch_httpd:find_in_binary. See https://issues.apache.org/jira/browse/COUCHDB-1953
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 824869c3c059d887da0dbe1cd04eb244c931c27b in branch refs/heads/1994-merge-rcouch from Nick North
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=824869c ]

          Speed up and move couch_httpd:find_in_binary.

          See https://issues.apache.org/jira/browse/COUCHDB-1953

          Show
          jira-bot ASF subversion and git services added a comment - Commit 824869c3c059d887da0dbe1cd04eb244c931c27b in branch refs/heads/1994-merge-rcouch from Nick North [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=824869c ] Speed up and move couch_httpd:find_in_binary. See https://issues.apache.org/jira/browse/COUCHDB-1953
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit ce3e89dc9fef25aa0e038e5a621a1e7c732e66e2 in couchdb's branch refs/heads/1.6.x from Nick North
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=ce3e89d ]

          Speed up and move couch_httpd:find_in_binary.

          See https://issues.apache.org/jira/browse/COUCHDB-1953

          (cherry picked from commit 824869c3c059d887da0dbe1cd04eb244c931c27b)

          Show
          jira-bot ASF subversion and git services added a comment - Commit ce3e89dc9fef25aa0e038e5a621a1e7c732e66e2 in couchdb's branch refs/heads/1.6.x from Nick North [ https://git-wip-us.apache.org/repos/asf?p=couchdb.git;h=ce3e89d ] Speed up and move couch_httpd:find_in_binary. See https://issues.apache.org/jira/browse/COUCHDB-1953 (cherry picked from commit 824869c3c059d887da0dbe1cd04eb244c931c27b)

            People

            • Assignee:
              nicknorth Nick North
              Reporter:
              nicknorth Nick North
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development