  CouchDB / COUCHDB-3061

Decrease search time for deeply buried headers

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Database Core
    • Labels: None

      Description

      When a db compaction is interrupted and restarted, it's possible for the most recent header written to the .compact.meta file to be buried many GB before the end of the file. For example, an interrupted compaction left the following files:

      -rw-r--r-- 1 dbcore dbcore 47G Jun 27 04:29 /srv/db/shards/40000000-7fffffff/opendi/yellow-ng-loadbalancer.1461156255.couch.compact.data
      -rw-r--r-- 1 dbcore dbcore 12G Jun 27 05:20 /srv/db/shards/40000000-7fffffff/opendi/yellow-ng-loadbalancer.1461156255.couch.compact.meta
      

      but

      # grep -abo db_header yellow-ng-loadbalancer.1461156255.couch.compact.meta | tail -1
      4364894251:db_header
      

      which means the current algorithm must scan backward through about 8GB of the file (the 12GB file size minus the header offset of roughly 4GB) before it reaches a header. I measured how long that takes, on both an SSD laptop and an SSD server, and got similar numbers:

      (dbcore@db3.testy012.cloudant.net)18> timer:tc(couch_file,test_find_header,["/srv/jdoane/yellow-ng-loadbalancer.1461156255.couch.compact.meta", default]).
      {328335338, ...
      (node1@127.0.0.1)27> timer:tc(couch_file,test_find_header,["/Users/jay/proj/ibm/sample-data/yellow-ng-loadbalancer.1461156255.couch.compact.meta", default]).
      {426650530, ...
      

      which is between 328 and 427 seconds, or an effective scan rate of roughly 19-25 MB/s.

      One reason for this relative slowness is that the current algorithm performs a separate disk read for every block it searches:
      https://github.com/apache/couchdb-couch/blob/master/src/couch_file.erl#L537-L539

      We can improve the speed by loading a "chunk" of many blocks into memory with a single read operation and then searching each block in memory, trading memory for speed. Ideally, the tradeoff would be configurable, so that the existing speed/memory behavior can be retained by default.
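
      As a minimal sketch of this chunked approach (a hypothetical module written for illustration, not the current couch_file code nor the proposed patch), assuming 4096-byte blocks, a raw binary file descriptor, and couch_file's convention that a block which starts a header has 1 as its first byte:

      -module(chunked_header_search).
      -export([find_header/2]).

      -define(BLOCK_SIZE, 4096).
      -define(MAX_CHUNK_BLOCKS, 4096). %% 4096 blocks * 4096 bytes = 16MB cap

      %% Fd must be opened with file:open(Path, [read, raw, binary]).
      %% Returns {ok, BlockNumber} for the newest block whose prefix byte marks a
      %% header, or no_valid_header. The real code must also deserialize and
      %% verify each candidate header; that step is omitted here.
      find_header(_Fd, FileSize) when FileSize =< 0 ->
          no_valid_header;
      find_header(Fd, FileSize) ->
          LastBlock = (FileSize - 1) div ?BLOCK_SIZE,
          find_header(Fd, LastBlock, 1).

      find_header(_Fd, Block, _ChunkBlocks) when Block < 0 ->
          no_valid_header;
      find_header(Fd, Block, ChunkBlocks) ->
          %% Read ChunkBlocks blocks ending at Block with a single pread ...
          FirstBlock = max(0, Block - ChunkBlocks + 1),
          Offset = FirstBlock * ?BLOCK_SIZE,
          Len = (Block - FirstBlock + 1) * ?BLOCK_SIZE,
          {ok, Chunk} = file:pread(Fd, Offset, Len),
          case scan_chunk(Chunk, FirstBlock) of
              {ok, HeaderBlock} ->
                  {ok, HeaderBlock};
              not_found ->
                  %% ... and double the chunk size for the next step backward,
                  %% up to a fixed cap (configurable in the proposed change).
                  find_header(Fd, FirstBlock - 1,
                              min(ChunkBlocks * 2, ?MAX_CHUNK_BLOCKS))
          end.

      %% Scan the in-memory chunk from its last block to its first, looking for
      %% a block whose first byte is the header marker.
      scan_chunk(Chunk, FirstBlock) ->
          NumBlocks = (byte_size(Chunk) + ?BLOCK_SIZE - 1) div ?BLOCK_SIZE,
          scan_chunk(Chunk, FirstBlock, NumBlocks - 1).

      scan_chunk(_Chunk, _FirstBlock, N) when N < 0 ->
          not_found;
      scan_chunk(Chunk, FirstBlock, N) ->
          case binary:at(Chunk, N * ?BLOCK_SIZE) of
              1 -> {ok, FirstBlock + N};
              _ -> scan_chunk(Chunk, FirstBlock, N - 1)
          end.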


          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user jaydoane opened a pull request:

          https://github.com/apache/couchdb-couch/pull/185

          3061 adaptive header search

          The current find_header algorithm performs one file read per block,
          which can be inefficient when the most recent header is buried deeply in
          a large file.

          With default config params, this change essentially keeps current
          behavior, reading each block into memory one at a time until a header is
          found, or beginning of file is reached. However, by setting new config
          params, it's possible to exponentially increase the number of blocks
          read into a "chunk" of memory with a single read, up to a configurable
          limit.

          For example, the following settings begin with a chunk size of one
          block, and then double it with each additional search step backward, up
          to a maximum of 16MB (4096*4096 bytes):

          [couchdb]
          chunk_max_size = 4096*4096
          chunk_exponent_base = 2

          Measurements for a 12GB .couch.meta file with its last header at 4GB
          show a speed improvement of 17x (server) to 27x (laptop).

          COUCHDB-3061
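
          To make the configured growth concrete (illustrative shell arithmetic only, not code from the pull request), with chunk_exponent_base = 2, chunk_max_size = 4096*4096, and a 4096-byte block, the bytes read per backward search step would be:

          1> [min(4096 * trunc(math:pow(2, Step)), 4096 * 4096) || Step <- lists:seq(0, 14)].
          [4096,8192,16384,32768,65536,131072,262144,524288,1048576,
           2097152,4194304,8388608,16777216,16777216,16777216]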

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/cloudant/couchdb-couch 3061-adaptive-header-search

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/couchdb-couch/pull/185.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #185


          commit 1c3db513a029351af6af26b4dd1da1f72f978789
          Author: Jay Doane <jay.s.doane@gmail.com>
          Date: 2016-07-14T02:05:56Z

          Rename function and variables:

          • Drop "calculate_" prefix since all functions perform calculations
          • Change "_len" to "_size" for better consistency
          • Change TotalBytes to TotalSize for better consistency

          commit abf1e76fa996327bb52ac038b3d4520662cbab15
          Author: Jay Doane <jay.s.doane@gmail.com>
          Date: 2016-07-15T23:29:02Z

          Implement adaptive header search

          The current find_header algorithm performs one file read per block,
          which can be inefficient when the most recent header is buried deeply in
          a large file.

          With default config params, this change essentially keeps current
          behavior, reading each block into memory one at a time until a header is
          found, or beginning of file is reached. However, by setting new config
          params, it's possible to exponentially increase the number of blocks
          read into a "chunk" of memory with a single read, up to a configurable
          limit.

          For example, the following settings begin with a chunk size of one
          block, and then double it with each additional search step backward, up
          to a maximum of 16MB (4096*4096 bytes):

          [couchdb]
          chunk_max_size = 4096*4096
          chunk_exponent_base = 2

          Measurements for a 12GB .couch.meta file with its last header at 4GB
          show a speed improvement of 17x (server) to 27x (laptop).

          COUCHDB-3061


          jaydoane Jay Doane added a comment -

          Opened related issue https://issues.apache.org/jira/browse/COUCHDB-3065
          jira-bot ASF subversion and git services added a comment -

          Commit d5fbb6dd0d770861222b6a5de3124f10d9eafd4f in couchdb-couch's branch refs/heads/master from Jay Doane
          [ https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=d5fbb6d ]

          Use vectored reads to search for buried headers

          Current behavior attempts to read a header at each block, starting at
          the eof and working backwards one block at a time. Deeply buried headers
          can take a long time to find, depending on the depth of the header,
          server load, etc.

          This commit changes the behavior so that if the last block in the file
          does not contain a header, it switches to using a "vectored" approach
          where multiple candidate header prefixes (of 5 bytes) are read in a
          single operation, greatly speeding up the search. On a modern linux
          system with SSD, we see improvements up to 15x.

          COUCHDB-3061
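
          A minimal sketch of such a vectored read (hypothetical module and function names, not the code from commit d5fbb6d), using file:pread/2, which accepts a list of {Offset, Length} locations and services them in a single call; the actual implementation must additionally deserialize and verify a full header at each candidate offset:

          -module(vectored_header_scan).
          -export([candidate_header_blocks/3]).

          -define(BLOCK_SIZE, 4096).
          -define(PREFIX_SIZE, 5).

          %% Read only the first ?PREFIX_SIZE bytes of every block between
          %% FirstBlock and LastBlock with one vectored pread, and return the
          %% offsets (newest first) whose prefix byte marks a header block.
          candidate_header_blocks(Fd, FirstBlock, LastBlock) ->
              Offsets = [B * ?BLOCK_SIZE || B <- lists:seq(LastBlock, FirstBlock, -1)],
              Locations = [{Off, ?PREFIX_SIZE} || Off <- Offsets],
              {ok, Prefixes} = file:pread(Fd, Locations),
              %% Reads past the end of file come back as 'eof' and simply fail
              %% the binary match below.
              [Off || {Off, <<1, _/binary>>} <- lists:zip(Offsets, Prefixes)].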

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/couchdb-couch/pull/185

          jaydoane Jay Doane added a comment -

          This should be resolved by this commit: https://github.com/apache/couchdb-couch/commit/6997c3c236ef9c492810410077fbdff1c303bca0

            People

             • Assignee: jaydoane Jay Doane
             • Reporter: jaydoane Jay Doane
             • Votes: 0
             • Watchers: 3
