CouchDB / COUCHDB-271

preventing compaction from ruining the OS block cache

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.1, 0.9
    • Fix Version/s: None
    • Component/s: Database Core
    • Labels:
      None
    • Skill Level:
      Committers Level (Medium to Hard)

      Description

      Adam Kocoloski:

      Hi, I've noticed that compacting large DBs pretty much kills any filesystem caching benefits for CouchDB. I believe the problem is that the OS (Linux 2.6.21 kernel in my case) is caching blocks from the .compact file, even though those blocks won't be read again until compaction has finished. In the meantime, the portion of the cache dedicated to the old DB file shrinks and performance really suffers.

      I think a better mode of operation would be to advise/instruct the OS not to cache any portion of the .compact file until we're ready to replace the main DB. On Linux, specifying the POSIX_FADV_DONTNEED option to posix_fadvise() seems like the way to go:

      http://linux.die.net/man/2/posix_fadvise

      This link has a little more detail and a usage example:

      http://insights.oetiker.ch/linux/fadvise.html

      Of course, POSIX_FADV_DONTNEED isn't really available from inside the Erlang VM. Perhaps the simplest approach would be to have a helper process that we can spawn which calls that function (or its equivalent on a non-Linux OS) periodically during compaction? I'm not really sure, but I wanted to get this out on the list for discussion. Best,
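
      For illustration, the call discussed above might look roughly like this in C. This is a minimal sketch, not CouchDB code: drop_cached_pages is a hypothetical name, and it assumes a Linux/POSIX system that provides posix_fadvise(). Dirty pages are flushed first because, per the man page, POSIX_FADV_DONTNEED leaves pages that have not yet been written out untouched.

```c
#if !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L  /* exposes posix_fadvise() and fdatasync() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (name is an assumption, not part of CouchDB):
 * ask the kernel to drop the cached pages of an open file.
 * Returns 0 on success or an errno-style error number on failure. */
int drop_cached_pages(int fd)
{
    /* Pages that are still dirty are not discarded, so flush them first. */
    if (fdatasync(fd) != 0)
        return errno;

    /* offset 0 with len 0 covers the whole file. posix_fadvise() returns
     * the error number directly instead of setting errno. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

      A helper process could call something like this on the .compact file at intervals while compaction runs. The hint is one-shot: it only affects pages cached at the time of the call, so it would have to be repeated as the file grows.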

        Activity

        Jan Lehnardt added a comment -

        Damien Katz:

        The problem is that we don't get access to the low-level APIs or flags passed to the OS unless Erlang chooses to expose them. We have similar problems with compaction on Windows because we need special flags to give us Unix file semantics.

        To fix this, we'll either need the Erlang VM changed or use our own Erlang file driver interface.

        Oh yeah, one more option that is kind of crazy is to spawn a small external child process for file I/O. It would be a very small, simple process that opens a file and responds to read/write commands from the Erlang server. Then we can implement exactly the low-level APIs and caching behavior desired. The cost is extra IPC, but that should be small compared to the cost of a blown file cache.
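
        As a rough sketch of the plumbing such a child process would need, the C below reads and writes length-prefixed messages on a file descriptor. It assumes the Erlang side opens the port with the {packet, 2} option, so each message is preceded by a 2-byte big-endian length. The function names and the choice of framing are assumptions for illustration; the comment above does not specify a protocol.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Read exactly n bytes. Returns 0 on success, -1 on EOF or error. */
static int read_exact(int fd, unsigned char *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, buf + got, n - got);
        if (r == 0)
            return -1;              /* peer closed the pipe */
        if (r < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry */
            return -1;
        }
        got += (size_t)r;
    }
    return 0;
}

/* Write exactly n bytes. Returns 0 on success, -1 on error. */
static int write_all(int fd, const unsigned char *buf, size_t n)
{
    size_t done = 0;
    while (done < n) {
        ssize_t w = write(fd, buf + done, n - done);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        done += (size_t)w;
    }
    return 0;
}

/* Read one frame: 2-byte big-endian length, then the payload.
 * Returns the payload length, or -1 on EOF, error, or if the
 * payload does not fit in cap bytes. */
long read_packet(int fd, unsigned char *buf, size_t cap)
{
    unsigned char hdr[2];
    if (read_exact(fd, hdr, 2) != 0)
        return -1;
    size_t len = ((size_t)hdr[0] << 8) | hdr[1];
    if (len > cap)
        return -1;
    if (read_exact(fd, buf, len) != 0)
        return -1;
    return (long)len;
}

/* Write one frame with the same 2-byte length prefix.
 * Returns 0 on success, -1 on error or if len exceeds 65535. */
int write_packet(int fd, const unsigned char *buf, size_t len)
{
    unsigned char hdr[2];
    if (len > 0xFFFF)
        return -1;
    hdr[0] = (unsigned char)(len >> 8);
    hdr[1] = (unsigned char)(len & 0xFF);
    if (write_all(fd, hdr, 2) != 0)
        return -1;
    return write_all(fd, buf, len);
}
```

        The helper's main loop would call read_packet on standard input, perform the requested file operation with whatever flags or cache hints are wanted, and answer with write_packet on standard output.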

        Jan Lehnardt added a comment -

        The best way to get any patch into OTP is coming up with a patch to send to erlang-patches@. If we manage to find cross-platform alternatives and a non-intrusive patch, I'd assume a high chance of Ericsson accepting the patch.

        We could make it easy and label the patch Linux-only, but we might as well do some research for at least Windows, Solaris, the BSDs and Darwin.

        Once included and released with OTP, we'd bump the minimum required version.

        If the patch gets rejected, or if we want to support older OTP releases, we should look into the external daemon variant.

        Louis Gerbarg added a comment -

        On Darwin you do this by setting the F_NOCACHE fcntl on the descriptor:

        #include <fcntl.h>

        ....

        /* a non-zero third argument turns data caching off for this descriptor */
        err = fcntl(file_des, F_NOCACHE, 1);
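
        One way the Darwin and Linux hints from this thread could sit behind a single function is sketched below. The name discourage_caching is hypothetical, and other platforms are left as an unsupported stub rather than guessed at. Note that the two mechanisms are not equivalent: F_NOCACHE is a persistent setting on the descriptor, while POSIX_FADV_DONTNEED only applies to pages cached at the time of the call.

```c
#if defined(__linux__) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L  /* exposes posix_fadvise() on glibc */
#endif
#include <errno.h>
#include <fcntl.h>

/* Hypothetical wrapper (name is an assumption): hint that data for fd
 * should not occupy the OS file cache.
 * Returns 0 on success or an errno-style error number on failure. */
int discourage_caching(int fd)
{
#if defined(__APPLE__)
    /* Persistent per-descriptor flag: non-zero turns caching off. */
    return fcntl(fd, F_NOCACHE, 1) == -1 ? errno : 0;
#elif defined(POSIX_FADV_DONTNEED)
    /* One-shot hint over the whole file (offset 0, len 0); it has to
     * be repeated as more data is written. Returns the error number
     * directly rather than through errno. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#else
    (void)fd;
    return ENOTSUP;  /* no known equivalent handled in this sketch */
#endif
}
```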

        Robert Newson added a comment -

        Another way to approach this is to eliminate wholesale database writing to achieve compaction.

        Specifically, instead of a single file for a CouchDB database, it would be an ordered sequence of files. It's still append-only, so earlier files will contain data that's been superseded by updates, etc., just as they do today. Each file is eligible to be compacted separately by reading all the extant records from it and writing them to the end of the current file; the old file is then deleted. With this approach (cf. Berkeley JE), compaction could be an ongoing background task, would not require free disk space equal to 100% of the database size, and the current inability to swap to the .compact file in the presence of constant writes would also be addressed.

        Adam Kocoloski added a comment -

        0.10.0 is out the door, adjusting FixFor on all remaining unresolved issues to 0.11 by default

        Joan Touzet added a comment -

        Recommend pushing this farther out. This is non-trivial, and rnewson's suggestion is massively non-backward-compatible.

        This is not a good candidate for a (soon now!) 1.3.0.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Jan Lehnardt