COUCHDB-1023

Batching writes of BTree nodes (when possible) and in the DB updater

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Database Core
    • Labels: None

      Description

      Recently I started experimenting with batching writes in the DB updater.

      In a test with 100 writers of 1 KB documents, for example, the updater most often collects between 20 and 30 documents to write.

      Currently it does a file:write operation for each one. Not only is this slower, it also implies more context switches and stresses the OS/filesystem by allocating a few blocks very often (since we use a pure append write mode). The same batching can be done for the BTree node writes.
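
      The idea, as a minimal sketch (function names are hypothetical, not taken from the patch): since file:write/2 accepts iodata, the updater can hand all the collected summaries to one write instead of issuing one write per summary.

        %% Unbatched: one file:write (and one syscall) per document summary.
        write_each(Fd, Summaries) ->
            [ok = file:write(Fd, S) || S <- Summaries],
            ok.

        %% Batched: a single file:write of the whole list, treated as an iolist.
        write_batch(Fd, Summaries) ->
            ok = file:write(Fd, Summaries).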

      The following branch/patch is an experiment with batching writes:

      https://github.com/fdmanana/couchdb/compare/batch_writes

      In couch_file there's a quick test method that compares the time taken to write X blocks of size Y versus writing a single block of size X * Y.
      Example:

      Eshell V5.8.2 (abort with ^G)
      1> Apache CouchDB 1.2.0aa777195-git (LogLevel=info) is starting.
      Apache CouchDB has started. Time to relax.
      [info] [<0.37.0>] Apache CouchDB has started on http://127.0.0.1:5984/

      1> couch_file:test(1000, 30).
      multi writes of 30 binaries, each of size 1000 bytes, took 1920us
      batch write of 30 binaries, each of size 1000 bytes, took 344us
      ok
      2>
      2> couch_file:test(4000, 30).
      multi writes of 30 binaries, each of size 4000 bytes, took 2002us
      batch write of 30 binaries, each of size 4000 bytes, took 700us
      ok
      3>
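
      The test function presumably does something along these lines (a sketch under that assumption, not the actual couch_file:test/2):

        test(Size, Count) ->
            Bin = <<0:(Size * 8)>>,                 %% one Size-byte binary
            Bins = lists:duplicate(Count, Bin),
            {ok, Fd} = file:open("batch_test.bin", [append, raw, binary]),
            %% Time Count separate writes, then one write of the whole iolist.
            {T1, ok} = timer:tc(fun() ->
                lists:foreach(fun(B) -> ok = file:write(Fd, B) end, Bins)
            end),
            {T2, ok} = timer:tc(fun() -> ok = file:write(Fd, Bins) end),
            io:format("multi writes took ~pus, batch write took ~pus~n", [T1, T2]),
            file:close(Fd).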

      Close to an order of magnitude less is quite significant, I would say.

      Lower response times are mostly noticeable when delayed_commits is set to true.
      Running a writes-only test with this branch gave me:

      http://graphs.mikeal.couchone.com/#/graph/8bf31813eef7c0b7e37d1ea25902e544

      While with trunk I got:

      http://graphs.mikeal.couchone.com/#/graph/8bf31813eef7c0b7e37d1ea25902eb50

      These tests were done on Linux with ext4 (and OTP R14B01).

      However, I'm still not 100% sure whether this is worth applying to trunk.
      Any thoughts?

        Activity

        Filipe Manana created issue
        Paul Joseph Davis added a comment

        In theory the btree update is fine. I'm not entirely familiar with that part of the db updater code, so I can't comment with any authority on that section. I trust that it's not any more crazy than just changing enough code to enable multiple writes and whatnot.

        One comment I do have is that I would prefer the couch_file API to be more straightforward. For instance, the btree code has to do its own term_to_binary call, when you could just create a couch_file:append_terms/2 method that would do that, which would make client code a bit cleaner.

        As a one-off comment, I'm still contemplating extending the fd NIF to not break the scheduler, which may render some of these sorts of "optimizations" moot. Depending on the severity of the snowpocalypse tomorrow I may have the day off, and this sounds like something I might work on.
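
        The suggestion would look something like this (a sketch only; append_binaries/2 stands in for whatever batched append primitive the branch provides):

          %% Hypothetical wrapper: serialize terms inside couch_file so
          %% client code no longer needs its own term_to_binary calls.
          append_terms(Fd, Terms) ->
              Bins = [term_to_binary(T) || T <- Terms],
              append_binaries(Fd, Bins).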

        Filipe Manana added a comment

        "One comment I do have is that I would prefer that the couch_file api is more straight forward. For instance, the btree code has to do its own term_to_binary call when you could just create a couch_file:append_terms/2 method that would do that which would make things a bit more clean in client code. "

        That was sort of intentional: 1) I wanted to do some quick testing; 2) by not having an append_terms_md5 version, I avoid doing another map to transform each term into a binary.

        But I have no objections to that at all.

        Randall Leeds added a comment

        Didn't I do this work already and not notice any significant gains? I haven't looked to see whether you did it differently, but here's a version with an append_terms:

        https://github.com/tilgovi/couchdb/tree/realbatchwrite

        I also have branches and patches where I experimented with other ways of changing this code path. I do like calling term_to_binary before the gen_server:call to couch_file because it should avoid a copy operation, but you have to consider what other path you're slowing down as a result.

        If it's not getting too complex for no gain, my next thought would be to take the caching code you worked on before and use it to get a write-through cache that we flush asynchronously. The goal would be to let the updater do as much as possible, up to the flush of the next commit group, while the current one is being written. Something like this?
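
        The shape of that idea, as a sketch (the message tag is made up): serializing before the call means the message carries a binary, and large binaries are reference-counted rather than copied between process heaps, whereas sending the raw term would copy it into the couch_file process.

          append_term(Fd, Term) ->
              Bin = term_to_binary(Term),
              gen_server:call(Fd, {append_bin, Bin}, infinity).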

        Filipe Manana added a comment

        Hi Randall, no, I wasn't aware of your experiment.

        Just quickly looking at it, the main difference seems to be that yours does an extra map/fold over each key tree and then maps each document to the respective summary.

        As for the term_to_binary before a gen_server call, I don't think it offers any gain. Does anyone know exactly which is more expensive: converting a term to a binary or copying a term?

        And I don't think the complexity of adding a write-through cache is worth it: more code, one more server, and possibly a new bottleneck. For that I would use the delayed_write option of Erlang's file module.
        But I might be wrong; a concrete implementation and benchmarks would definitely change my mind.
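
        For reference, that option is given at open time; a minimal sketch (file name and buffer parameters are arbitrary):

          %% {delayed_write, Size, Delay}: writes are buffered and flushed
          %% once Size bytes accumulate or Delay milliseconds pass, which
          %% batches small appends without changing the calling code.
          {ok, Fd} = file:open("foo.couch",
                               [append, raw, binary,
                                {delayed_write, 64 * 1024, 2000}]).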


          People

          • Assignee: Unassigned
          • Reporter: Filipe Manana
          • Votes: 1
          • Watchers: 0
