COUCHDB-568

When delayed_commits = true, keep updated btree nodes in memory until the commit

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Skill Level: Committers Level (Medium to Hard)

      Description

      rnewson reported on IRC that the new batch=ok implementation results in significantly larger overhead in the .couch files. This makes sense; the old batch mode waited 1 second before saving, but the new implementation just updates the doc asynchronously. With fast hardware and moderate write rates it's likely that each document is being written separately.

      The overhead presumably arises from frequently updated btree inner nodes being written to disk many times over. I'm interested in exploring a modification of the delayed_commits mode whereby the updated btree nodes are not actually written to disk immediately, but are instead held in memory until the commit. I'd like to think that this will result in more compact files without any decrease in durability. New read requests would still be able to access these in-memory nodes.

      I realize the notion that updates go directly to disk is baked pretty deeply into couch_btree, but I still thought this was worth bringing up to a wider audience.
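
      A minimal sketch of the idea in Erlang, assuming a hypothetical pending-node map rather than couch_btree's real node format (write_node/read_node/commit and the pointer scheme below are illustrative only, not CouchDB's API):

      %% Hypothetical sketch, not couch_btree's API: updated btree nodes are
      %% kept in a map until the delayed commit, and only appended to the
      %% .couch file once, when the commit fires. Reads consult the pending
      %% map first, so uncommitted nodes stay visible to readers.
      -module(pending_nodes_sketch).
      -export([new/0, write_node/2, read_node/2, commit/2]).

      new() ->
          #{next_ptr => 0, pending => #{}}.

      %% Instead of appending Node to the file right away, remember it in
      %% memory and hand back a pointer the parent node can store.
      write_node(Node, #{next_ptr := Ptr, pending := Pending} = St) ->
          {Ptr, St#{next_ptr => Ptr + 1, pending => Pending#{Ptr => Node}}}.

      %% Readers check the in-memory map before falling back to the file.
      read_node(Ptr, #{pending := Pending}) ->
          case maps:find(Ptr, Pending) of
              {ok, Node} -> {memory, Node};
              error      -> {disk, Ptr}   % a real file read would happen here
          end.

      %% On the delayed commit, flush every pending node once
      %% (the caller would then fsync).
      commit(AppendFun, #{pending := Pending} = St) ->
          lists:foreach(fun({_Ptr, Node}) -> AppendFun(Node) end,
                        maps:to_list(Pending)),
          St#{pending => #{}}.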

        Activity

        Paul Joseph Davis added a comment -

        Definitely an interesting idea. In the thread on parallelized b~trees I was basically thinking that we could take Damien's modifications to couch_db_updater.erl and push all that logic down into couch_btree.erl which would allow multiple mappers to make more efficient use of btree updates in view generation.

        Allowing a b~tree to hold new nodes in RAM for some duration before being synced should allow for similar speedups in the condensed tree writes like batch docs did.

        Though, this could allow for readers to see a database in a state that was never on disk. With batch writes we never allowed readers to see the docs in the write buffer, so the progression of db state was unidirectional. With pending writes viewable, an error could make the btree state go backwards. If that makes sense.

        Much to contemplate.

        Adam Kocoloski added a comment -

        Yes, an error could make the state go backwards, just like killing couch before the fsync.

        Paul Joseph Davis added a comment -

        Oh, good point. I might have to focus on trying to write that code for the b~tree updater. It definitely sounds fun.

        Damien Katz added a comment -

        I'm not sure how much parallelizing the writes buys you with an append-only architecture. Right now with complex keys, it's possible to make the CPU the bottleneck (as we saw with the Raindrop Megaview problem), but typically the bottleneck is how fast we can stream the writes to disk. I think it would be more fruitful to optimize the current architecture for CPU and memory IO until we make the disk the bottleneck.

        Another possible optimization would be to have a bunch of uncommitted actions that stay in memory, and as you read-query the btrees, the actions are applied on the fly into the nodes at the appropriate time. I've wanted to do this for a long time, just never got around to it as it's fairly complex and would need to be applied to most every btree operation. Reductions would be trickiest.

        But with this optimization it's possible to do 2 things (a rough sketch of the in-memory merge follows below):
        1. virtually insert/remove stuff into the btree but not to disk.
        2. write these actions to disk but not to the btrees, batching them up for more efficient btree inserts, and still committing them to disk. The batch amount would need to be tuned for optimal performance.
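
        A rough Erlang sketch of the virtual-merge part of this (the action tuples and function names are made up for illustration; a real implementation would also use CouchDB's collation rather than plain term order):

        %% Hypothetical sketch: pending insert/remove actions stay in memory
        %% and are merged into a leaf's key/value list on the fly whenever
        %% that leaf is read, so queries see the latest state even though
        %% the on-disk tree has not been rewritten yet.
        -module(virtual_actions_sketch).
        -export([apply_actions/2]).

        %% KVs is the sorted key/value list read from a leaf node on disk;
        %% Actions is the batch of uncommitted actions that fall on this leaf.
        apply_actions(KVs, Actions) ->
            %% keysort uses Erlang term order; the real thing would need the
            %% view collation order instead.
            lists:keysort(1, lists:foldl(fun apply_action/2, KVs, Actions)).

        apply_action({insert, Key, Value}, KVs) ->
            lists:keystore(Key, 1, KVs, {Key, Value});
        apply_action({remove, Key}, KVs) ->
            lists:keydelete(Key, 1, KVs).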

        Paul Joseph Davis added a comment -

        My current idea is to have each b~tree use a pid between btree actions and the disk. This would allow writes to be collected and flushed simultaneously thus saving rewrites of the upper nodes, as well as having transparent "memory-only" edits to the tree that will be flushed some time in the future.

        Then for parallel writers, we allow them to traverse the tree and 'prepare writes' that can be merged into the current memory state and eventually flushed. This has the benefit of releasing all of the key collation from being bound to a single core in a fairly intuitive way.

        This would also map onto the current database delayed_commits behavior, with a write either flushing all of the in-RAM nodes or not, while still saving on the upper node duplication.

        Or something to that effect.
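
        A hedged sketch of what that intermediary pid might look like in Erlang (the write/read/flush message protocol here is invented for illustration and is not CouchDB's):

        %% Hypothetical buffer process sitting between btree actions and the
        %% disk: writes are collected in memory and handed out as pointers,
        %% reads are served from the buffer, and flush/1 hands every buffered
        %% node to AppendFun in one pass (saving rewrites of upper nodes).
        -module(btree_buffer_sketch).
        -export([start_link/1, write/2, read/2, flush/1]).

        start_link(AppendFun) ->
            spawn_link(fun() -> loop(AppendFun, #{}, 0) end).

        write(Pid, Node) ->
            Pid ! {write, self(), Node},
            receive {ok, Ptr} -> Ptr end.

        read(Pid, Ptr) ->
            Pid ! {read, self(), Ptr},
            receive {node, Result} -> Result end.

        flush(Pid) ->
            Pid ! {flush, self()},
            receive flushed -> ok end.

        loop(AppendFun, Pending, NextPtr) ->
            receive
                {write, From, Node} ->
                    From ! {ok, NextPtr},
                    loop(AppendFun, Pending#{NextPtr => Node}, NextPtr + 1);
                {read, From, Ptr} ->
                    From ! {node, maps:get(Ptr, Pending, not_buffered)},
                    loop(AppendFun, Pending, NextPtr);
                {flush, From} ->
                    lists:foreach(fun({_P, N}) -> AppendFun(N) end,
                                  maps:to_list(Pending)),
                    From ! flushed,
                    loop(AppendFun, #{}, NextPtr)
            end.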

        Damien Katz added a comment -

        I know this isn't what you are trying to achieve with the batching, but I just remembered something.

        An easy way to parallelize actions and use more CPUs is to spawn_link a process to query/modify a sub node with a matching batch of key/actions, then move on to the next sub node and matching key/actions and do the same, etc, etc. Then you wait for your sub-pids to return back their results. The maximal number of recursive sub-processes active at any time would be K*log(N), where K is the number of keys we are querying/modifying and N is the total number of entries in the index. In practice it wouldn't be that many processes, but we might want to limit the number of processes spawned anyway, depending on tuning.

        I think that actually would be a pretty good way to use more of the CPUs while updating an index, assuming you are CPU bound on writes. For reads of multiple keys, it likely would also be a performance win.
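
        A small Erlang sketch of that fan-out (the tree representation and apply_to_child/2 are placeholders, not couch_btree internals):

        %% Hypothetical sketch: each child node that has matching actions is
        %% handled by its own spawn_link'd process, and the parent collects
        %% the results in order. Recursing this way gives at most roughly
        %% K*log(N) live processes for K keys in an index of N entries.
        -module(parallel_subtree_sketch).
        -export([update_children/2]).

        %% Children is [{ChildId, ChildState}]; ActionsByChild maps a ChildId
        %% to the batch of key/actions that fall inside that child's range.
        update_children(Children, ActionsByChild) ->
            Parent = self(),
            Refs =
                [begin
                     Ref = make_ref(),
                     Actions = maps:get(Id, ActionsByChild, []),
                     spawn_link(fun() ->
                         Parent ! {Ref, apply_to_child(Child, Actions)}
                     end),
                     Ref
                 end || {Id, Child} <- Children],
            %% Wait for every sub-pid to report back, preserving child order.
            [receive {Ref, Result} -> Result end || Ref <- Refs].

        %% Placeholder for the real recursive query/modify of a subtree.
        apply_to_child(ChildState, Actions) ->
            {ChildState, length(Actions)}.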


          People

          • Assignee: Unassigned
          • Reporter: Adam Kocoloski
          • Votes: 0
          • Watchers: 1
