Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: kv
    • Labels:
      None

      Description

      RocksDB uses a write-ahead log by default. This is unnecessary in Samza, since we have full durability from a state store's changelog topic. We should disable the WAL in the RocksDB KV store.

      1. 2015-04-06 RocksDB Performance.pdf
        414 kB
        Chris Riccomini
      2. DisableWAL.diff
        11 kB
        Navina Ramesh
      3. SAMZA-543-0.patch
        4 kB
        Navina Ramesh
      4. SAMZA-543-2.patch
        15 kB
        Navina Ramesh

        Activity

        theduderog Roger Hoover added a comment -

        Chris Riccomini, thanks for sharing!

        criccomini Chris Riccomini added a comment -

        Roger Hoover, attaching a slide deck that one of our engineers did to investigate RocksDB performance and tuning.

        criccomini Chris Riccomini added a comment -

        Could be. I think something is really weird here, though. Even at 1KB messages, seeing 17MB/s BEFORE compression seems insanely slow. Especially considering Navina Ramesh was running this on a laptop with SSDs. We definitely need more investigation on a different ticket. We are also working on beefing up the RocksDB JNI test suite (outside Samza), and pushing that upstream. We'll keep you posted on those results as well.

        jkreps Jay Kreps added a comment -

        Yeah, quite apart from this settings change, it is surprising that you see lower throughput in MB/sec with larger messages. You should see much higher throughput. I wonder if maybe the issue is that the test is small enough that, with the smaller messages, you aren't really hitting disk?

        criccomini Chris Riccomini added a comment -

        +1 Merged and committed. Thanks, Navina!

        navina Navina Ramesh added a comment -

        Updated based on Yi Pan (Data Infrastructure)'s comments in RB

        navina Navina Ramesh added a comment -

        Yes, Chris Riccomini.
        This patch can be reviewed. I don't plan to investigate other improvements in this ticket.

        criccomini Chris Riccomini added a comment -

        Navina Ramesh, are you ready for reviews on this? Just wanted to confirm that you're not still working on it.

        navina Navina Ramesh added a comment -

        Yeah. I noticed that difference too. It could be due to the block size that is used in the block-based table format. I am yet to investigate in that direction.

        theduderog Roger Hoover added a comment -

        Navina Ramesh, this is great! Thank you for moving this forward.

        It's curious that the improvement was more dramatic for messages of size 512 (3x) than for size 256 (2x). Either way, a nice speed up.

        navina Navina Ramesh added a comment -

        Updated the RB - https://reviews.apache.org/r/32188/

        I ran the performance test against the existing code and with the patch applied. I did see some performance improvements with WAL disabled and sync turned off (sync is off by default).

        Msg Size | Num of Msgs | Time Taken (s) | Msg Rate (MB/sec) | Time Taken w/ WAL Disabled (s) | Msg Rate (MB/sec)
        256      | 1000000     | 6.544          | 37.308            | 3.183                          | 76.701
        512      | 1000000     | 27.29          | 17.892            | 8.324                          | 58.660
        1024     | 1000000     | 73.603         | 13.268            | 56.276                         | 17.353

        I would say that is an improvement!
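
        The rates in the table can be cross-checked from the message size, count, and elapsed time. A quick sketch (the class and method names here are mine, for illustration; they are not part of the patch):

```java
// Cross-check of the throughput figures in the table above:
// rate (MB/sec) = msgSize * numMsgs / 2^20 / timeTakenSeconds
public class RateCheck {
    static double rateMBps(int msgSize, long numMsgs, double seconds) {
        return (msgSize * (double) numMsgs) / (1024.0 * 1024.0) / seconds;
    }

    public static void main(String[] args) {
        // With WAL: 256-byte msgs, 1M msgs, 6.544 s -> ~37.3 MB/sec
        System.out.printf("%.3f%n", rateMBps(256, 1_000_000, 6.544));
        // WAL disabled: same workload in 3.183 s -> ~76.7 MB/sec
        System.out.printf("%.3f%n", rateMBps(256, 1_000_000, 3.183));
    }
}
```

        The derived rates match the table to within rounding, so the reported figures are internally consistent.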

        theduderog Roger Hoover added a comment -

        Excellent. Thanks!

        navina Navina Ramesh added a comment -

        Yeah. It is as good as setting another flag in the WriteOptions. I am currently working on adding that change, along with some simple perf test.

        theduderog Roger Hoover added a comment -

        Is it pretty easy to test disableDataSync? I'm interested in trying it out.

        navina Navina Ramesh added a comment -

        Yeah. I thought so too.
        From what I understood, I think their idea is to provide more granular control over which writes get logged. Can't think of a good use of this granularity of control.

        criccomini Chris Riccomini added a comment -

        Re: patch. Do you need to specify writeOptions on every write/remove? I thought providing it just in the RocksDB constructor would be enough.

        criccomini Chris Riccomini added a comment -

        should disableDataSync be set to true? I couldn't figure out the default value used in the JNI bindings.

        We should force this to true. Looks like I was mistaken. disableDataSync defaults to false according to the Javadocs.

        Do we want to run any performance tests before we commit this change?

        I think so, just to prove that it's still faster with the WAL disabled. Maybe you can modify TestKeyValuePerformance to make it useful? Right now it just tests a very specific use case (all with deletes). It'd be nice to test bulk loading some data both with/without WAL/data sync, and see how it performs.

        My suggestion is to do this one change at a time and measure the performance improvement.

        Agreed.

        navina Navina Ramesh added a comment -

        Having read the discussion above, I have a couple of questions:

        • should disableDataSync be set to true? I couldn't figure out the default value used in the JNI bindings.
        • Do we want to run any performance tests before we commit this change?

        We can examine more improvements to RocksDB, such as changing the compaction style, batch writes, etc. My suggestion is to do this one change at a time and measure the performance improvement. Please comment on this as well!

        navina Navina Ramesh added a comment -

        I made changes to disable WAL. Verified it with the hello-samza jobs. Log file is empty. bin/check-all was successful.

        jkreps Jay Kreps added a comment -

        Looks like we couldn't give the rocksdb folks a reproducible case: https://github.com/facebook/rocksdb/issues/262

        jkreps Jay Kreps added a comment -

        Oh, wow, yeah, well that may explain the performance. The batch write was about a 5x perf improvement for write intensive stuff in leveldb. Did you guys perf test the RocksDB stuff when it was added?

        But actually what I was describing wasn't batch write but rather disabling compaction. They have some way to do "batched compaction". If you check out their benchmark on batch load perf they describe this. Basically instead of doing a ton of compaction online during the bootstrap load, you just read all the data (which is mostly deduplicated anyway) and then when the bootstrap is complete they do a global sort. This should be a lot faster because otherwise you potentially compact each thing many times for no reason.

        criccomini Chris Riccomini added a comment -

        Naveen Somasundaram was seeing NPEs when trying to use RocksDB's batch writes via JNI. We haven't fixed it, but it's tracked in SAMZA-436.

        jkreps Jay Kreps added a comment -

        Another one that would be worth trying is the bulk load facilities. For bootstrapping this should be significantly faster as it eliminates all the write amplification and compaction...

        criccomini Chris Riccomini added a comment -

        Oddly, CompactionStyle.LEVEL seemed to be about 2x as fast as UNIVERSAL (Samza's default), as well. I only looked at write performance, though.

        criccomini Chris Riccomini added a comment -

        Can we also safely set disableDataSync=true for the same reason? Data is being persisted to a replicated changelog.

        I believe so, yes. I think the JNI bindings are disabling this by default.

        theduderog Roger Hoover added a comment - edited

        Chris Riccomini That's great. Thanks for trying it out. Can we also safely set disableDataSync=true for the same reason? Data is being persisted to a replicated changelog.

        you can disable syncing of data files by setting Options::disableDataSync to true. This can be useful when doing bulk-loading or big idempotent operations. Once the operation is finished, you can manually call sync() to flush all dirty buffers to stable storage.

        I think we should be able to just let the OS flush the buffers whenever it feels like it.

        criccomini Chris Riccomini added a comment -

        Roger Hoover, I ran a little test. Disabling the WAL seems to have a large effect on performance. My SSD is at ~170MB/s both with and without the WAL (well below its 400MB/s peak, mind you), but the records/sec is dramatically increased.

        With WAL:

        100000 rows in 649ms
        200000 rows in 1276ms
        300000 rows in 1894ms
        400000 rows in 2522ms
        500000 rows in 3177ms
        600000 rows in 3849ms
        700000 rows in 4500ms
        800000 rows in 5133ms
        900000 rows in 5770ms
        1000000 rows in 6439ms
        1100000 rows in 7145ms
        1200000 rows in 7937ms
        1300000 rows in 8596ms
        1400000 rows in 9236ms
        1500000 rows in 9913ms
        

        Without WAL:

        100000 rows in 224ms
        200000 rows in 434ms
        300000 rows in 629ms
        400000 rows in 828ms
        500000 rows in 1180ms
        600000 rows in 1465ms
        700000 rows in 1650ms
        800000 rows in 1855ms
        900000 rows in 2045ms
        1000000 rows in 3458ms
        1100000 rows in 3654ms
        1200000 rows in 3857ms
        1300000 rows in 4290ms
        1400000 rows in 4484ms
        1500000 rows in 4774ms
        

        This isn't meant to be an exhaustive proof, but it does show ~2x improvement when WAL is turned off. You can turn it off with this:

          import org.rocksdb.WriteOptions

          val writeOptions = new WriteOptions
          writeOptions.setDisableWAL(true)
          db.put(writeOptions, key, value)
        

        I manually edited RocksDbKeyValueStore to do this, as a test.
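
        For a rough read on the numbers above: 1.5M rows in 9913 ms with the WAL versus 4774 ms without works out to roughly a 2x speedup. A quick sketch (class and method names are mine, for illustration):

```java
// Rough throughput comparison from the final timings in this comment.
public class WalSpeedup {
    static double rowsPerSec(long rows, long millis) {
        return rows * 1000.0 / millis;
    }

    public static void main(String[] args) {
        double withWal = rowsPerSec(1_500_000, 9913);    // ~151k rows/sec
        double withoutWal = rowsPerSec(1_500_000, 4774); // ~314k rows/sec
        System.out.printf("speedup: %.2fx%n", withoutWal / withWal); // ~2.08x
    }
}
```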

        theduderog Roger Hoover added a comment -

        Thanks, Chris! I want to try this!

        criccomini Chris Riccomini added a comment -

        Roger Hoover, you might have a look at this as a way to reduce recovery time on stateful job start.


          People

          • Assignee: navina Navina Ramesh
          • Reporter: criccomini Chris Riccomini
          • Votes: 0
          • Watchers: 4
