Details

      Description

It seems an unnecessary oversight that we don't compress the commit log. Doing so should improve throughput, but some care will need to be taken to ensure we use as much of a segment as possible. I propose decoupling the writing of the records from the segments: basically, write into a (queue of) DirectByteBuffer, and have the sync thread compress, say, ~64K chunks every X MB written to the CL (where X is ordinarily the CLS size), and then pack as many of the compressed chunks into a CLS as possible.
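
As a rough sketch of this idea (assuming the lz4-java API; the class and method names here are hypothetical, not code from this ticket): the sync thread drains queued chunks, compresses each one, and packs as many compressed chunks as will fit into the current segment.

    import net.jpountz.lz4.LZ4Compressor;
    import net.jpountz.lz4.LZ4Factory;

    import java.nio.ByteBuffer;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    class ChunkedCompressionSketch
    {
        static final int CHUNK_SIZE = 64 * 1024;                          // ~64K uncompressed chunks
        final Queue<ByteBuffer> pending = new ConcurrentLinkedQueue<>();  // filled by writer threads
        final LZ4Compressor lz4 = LZ4Factory.fastestInstance().fastCompressor();

        // Called by the sync thread: compress queued chunks and pack as many as fit into the segment.
        void packInto(ByteBuffer segment)
        {
            ByteBuffer chunk;
            while ((chunk = pending.peek()) != null)
            {
                ByteBuffer compressed = ByteBuffer.allocate(lz4.maxCompressedLength(chunk.remaining()));
                lz4.compress(chunk.duplicate(), compressed);  // duplicate() leaves the original chunk intact
                compressed.flip();
                if (compressed.remaining() > segment.remaining())
                    break;                                    // segment full; leave the chunk for the next one
                segment.put(compressed);
                pending.poll();
            }
            // the caller would then write/force the segment to disk
        }
    }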

      Attachments

      1. ComitLogStress.java (6 kB) - Ariel Weisberg
      2. logtest.txt (10 kB) - Branimir Lambov

        Issue Links

          Activity

          benedict Benedict added a comment -

          I would probably go one step further still:

          1. CommitLogSegment should be split into two concepts: the on-disk file and new (much smaller) in-memory micro-segments/chunks;
          2. Micro-segments largely replace current CL segments as far as writers are concerned, but once full they're placed on a queue (without reordering) to be compressed;
          3. The compression queue is processed by potentially multiple threads, and whilst it may complete out of order, the results are consumed in-order only;
          4. Once compressed, they're placed on a new queue for the sync thread; whether they are synced immediately or only once a full on-disk segment's worth is buffered is probably unimportant; it would probably be good to make this configurable.

          When an out-of-band, periodic or batch sync occurs, we force-finish the current micro-segment, and consider ourselves synced() once that segment has worked its way through the pipeline and hit the disk.
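
          A minimal sketch of this pipeline (hypothetical names, plain java.util.concurrent primitives, not actual Cassandra code): compression tasks may finish out of order, but the sync thread drains the futures strictly in submission order.

            import java.io.IOException;
            import java.nio.ByteBuffer;
            import java.nio.channels.WritableByteChannel;
            import java.util.concurrent.*;

            class OrderedCompressionPipeline
            {
                private final ExecutorService compressors = Executors.newFixedThreadPool(2);
                // Futures are enqueued in submission order, so results are consumed in order
                // even though individual compression tasks may complete out of order.
                private final BlockingQueue<Future<ByteBuffer>> inFlight = new LinkedBlockingQueue<>();

                void submit(ByteBuffer microSegment)
                {
                    inFlight.add(compressors.submit(() -> compress(microSegment)));
                }

                // Sync thread: write compressed micro-segments strictly in submission order.
                void drainTo(WritableByteChannel channel) throws IOException, InterruptedException, ExecutionException
                {
                    Future<ByteBuffer> next;
                    while ((next = inFlight.poll()) != null)
                        channel.write(next.get());  // get() blocks until that micro-segment is compressed
                }

                private ByteBuffer compress(ByteBuffer in)
                {
                    return in;  // placeholder: real code would compress into a fresh buffer
                }
            }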

          There are at least a few other issues to consider:

          1. We need to be able to create oversized micro-segments for when our mutations are larger than the micro-segment size.
          2. We probably want to bound the amount of memory that can be outstanding waiting for a sync. Since we are currently limited by the buffer cache, one simple solution might be to have separate sync and flush threads, so that we can quickly flush data into the OS buffer cache without waiting for a sync. Otherwise we can manage our own internal dynamically sized pool.
          3. This design might permit us to simplify segment file allocation (CommitLogSegmentManager)?

          All in all it's not a trivial change if we want to do it right, and it is probably not going to benefit that many users, so I'm inclined to put this one on the back burner for now.

          tjake T Jake Luciani added a comment -

          All in all it's not a trivial change if we want to do it right, and it is probably not going to benefit that many users, so I'm inclined to put this one on the back burner for now.

          I'm inclined to agree; I'm not sure how big the gain would be, so I'd like to experiment first anyway.

          jasobrown Jason Brown added a comment -

          So, I'll be honest: I'm not sure why this complicated solution is better than using a simple file-level compression (conceptually, something like adding another OutputStream decorator). I don't believe we need a slavish fixation on always staying within the pre-allocated size bound of the file - using less than what's declared in the file is a minor inefficiency, and using more (making the file larger) should just write out the extra data anyway (perhaps not contiguously), perhaps with a minor/trivial penalty (one that I'm not sure is measurable or affects the real-world behavior of Cassandra). The file will get resized to the declared size in the yaml upon recycling anyway, so I don't see a real disk consumption problem.

          As I'm working on CASSANDRA-6018, the simpler solution is looking very sensible and is easy to reason about. Admittedly, I'm building that on top of 2.0 and haven't even begun thinking about merging into 2.1 (and the changes that have happened there). But before we go down this path of micro-segments and more threads/pools, can we reevaluate simpler techniques that do not render the codebase more complicated for what is arguably not that big of a win in the first place?
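
          As a minimal sketch of the decorator idea (using the JDK's DeflaterOutputStream purely as a stand-in for whatever compressor would actually be chosen; the class and file handling are hypothetical):

            import java.io.*;
            import java.util.zip.DeflaterOutputStream;

            class FileLevelCompressionSketch
            {
                private final OutputStream log;

                FileLevelCompressionSketch(File segmentFile) throws IOException
                {
                    // Every write through the decorator is compressed before reaching the file.
                    this.log = new DeflaterOutputStream(new BufferedOutputStream(new FileOutputStream(segmentFile)));
                }

                void append(byte[] serializedMutation) throws IOException
                {
                    log.write(serializedMutation);
                }

                void sync() throws IOException
                {
                    log.flush();  // a real implementation would also fsync the underlying descriptor
                }
            }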

          benedict Benedict added a comment -

          is better than using a simple file-level compression

          • Don't want to be bound by a single CPU performing compression/sync
          • Reduced file system fragmentation through obeying the size limits (less and less of a problem as we move to SSDs, but for HDDs pretty important for CL performance)
          jbellis Jonathan Ellis added a comment -

          Don't want to be bound by a single CPU performing compression/sync

          But I'm okay with being bound by a single CPU per disk, and CASSANDRA-7075 will allow multiple disks.

          Reduced file system fragmentation through obeying the size limits

          I'm okay with taking that minor hit, too. Simple is a feature.

          benedict Benedict added a comment -

          It's also potentially less capable of utilising machines with small numbers of high-performance disks (i.e. compression is generally slower than many SSDs can now write), so it may not be future-proof, and won't utilise all of our users' hardware maximally, but I doubt that will be a major issue for the moment and it should at least improve the current situation.

          If we're comfortable with that, and with dropping CLS recycling (which I am, I was never really convinced it was worth the effort) we can simplify things quite a lot. So I'm cool with that approach.

          We'll want to limit the concurrency in this case, however, to e.g. 25% of the number of CPUs (so we use only 50% of physical cores on HT CPUs), as occupying every CPU with compressing/encrypting large blocks of data would result in potentially large latency spikes.
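
          As a rough illustration of that cap (a hypothetical helper, not actual code):

            class CompressionConcurrency
            {
                // ~25% of the CPUs the JVM reports, i.e. roughly 50% of physical cores on HT machines
                static int defaultCompressionThreads()
                {
                    return Math.max(1, Runtime.getRuntime().availableProcessors() / 4);
                }
            }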

          jbellis Jonathan Ellis added a comment -

          Sounds reasonable.

          jasobrown Jason Brown added a comment -

          It's also potentially less capable of utilising machines with... / won't utilise all of our users' hardware maximally

          Let's face it: if you're hitting the CL hard, you're also hitting the memtables, et al., pretty hard as well, due to a consistently large number of writes coming in. As long as the CL doesn't fall behind, doesn't take up too many resources, and does its job without complaining, I think that's the sweet spot. I'd rather have the resources optimized for the client-facing path.

          dropping CLS recycling ... never really convinced it was worth the effort

          I tend to agree with this. I'd like to think it's a nice win, but I have no numbers to back it up. I think the pre-allocation is good (and a cheap operation, at least by LOC), but the recycling I'm not sure of. Rick Branson, it's been a few years since you implemented this in CASSANDRA-3411; how do you feel about it now, with several years of production experience?

          We'll want to limit the concurrency in this case

          Does this statement apply to this ticket or CASSANDRA-7075? Seems more apropos to the latter.

          benedict Benedict added a comment -

          Does this statement apply to this ticket or CASSANDRA-7075? Seems more apropos to the latter.

          CASSANDRA-7075 has no reason to gate the concurrency by anything other than available disks, so it would be a bit out of context there; however, without CASSANDRA-7075 this obviously isn't an issue here. Maxing out one CPU with compression would be very easy, however, so I suggest this ticket now depends on CASSANDRA-7075.

          I think the pre-allocation is good (and a cheap operation, at least by LOC)

          I disagree. If we're dropping recycling, we may as well allocate on the sync thread(s). No point having an extra thread and moving part when the vast majority of time is spent writing to disk. We're talking a few hundred micros extra cost, across an entire sync, which only ever stops a sync from progressing if the disk isn't keeping up, at which point we're done for anyway. With a 10s periodic sync, a cost of even 1ms is 0.01% of total disk bandwidth lost, and 1ms is a greatly exaggerated cost for creating a file. We also only pay this each time we roll over a segment, so by default at most once every 32MB; 1ms every 32MB is a 32GB/s throughput rate, which is never going to come close to bottlenecking anything.

          Let's face it: if you're hitting the CL hard, you're also hitting the memtables, et al., pretty hard as well, due to a consistently large number of writes coming in. As long as the CL doesn't fall behind, doesn't take up too many resources, and does its job without complaining, I think that's the sweet spot. I'd rather have the resources optimized for the client-facing path.

          Not necessarily. You could be doing in-place updates, or you could have very large memtables (esp. with offheap memtables now) which could easily outpace the CL without overwrites, without even needing to flush memtables.

          benedict Benedict added a comment -

          That said, if we have a separate thread per disk, we can have it allocate its next file after its sync completes, thereby saving this cost most of the time too, but avoiding any asynchronicity complexity.

          jasobrown Jason Brown added a comment -

          we may as well allocate on the sync thread(s) ...

          Wait, so I'm confused a bit now. Are you still proposing we go with your initial proposal as stated in the first comment of this ticket, with micro-segments and sync threads?

          benedict Benedict added a comment -

          Per-disk sync threads...

          jasobrown Jason Brown added a comment - - edited

          If we're dropping recycling, ... bottlenecking anything.

          I reread this paragraph several times; now it makes sense. I wasn't thinking about the write perf, necessarily, but about having the file contiguous on disk. However, since the commit log files are, more or less, one-time use (meaning we're not doing tons of random or sequential I/O reads on them), I guess worrying about a large contiguous block on disk isn't necessary.

          Per-disk sync threads

          I'm still not sure sync threads, in the manner initially described above, are totally necessary. If you are worried about the time for the mmap'ed buffers to flush in the same thread that's handling all the CL entry processing plus any possible compression or encryption, a simple solution might be to have a sync thread that merely invokes the mmap buffer flush. Thus, the main CL thread(s) can continue processing new entries and writing to the mmap buffer, while the sync thread eats the cost of the msync.
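
          A minimal sketch of that flush-only sync thread (hypothetical class, standard JDK APIs): writers keep appending to the mapped buffer while this thread pays the msync cost.

            import java.nio.MappedByteBuffer;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;

            class FlushOnlySyncThread
            {
                private final ExecutorService syncExecutor = Executors.newSingleThreadExecutor();

                // Returns a future that completes once the buffer's dirty pages have been msync'd.
                Future<?> requestSync(MappedByteBuffer segmentBuffer)
                {
                    return syncExecutor.submit(() -> { segmentBuffer.force(); });
                }
            }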

          jasobrown Jason Brown added a comment - - edited

          If we did try an async flush of the buffers in a sync thread, I think it would be instructive to actually measure that it is demonstrably beneficial, rather than assuming that it is. I'm not immediately sure how to measure that, but it's near 2am and coffee time doesn't begin for several more hours.

          benedict Benedict added a comment -

          Thus, the main CL thread(s) can continue processing the new entries and writing to the mmap buffer,

          I think you're still thinking in 2.0 terms. In 2.1 the mutation threads write to the commit log buffer directly. We will need these buffers to be purely in memory (not mmapped) and then, when ready for writing, passed to the sync threads to be compressed and written to disk.

          I wasn't thinking about the write perf, necessarily, but about having the file contiguous on disk

          I'm not convinced this will behave any differently performance-wise (i.e. if we were to preallocate the file's size); however, we no longer know the size of the file we'll be writing, so pre-allocating doesn't really help much anymore.

          a simple solution might be to have a sync thread that merely invokes the mmap buffer flush

          But this flush is synchronous? Specifically, we want multiple to be in flight at once. I also do not think there's any benefit to using mmap any more, since you bring it up. We should switch to regular output streams, so that we can simply wrap them in a compressed output stream. If you mean to start using async IO in Java, that isn't really any superior here - it still involves extra threads with a hidden thread pool, and extra (IMO) complexity beyond simply performing it on the relevant thread, but it would also involve compressing the entire segment before writing to the file, which introduces extra latency (or having a complex to-and-fro shuttling compressed data to an async sink, waiting for the result, etc.), for no benefit.

          jasobrown Jason Brown added a comment - - edited

          I think you're still thinking in 2.0 terms

          True, I haven't caught up with the CL changes for 2.1, but in my defense, I need to ship something ASAP on 2.0. Will do my homework on 2.1.

          I'm not convinced this will behave any differently performance-wise

          Yeah, me neither. Perhaps my meandering prose didn't express that I think we can retire the pre-allocation along with segment recycling.

          mmap

          We can still use mmap with OutputStream; in fact I've already done it locally for the encrypted commit logs. However, we should evaluate whether we want the mmap functionality. I haven't thought enough about it to have a strong opinion. Either way, you still go through the page cache, which is the part I'd like to avoid.

          If you mean to start using async IO in Java ...

          Nope; at least, not here.

          benedict Benedict added a comment -

          Sounds like we're approximately in agreement, then.

          FTR, I don't really want to incorporate these changes pre-3.0 on the mainline (though it doesn't sound like you're suggesting that)...

          blambov Branimir Lambov added a comment -

          I'm doing some testing before deciding what exactly to implement, and I'm seeing that the current implementation of memory-mapped IO for commit logs does not help slow HDDs, but has a very noticeable effect (10-20%) on SSDs. It also appears SSD writing is CPU-bound and not helped by compression.

          Based on this and the discussion above, my plan for implementing compression is the following:

          • the current CommitLogSegment will become an abstract base class with two subclasses that define how the buffer accepting the mutation is constructed, and how data is written from it.
          • one subclass will be the current implementation, with a memory-mapped buffer to achieve the least CPU overhead and generally provide the fastest path for SSDs when encryption is not required.
          • the other will construct in-memory buffers of the size of the CLS, and compress the sync sections before writing to a FileChannel (in the sync thread). It should be trivial to add encryption to this.
          • compression should improve performance on any medium if done before encryption, so I do not think it makes any sense to add support for encryption to the memory-mapped option at all.

          The ComitLogStress test I am using cannot measure the effect of recycling commit log files, and it seems that any change in that is orthogonal to adding support for compression. I will leave reevaluating the need for recycling for a separate ticket.
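
          A rough sketch of the split described in the plan above (class and method names are illustrative, not the actual patch):

            import java.nio.ByteBuffer;

            abstract class SegmentSketch
            {
                abstract ByteBuffer createBuffer();                   // how the buffer accepting mutations is built
                abstract void write(int startMarker, int endMarker);  // how a sync section reaches the file
            }

            class MemoryMappedSegmentSketch extends SegmentSketch
            {
                ByteBuffer createBuffer() { throw new UnsupportedOperationException("map a region of the segment file"); }
                void write(int start, int end) { /* force() the dirty region of the mapped buffer */ }
            }

            class CompressedSegmentSketch extends SegmentSketch
            {
                ByteBuffer createBuffer() { throw new UnsupportedOperationException("allocate an in-memory buffer of CLS size"); }
                void write(int start, int end) { /* compress [start, end) and write it via FileChannel in the sync thread */ }
            }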

          benedict Benedict added a comment -

          I'm +1 on your overall intended approach, but I'd appreciate some elaboration on what steps you've taken for testing:

          • What platforms?
          • What CPU config?
          • What measurement of CPU time allocation? (A 10-20% bump may be reasonable if it's kernel time, but I'm a little surprised it has such an impact, as we'd be talking about bulk writing large quantities of linear data, which should have reasonably low CPU impact.)

          If you could post your in-progress branches and tests to github so I can follow what you're doing, that would be great.

          blambov Branimir Lambov added a comment - - edited

          Raw test data attached in logtest.txt.

          jasobrown Jason Brown added a comment -

          Branimir Lambov Your proposal is very similar to what I've implemented wrt adding encryption to the commitlog. Instead of subclassing CLS, I created a new interface, hastily named SegmentWriter, with two impls (one non-encrypted, the other encrypted), which is then referenced from CLS. More or less the same as what you are thinking, I'm sure.

          Also, it's not clear to me what you are measuring:

          I'm seeing that the current implementation of memory-mapped IO for commit logs does not help slow HDDs, but has a very noticeable effect (10-20%) on SSDs

          Is this the time for the sync() call to complete? So, then, you are evaluating the value of using mmap? If we're not planning on using mmap, are you thinking about buffering data ourselves (which is then flushed by the sync thread)?

          blambov Branimir Lambov added a comment -

          Patch is available for review at https://github.com/blambov/cassandra/pull/2:

          Provides two implementations of commit log segments, one matching the previous memory-mapped writing method for uncompressed logs and one that uses in-memory buffers and compresses sections between sync markers before writing to the log. Replay is changed to decompress these sections and keep track of the uncompressed position to correctly identify the replay position.

          The compression class and parameters are specified in cassandra.yaml and stored in the commit log descriptor.

          Tested by the test-compression target, which now enables LZ4 compression of commit logs in addition to compression for SSTables.

          Jason Brown: Using a writer interface will probably be a little cleaner from a design point of view, but if we want to preserve all features of the current approach, the two writing methods and the log segment class are so tightly coupled that it doesn't really matter.

          The measurements I did compared the number of commit log writes of a fixed size that one could perform in a given time period (the ComitLogStress test introduced in CASSANDRA-3578 and slightly updated here). Memory-mapped IO does seem to provide some benefit at least on Windows, which for me means we should not be removing it yet.

          benedict Benedict added a comment -

          Thanks, I'll get to review this soon. As Jonathan mentioned just a moment ago in another ticket, just posting the branch is sufficient, since JIRA is our main history for discussion etc., and if we ever want to comment on code directly that's possible without a pull request.

          jasobrown Jason Brown added a comment -

          I'd like to have a chance to review as well. I'll try to get to it over the next few days.

          jasobrown Jason Brown added a comment - - edited

          Branimir Lambov The github link referenced above just brings me to a page that says "There isn't anything to compare". Tried looking at your c* repo on github as well, but no obvious branches to poke at.

          benedict Benedict added a comment -

          here

          jasobrown Jason Brown added a comment -

          I think I read "compressed-ci" as "compressed-cli" and completely skipped past it. Thanks, Benedict

          iamaleksey Aleksey Yeschenko added a comment -

          Jason Brown Are you still interested in reviewing this, or should I reassign the reviewer? We kinda want this to happen in 3.0.

          jasobrown Jason Brown added a comment -

          Please reassign as I'm not going to realistically get a chance to review for the next few weeks.

          iamaleksey Aleksey Yeschenko added a comment -

          Please reassign as I'm not going to realistically get a chance to review for the next few weeks.

          Too bad. Re-assigned to Ariel.

          aweisberg Ariel Weisberg added a comment - - edited

          I finished my review. Comments are in the pull request. It looks good and could ship as-is modulo some minor changes and if the right tests are there (or get created). I have some thoughts about potential scope creep I would advocate for. Also some other directions for enhancement the commit log could go in, as well as some reservations about performance in some cases. I only just noticed CommitLog stress, so I need to check that out so I can understand the numbers and what is being tested.

          RE CASSANDRA-7075 multiple CL disks: I see this as a workaround for not having RAID-0 of the volumes being used for the CL, and that is it. And that may introduce its own load balancing issues as well as a mess of code for scattering/gathering mutations that I am less comfortable with. Writing a CL pipeline that can do the maximum supported sequential IO to a single file is doable, and if I had a choice it is what I would rather write. From a user perspective it is a nice feature not to be forced to provide a RAID volume, and to me that should be the primary motivation.

          Also, a fascinating (to me) piece of trivia: when I tested in the past I could call force() on a mapped byte buffer far fewer times than I could call force() on a FileChannel. So if I had a battery-backed disk controller and I appended a page (in a preallocated file) and called force() in a loop with a MappedByteBuffer it would do a few hundred syncs a second, but with FileChannel.force it would do a few thousand. MBB was slow enough to be a concern for synchronous commits.

          aweisberg Ariel Weisberg added a comment - - edited

          Moving conversation to JIRA

          Ariel

          What is the test coverage, unit or otherwise, for compressed/uncompressed playback? It would be good to see the new code produce the mutations out of both compressed and uncompressed segments.

          I think that this kind of performance work isn't done without a microbenchmark, so we know where we stand and so the next person to work on the CL can track the impact of their changes. I would really like to see a JMH test added that evaluates the compressed and uncompressed versions for throughput for tiny mutations as well as "normal" ones (say 1k).

          When testing compression I have found that generating compressible data doesn't seem to work well. I never got the advertised compression rates out of Snappy. You might need to grab a real corpus, like some HTML or the text of Alice in Wonderland, to give the compressor something reasonable to chew on.

          I have some concern about waiting to do compression until the end. It's fine in the common case and configuration, but there are situations where you want to start the flush immediately and it would be nice if it compressed and wrote a few hundred kilobytes at a time. You haven't taken away the mapped option so latency sensitive users can opt for that and it can be enhanced later if there is demand.

          Compression is single-threaded, and any time spent in force() can't also be used to do compression. This can really cut into compression throughput, and it is a function of the latency of force(), which varies. With synchronous commits it can have an impact.

          The ICompressor interface and some of the implementations are also a little limiting. You should be able to compress the data directly into the memory-mapped file. Snappy-java supports that although LZ4 doesn't appear to. It also makes more sense to compress and then checksum. There is less data to checksum after compression, and another advantage of getting the order right is that you can use a JNI call to a native checksum implementation, which will be much faster. You can use a larger checksum like murmur3 over each compressed block. The checksum will be better than a 4-byte checksum, but use less space overall.

          Yet another advantage is that Murmur3 doesn't have a lookup table evicting random cache lines (on every core) as part of every single mutation that flows through C*.
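
          A small illustration of the compress-then-checksum ordering (assuming the lz4-java byte[] API, with the JDK's CRC32 standing in for the larger murmur3 checksum suggested above; names are hypothetical):

            import java.util.zip.CRC32;
            import net.jpountz.lz4.LZ4Compressor;
            import net.jpountz.lz4.LZ4Factory;

            class CompressThenChecksum
            {
                private final LZ4Compressor lz4 = LZ4Factory.fastestInstance().fastCompressor();

                // 'out' must be at least lz4.maxCompressedLength(section.length) bytes.
                long[] compressAndChecksum(byte[] section, byte[] out)
                {
                    int compressedLength = lz4.compress(section, 0, section.length, out, 0);
                    CRC32 crc = new CRC32();
                    crc.update(out, 0, compressedLength);  // checksum the (smaller) compressed block
                    return new long[] { compressedLength, crc.getValue() };
                }
            }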

          I only bring this stuff up because this is a format change and getting it all done at once might make the amount of multi-version support code that has to be written smaller as opposed to making these changes incrementally across multiple versions.

          Benedict

          I'm not really sure JMH is the right tool for disk-bound workloads?

          To add to Ariel's excellent points, another thing to consider is recycling the ByteBuffer we use, and potentially also use a DirectByteBuffer. As it stands this will create a non-trivial amount of garbage that may live long enough to be promoted (what with being disk bound).

          Ariel

          JMH certainly doesn't solve the hard problems of disk-bound workloads. It doesn't hurt either, and it has some support for parameterizing and reporting. I didn't know about CommitLogStress when I asked about JMH. We don't need to build something new if something is already in place.

          I thought about the ByteBuffer pooling and lack of DBB. ICompressor doesn't accept DBB for input or output. The allocation is large so it will go straight into the old generation. C* runs CMS pretty regularly, but maybe not that regularly. It might make sense to pool two of them and, if any more are needed beyond that, to allocate and discard. Since the allocation is CL segment size, it feels like a lot of memory to pin for one purpose.

          Benedict

          CommitLogStress is not very good, but JMH is architected around microbenchmarks or very small operations, which this doesn't really fit.

          We now support DBB in LZ4, so there's no reason not to roll it out here. Pooling as many as we need makes sense to me; no reason to constrain ourselves to just 2? Perhaps periodically prune if instances aren't being used. We're currently pinning as much memory as we're ahead of disk by, although admittedly here we only need it so long as we're writing to the page cache. Which is perhaps another optimisation - the buffer should be recycled/discarded as soon as we've compressed its contents.
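
          A minimal sketch of that recycling idea (hypothetical pool class; the 32MB segment size is an assumption): buffers are handed back as soon as compression has consumed them.

            import java.nio.ByteBuffer;
            import java.util.concurrent.ConcurrentLinkedQueue;

            class BufferPoolSketch
            {
                static final int SEGMENT_SIZE = 32 * 1024 * 1024;  // assumed CL segment size
                private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();

                ByteBuffer acquire()
                {
                    ByteBuffer b = free.poll();
                    return b != null ? (ByteBuffer) b.clear() : ByteBuffer.allocateDirect(SEGMENT_SIZE);
                }

                void release(ByteBuffer b)
                {
                    free.offer(b);  // called right after compression has consumed the buffer's contents
                }
            }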

          blambov Branimir Lambov added a comment -

          Thanks for the review, Ariel. The plan was to implement compression without introducing too much extra complexity, so I've stayed away from adding new queues and stages dedicated to compression. Even so, a pipeline that achieves the highest possible throughput is quite doable by simply using more than one sync thread, and the size of the compression window is easily controlled via the sync period. Since we don't have an incremental compression mechanism, compression necessarily has to happen at the end, i.e. when the whole of the set of mutations to compress has been written to the buffer (this is usually not at the end of the segment).

          CASSANDRA-7075 is valuable in its own right. Arguably RAID 0 is not good enough in either performance or reliability. None of the extra complexity we introduce there is made necessary by compression-related concerns, but one of the side effects of it is the availability of more than one sync thread for compression. It is a solution of sorts to the lack of CPU saturation from this patch, but it is not at all the only way to achieve it.

          This code was written before ByteBuffer compression was made available; I will definitely make use of that now, but I wonder if that should not be a separate patch so that we don't have to block on/conflict with Jake's patch.

          The playback tests are in the various RecoveryManagerTests in o.a.c.db; the tests are the same for the uncompressed (test/testold target) and compressed case (test-compressed target). For performance tests the ultimate measure is cassandra-stress; ComitLogStress is a simple microbenchmark of how much we can push that favors compression – make sure to run it with periodic rather than batch sync. A latency test is probably needed for batch mode; this probably needs some changes to the service to make sure the sync period can go low enough for the write latency to show up.

          I have not looked at the other suggestions yet; I have to switch modes from the quite different node allocation work-- give me a couple of days.

          aweisberg Ariel Weisberg added a comment -

          I made some improvements to ComitLogStress as part of my tests.

          • Compressible data is taken from a text file. I used Alice in Wonderland.
          • No shared state between load generating threads.
          • The commit log is given the opportunity to reclaim old segments.
          • Reports the average since the beginning.
          • Can specify a rate limit, although it is per thread.
          blambov Branimir Lambov added a comment - - edited

          Multiple sync threads weren't really supported by the code, but it wasn't very hard to make it work. I updated the branch to not rely on synchronization for all writing and added an option to use more than one thread for compression. With this, LZ4-compressed logs can surpass uncompressed even on SSDs (tested with a 30ms periodic sync, which to me makes much more sense than the huge default). The diff is here; a large portion of it is code indented differently (is there any way to make github recognize this?).

          Admittedly this solution doesn't use threads optimally (each thread still waits for its writes to materialize), but IMHO it is straightforward and simple.

          benedict Benedict added a comment -

          I haven't looked closely at the patch, but it seems to me that a simpler approach would be to split CL.sync() into a compression phase and a sync phase, so that there remains only one sync writer. The first phase would submit to an executor service, and the second would flush the compressed chunks to disk (possibly started immediately, but waiting on the completion of each compression chunk). If there are 0 compression threads, the compression executor can be an inline executor.
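
          As an illustration of the inline-executor fallback (a minimal sketch; the class and method names below are hypothetical, not from any patch): with zero compression threads the "executor" simply runs the task on the calling sync thread, so the behaviour degenerates to today's single-threaded sync.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Hypothetical sketch only: choosing between an inline executor and a real pool.
final class CompressionExecutors
{
    // An inline executor runs the task on the caller, i.e. the sync thread itself.
    static final Executor INLINE = Runnable::run;

    static Executor forThreadCount(int compressionThreads)
    {
        // With 0 compression threads the sync path behaves exactly as it does today.
        return compressionThreads <= 0 ? INLINE : Executors.newFixedThreadPool(compressionThreads);
    }
}
```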

          Submission to the compression stage could, in a follow-up change, be performed by a mutator rather than the sync thread; this would permit smaller-than-segment batches to be submitted, so that there is both less delay between sync starting and flushing, and so that parallelism can be achieved for the flush of a single segment.

          The main advantage, though, is it is easier to reason about than multiple sync threads.

          blambov Branimir Lambov added a comment -

          The current approach boils down to multiple sync threads which (see the sketch below):

          • form sections at regular time intervals,
          • compress the section,
          • wait for any previous syncs to have retired,
          • write and flush the compressed data,
          • retire the sync.
            (In the uncompressed case there is a single thread, which skips the second and third steps.)
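
          A minimal sketch of one such sync thread's cycle, assuming hypothetical Compressor/Disk stand-ins rather than the actual CommitLog classes: sections are compressed concurrently, but the write/flush/retire steps are serialized in the order the sections were formed.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a stand-in for one sync thread's cycle in the multi-sync-thread
// design, not the actual CommitLog classes.
class SyncCycleSketch
{
    private final AtomicLong sequence = new AtomicLong();  // assigned when a section is formed
    private long lastRetired = -1;                         // highest sequence flushed to disk

    void runOneCycle(ByteBuffer section, Compressor compressor, Disk disk) throws InterruptedException
    {
        long seq = sequence.getAndIncrement();                 // 1. form a section at the sync interval
        ByteBuffer compressed = compressor.compress(section);  // 2. compress it (concurrently per thread)
        synchronized (this)
        {
            while (lastRetired != seq - 1)                     // 3. wait for all previous syncs to retire
                wait();
            disk.writeAndFlush(compressed);                    // 4. write and flush, keeping disk writes serial
            lastRetired = seq;                                 // 5. retire this sync
            notifyAll();
        }
    }

    interface Compressor { ByteBuffer compress(ByteBuffer in); }
    interface Disk { void writeAndFlush(ByteBuffer out); }
}
```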

          Let me try to rephrase what you are saying to make sure I understand it correctly:

          • single sync thread forms sections at regular time intervals and sends them to compression executor/phase (SPMC queue),
          • compression task sends completed sections to flush executor/phase (MPSC queue, ordering and wait for the first in-flight one required),
          • flush task retires syncs in order.

          Is this what you mean? Why is this simpler, or of comparable complexity? Wouldn't the two extra queues waste resources and increase latency?

          Smaller-than-segment batches (sections) are already part of the design in both cases, assuming that the sync period is sane (e.g. ~100ms). In both approaches there's room to further separate write and flush at the expense of added complexity.

          benedict Benedict added a comment -

          assuming that the sync period is sane (e.g. ~100ms)

          The sync period is, by default, 10s, and to my knowledge this is what many users run with - so in general we will only compress each individual segment. This is still sane, since the cluster has redundancy, although a sync period of between 100ms and 500ms might be more suitable for high traffic nodes. Still, it's probably not a big deal since we only care about compression when under saturation, which should mean many segments. I only mention it, since it is an easy extension. This extension also means the sync thread may have compressed data waiting for it when it runs, reducing the latency until sync completion.

          Let me try to rephrase what you are saying to make sure I understand it correctly:

          Almost:

          • single sync thread forms sections at regular time intervals and sends them to compression executor/phase (SPMC queue),
          • sync thread waits on futures and syncs each in order

          Or, with the extension:

          • mutators periodically submit segment to compressor
          • once compressor completes an entire segment, requestExtraSync() is called (instead of in advanceAllocatingFrom())

          Why is this simpler, or of comparable complexity?

          We have two steps in explanation, instead of five. More importantly, there is no interleaving of events to reason about between the sync threads, and the "lastSync" is accurate (which is important since this could artificially pause writes). This also means future improvements here are easier and safer to deliver, because we don't have to reason about how they interplay with each other. In particular, marking lastSync roll over after each segment is synced is a natural improvement (to ensure write latencies don't spike under load) but is challenging to introduce with multiple sync threads. Since we don't expect this feature to be used widely (we expect multiple CL disks to be used instead, if you're bottlenecking) the simpler approach seems more sensible to me.

          Wouldn't the two extra queues waste resources and increase latency?

          We have zero in the typical case, and one extra queue in the uncommon use case. If we introduce enough threads that compression is faster than disk, then there will be near zero synchronization costs; of course, if that is not the case, and we are bottlenecking on compression still, then we aren't really losing much (a few micros. every few hundred millis, at 250MB/s compression speed), so it doesn't seem likely to be significant.

          With multiple sync threads, on the other hand, we're no longer honouring the sync interval; we are syncing more frequently, which may reduce disk throughput. The exact times of the syncs relative to each other may also vary, likely falling into lock-step under saturation, so that there may be short periods of many competing syncs potentially yielding pathological disk behaviour, and introducing competition for the synchronized blocks inside the segments, in effect introducing a MPMC queue and eliminating those few micros of benefit.

          (FTR, the MPMC, SPMC, MPSC aspects are likely not important here. The only concern is thread signalling, but this is the wrong order of magnitude to matter when bottlenecking on disk or compression of large chunks)

          blambov Branimir Lambov added a comment -
          • single sync thread forms sections at regular time intervals and sends them to compression executor/phase (SPMC queue),
          • sync thread waits on futures and syncs each in order

          I gave your suggestion a day of development, but it still introduces more problems than it solves. I took it this far. It's already significantly more complicated than the option I proposed, and I got worse performance and there are still some uncertainties around recycling and shutdown.

          Perhaps I did not put this clearly, but I don't see a point introducing a trigger for compression other than a sync. Reducing write latency for a 10s sync period is of no value whatsoever; with a short period, especially in batch mode where it really matters, you wouldn't want to start a compression cycle before the batch is completed anyway (if you did, a better solution to the problem is to just compress each mutation individually). We have ample flexibility in the sync period (time) and segment size (space) to be able to use compression efficiently. Granted, this may require documenting different defaults for compression, but this is something I would much prefer to live with than extra code complexity needed to work around badly chosen parameters.

          Assuming sync-only triggering and short periods, your suggestion requires decoupling of sync starts from sync completions with a queue of sync requests in flight. That's what I implemented in the code above. Am I doing something wrong?

          Going back to the previous approach (updated to fix a problem with sync possibly completing earlier than it should),

          We're now no longer honouring the sync interval; we are syncing more frequently, which may reduce disk throughput. The exact time of syncing in relation to each other may also vary, likely into lock-step under saturation, so that there may be short periods of many competing syncs potentially yielding pathological disk behaviour, and introducing competition for the synchronized blocks inside the segments, in effect introducing a MPMC queue, eliminating those few micros of benefit.

          The sync frequency is as specified, the intervals will vary, but writes to disk are still serial, so disks should behave normally. There will be competition on waitForSync if compression is constantly late but, as you say, in this case the magnitude of the overheads is too small to matter. A bigger problem is that I can imagine a pathological situation where only one thread is doing work: the others have nothing to do, but also become late waiting for it and start the next cycle at the same time.

          benedict Benedict added a comment -

          It's a bit late at night to throw this together, so this is only an expression of intent, but here's a quick hour's attempt at the approach I am advocating. It should be complete, though it ignores the extension ideas and other avenues, and was written with only around 97% care (i.e. I'm disclaiming minor mistakes). All I've done is split the sync() into a prepareSync() which returns a List of Callable<Runnable> - the Callable is the compression and write-to-buffer work, and the Runnable is the FS barrier work. The former is submitted to an executor service, which may be an inline executor. The result is then executed inline in the sync thread.
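
          A rough sketch of the shape of that split, under the assumption that prepareSync() hands back one Callable per chunk (all names here are illustrative, not necessarily those in the branch): calling the Callable compresses the chunk and writes it to the buffer, and the Runnable it returns performs the FS barrier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical sketch of the two-phase sync; names are illustrative.
final class TwoPhaseSyncSketch
{
    // 'work' is what a prepareSync()-style method is assumed to return: one task per chunk.
    static void syncOnce(List<Callable<Runnable>> work, ExecutorService compressionExecutor) throws Exception
    {
        // Phase 1: compression + write-to-buffer, possibly in parallel (or inline).
        List<Future<Runnable>> pending = new ArrayList<>();
        for (Callable<Runnable> task : work)
            pending.add(compressionExecutor.submit(task));

        // Phase 2: the single sync thread consumes results in order and runs the
        // FS barrier for each chunk, so the on-disk order is preserved.
        for (Future<Runnable> f : pending)
            f.get().run();
    }
}
```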

          I think it is a pretty simple change, and it is very clear what its behaviour is. Let me know what you think. I think we should probably split this discussion into its own ticket, though, since the work you've done on compression is good to commit as is IMO.

          Branch here

          blambov Branimir Lambov added a comment -

          Thank you, I did not realise you were interested in parallelism between segments only. Of course, what you suggest is the right solution if we are limited to that; I approached the problem with the assumption that we need shorter sections (of the same segment) that are to progress in parallel. I can see that this should work well enough with large sync periods, including the 10s default.

          I am happy to continue with either approach, or without multithreaded compression altogether. I am now going back to addressing the individual issues Ariel raised.

          benedict Benedict added a comment -

          Thank you, I did not realise you are interested in parallelism between segments only.

          Well, I considered that a natural extension, i.e. a follow-up ticket. One I still consider reasonably straightforward to add: a mutator thread can partition the commit range once it's processed ~1MB, and simply append the Callable to a shared queue. The sync thread can then drain this when it decides to initiate a sync.
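
          For illustration, a minimal sketch of such a shared queue (the names are hypothetical): mutators enqueue a compression task for each ~1MB chunk they finish, and the sync thread drains the queue when it initiates a sync.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch of the follow-up idea described above; not from the patch.
final class PendingCompressionQueue
{
    private final Queue<Callable<Runnable>> pending = new ConcurrentLinkedQueue<>();

    // Called by a mutator thread after it has appended roughly 1MB to the segment.
    void submit(Callable<Runnable> compressChunk)
    {
        pending.offer(compressChunk);
    }

    // Called by the sync thread when it decides to initiate a sync.
    List<Callable<Runnable>> drain()
    {
        List<Callable<Runnable>> drained = new ArrayList<>();
        for (Callable<Runnable> task; (task = pending.poll()) != null; )
            drained.add(task);
        return drained;
    }
}
```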

          I can see that this should work well enough with large sync periods, including the 10s default.

          I'm reasonably confident this will work as well or better for all sync periods. In particular it better guarantees honouring the sync periods, and is less likely to encourage random write behaviour. Of course, the main benefit is its simplicity.

          blambov Branimir Lambov added a comment -

          Rebased and applied review comments. New version uploaded here.

          blambov Branimir Lambov added a comment -

          Added a stress test exercising all log options to the long test suite and fixed several issues:

          • Initialization could fail with NPE or similar as some threads started within the CommitLog constructor could use fields which weren't initialized yet.
          • Made sure the length of sync sections is precise; replay could read beyond the end of the file or read stale mutations from reused compressed buffers when the length is not precise.
          • Made sure completed segment files are always closed.
          aboudreault Alan Boudreault added a comment -

          Branimir Lambov Can you confirm whether this is the desired behavior? With commitlog_compression enabled, the commitlog files created on disk are smaller than what we set in cassandra.yaml: using the default configuration, commitlogs are 23MB rather than 32MB. I thought that compression would only affect the data written inside the commitlogs, not the file size itself.

          blambov Branimir Lambov added a comment -

          Yes, this is the expected behaviour. The rationale is in the discussion above: compression with a fixed size of the compressed segment is quite a bit more complex, as it requires a level of "microsegments" that mutation threads write into, in addition to what exists now.

          aboudreault Alan Boudreault added a comment -

          Great. Thanks.

          aweisberg Ariel Weisberg added a comment -
          • In cassandra.yaml, for commitlog_compression_threads can you document that the default is 1?
          • Cap the size of the buffer pool to a fixed number of buffers. If you end up allocating more than that number, free the memory rather than pooling.
          • awaitTermination changed from block forever to block for 3600 seconds. I have never been a fan of ExecutorService's async termination. If it was supposed to finish its last task it should terminate, and if it doesn't then something is wrong.
          • I'd still like to see the JSON field in the commit log descriptor be reusable for other config parameters.
          • In the CommitLogSegment constructor the file is no longer truncated; why is that no longer necessary?

          I am more comfortable with the single threaded task generation in Benedict's diff. I would rather see that or no multi-threading for now.

          blambov Branimir Lambov added a comment -

          New version uploaded to the same github branch. Includes removal of recycling as agreed in CASSANDRA-8771, as the compressed path was using unnecessarily large amounts of memory for prepared segments to write in.

          I am more comfortable with the single threaded task generation in Benedict's diff. I would rather see that or no multi-threading for now.

          Rolled back the multithreaded sync capability.

          Cap the size of the buffer pool to a fixed number of buffers. If you end up allocating more than that number, free the memory rather then pooling.

          Done now. Previously the size of allocated memory was defined by commitlog_segment_size_in_mb, which unnecessarily connected two unrelated limits. Removing recycling fixes this problem.

          I’d still like to see the JSON field in commit log descriptor be reusable for other config parameters.

          Done. Caused a change in CommitLogTest as the descriptor is now larger.

          In CommitLogSegment constructor it is no longer truncating the file, why is it no longer necessary?

          As the size is not preset for CompressedSegment, it makes no sense to truncate/resize at open. The resize was thus moved to MemoryMappedSegment; it no longer needs to truncate, but allocates the space for the file. I am not sure it is really necessary (Windows at least resizes the file on memory-mapping), but the Java docs do not guarantee mapping will succeed if the file is not big enough, so I left it in place.
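
          A minimal sketch of that allocate-then-map step (illustrative only, not the actual MemoryMappedSegment code): extending the file to the full segment size before mapping guarantees the mapping covers the whole segment.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch only.
final class SegmentMapping
{
    static MappedByteBuffer openSegment(String path, int segmentSize) throws IOException
    {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw"))
        {
            file.setLength(segmentSize); // pre-allocate the full segment size up front
            // The mapping stays valid after the file handle is closed.
            return file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, segmentSize);
        }
    }
}
```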

          I'd appreciate a test on Linux to confirm CASSANDRA-8729 is no longer a problem with uncompressed logs after this patch.

          • In cassandra.yaml, for commitlog_compression_threads can you document that the default is 1?
          • awaitTermination changed from block forever to block for 3600 seconds. I have never been a fan of ExecutorService's async termination. If it was supposed to finish it's last task it should terminate and if it doesn't then something is wrong.

          No longer relevant.

          aweisberg Ariel Weisberg added a comment -

          I really like what you did with CommitLogStressTest.

          At https://github.com/apache/cassandra/compare/trunk...blambov:6809-compressed-logs#diff-d07279710c482983e537aed26df80400R340
          If archiving fails it appears to delete the segment now. Is that the right thing to do?

          CSLM's understanding of segment size is skewed because compressed segments are smaller than the expected segment size in reality. With real compression ratios it's going to be off by 30-50%. If its tracking could be corrected once the actual size is known, that would be nice.

          For the buffer pooling, I would be tempted not to wait for the collector to get to the DBB. If the DBB is promoted due to compaction or some other allocation hog it may not be reclaimed for some time. In CompressedSegment.close maybe null the field and then invoke the cleaner on the buffer. There is a utility method for doing that so you don't have to access the interface directly (it generates a compiler warning).
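
          For reference, a Java 8-era sketch of what invoking the cleaner amounts to (illustrative only; the real code would go through the utility method rather than casting to the internal interface, which is what produces the compiler warning):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, Java 8 specific: explicitly free a direct buffer instead of
// waiting for GC. Casting to sun.nio.ch.DirectBuffer is what triggers the warning.
final class DirectBufferCleaner
{
    static void free(ByteBuffer buffer)
    {
        if (buffer != null && buffer.isDirect())
            ((sun.nio.ch.DirectBuffer) buffer).cleaner().clean();
    }
}
```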

          Also make MAX_BUFFERPOOL_SIZE configurable via a property. I have been prefixing internal C* properties with "cassandra.". I suspect that at several hundred megabytes a second we will have more than three 32-megabyte buffers in flight. I have a personal fear of shipping constants that aren't quite right, and putting them all in properties can save waiting for code changes.

          I tested on Linux. If I drop the page cache, the new code doesn't generate reads; the old code generated a few hundred megabytes of reads.

          blambov Branimir Lambov added a comment -

          Rebased and updated here, now using ByteBuffers for compressing. Some fixes to the DeflateCompressor implementation are included.

          At https://github.com/apache/cassandra/compare/trunk...blambov:6809-compressed-logs#diff-d07279710c482983e537aed26df80400R340
          If archiving fails it appears to delete the segment now. Is that the right thing to do?

          It deletes if archival was successful (deleteFile = archiveSuccess). The old code was doing the same thing, a bit more confusingly (deleteFile = !archiveSuccess ? false : true).

          CSLM's understanding of segment size is skewed because compressed segments are less than the expected segment size in reality. With real compression ratios it's going to be off by 30-50%. If when the size is known it's tracking could be corrected it would be nice.

          Unfortunately that's not trivial. Measurement must happen when the file is done writing, which is a point CLSM doesn't currently have access to; moreover, that could be triggered from either sync() or close() on the segment, and I don't want to include in this patch the risk of not getting all the updates working correctly. Changed the description of the parameter to reflect what it is currently measuring.

          This can and should be fixed soon after, though, and I'll open an issue as soon as this is committed.

          For the buffer pooling. I would be tempted to not wait for the collector to get to the DBB. If the DBB is promoted due to compaction or some other allocation hog it may not be reclaimed for some time. In CompressedSegment.close maybe null the field then invoke the cleaner on the buffer. There is a utility method for doing that so you don't have to access the interface directly (generates a compiler warning).

          Done.

          Also make MAX_BUFFERPOOL_SIZE configurable via a property. I have been prefixing internal C* properties with "cassandra." I suspect that at several hundred megabytes a second we will have more than 3 32 megabyte buffers in flight. I have a personal fear of shipping constants that aren't quite right and putting them all in properties can save waiting for code changes.

          Good catch, the number of segments needed can depend on the sync period, thus this does need to be exposed. Done, as a non-published option in cassandra.yaml for now.

          I tested on Linux. If I drop the page cache on the new code it doesn't generate reads. I tested the old code and it generated a few hundred megabytes of reads.

          Thank you.

          benedict Benedict added a comment -

          Good catch, the number of segments needed can depend on the sync period, thus this does need to be exposed. Done, as a non-published option in cassandra.yaml for now

          What's stopping us ensuring there are always enough (or treating this as a hard limit rather than an actual number)? Seems like ensuring we have the number of buffers in flight + 1 should be enough to reach a steady state rapidly without overcommitting memory...

          aweisberg Ariel Weisberg added a comment -

          I don't think it works as a hard limit. Filesystems can hiccup for a long time and if you buffer to private memory you avoid seeing the hiccups.

          A high watermark isn't great either because you commit memory that isn't needed most of the time. Maybe I am not following what you are suggesting.

          When we have ponies we will be writing to private memory, probably around 128 megabytes, to avoid being at the mercy of the filesystem.

          Once compression is asynchronous to the filesystem and parallel the # of buffers can be small because compression will tear through fast enough to make the buffers available again. So you would have memory waiting to drain to the filesystem (128 megabytes) and a small number of buffers to aggregate log records until they are sent for compression.

          benedict Benedict added a comment - - edited

          The idea of making it a "hard limit" instead of the concrete number is to fix it as something much larger than you would like it to be, but no larger than you really must expect it to go, so that it can scale gracefully with some blips and avoid those blips having severe negative repercussions. My expectation is that whatever strategy we use here we will also use for non-compressed once we migrate to manually managed memory buffers, and it seems to me always having deallocations trail utilisation by 1 (so the first to become unused we do not deallocate, but if another becomes unused we release one of the two) probably gives us pretty good behaviour without any extra tuning knobs.
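
          A minimal sketch of that "deallocations trail utilisation by 1" policy (the class is hypothetical, not the patch's pool): at most one spare buffer is kept; releasing a second spare frees it immediately instead of pooling it.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Illustrative sketch only; the allocator and free() are stand-ins, not the
// actual Cassandra utilities.
final class TrailingPoolSketch
{
    private final ArrayDeque<ByteBuffer> spare = new ArrayDeque<>();
    private final int bufferSize;

    TrailingPoolSketch(int bufferSize) { this.bufferSize = bufferSize; }

    synchronized ByteBuffer acquire()
    {
        ByteBuffer pooled = spare.pollFirst();
        return pooled != null ? pooled : ByteBuffer.allocateDirect(bufferSize);
    }

    synchronized void release(ByteBuffer buffer)
    {
        buffer.clear();
        if (spare.isEmpty())
            spare.addFirst(buffer);   // keep a single spare around
        else
            free(buffer);             // a second spare is deallocated immediately
    }

    private void free(ByteBuffer buffer)
    {
        // In real code this would invoke the direct-buffer cleaner; here we simply
        // drop the reference and let the GC reclaim it.
    }
}
```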

          aweisberg Ariel Weisberg added a comment -

          I think what Benedict is describing, where we only deallocate if the queue has more than one available (and > 3), is a good idea. If the knob is set too low it might give better behavior at steady state?

          I would still be +1 since I think it won't matter once we multi-thread compression (and I intend to do that for 3.1).

          benedict Benedict added a comment -

          I wouldn't hold up commit over that, no. Alan Boudreault are you planning to do some further testing on this before commit, or should I go ahead and commit this to trunk?

          iamaleksey Aleksey Yeschenko added a comment -

          +1 to committing as is.

          aboudreault Alan Boudreault added a comment -

          Benedict Adding a note to retest my simple dtest with the latest branch this week. No objection to seeing this committed right now, though.

          benedict Benedict added a comment -

          Could I just confirm that Ariel Weisberg has +1'd the final changes? The +1 is a little ambiguous, as it is contextualised wrt byte buffer pooling, and doesn't make clear the last tweaks have been +1'd.

          aweisberg Ariel Weisberg added a comment -

          I did +1 the final set of changes.

          benedict Benedict added a comment -

          I would still be +1

          is too ambiguous for me to risk breaking trunk on

          benedict Benedict added a comment -

          OK, committed to trunk

          dbrosius Dave Brosius added a comment - - edited

          This commit did not apply cleanly. There is an issue (at least) between DatabaseDescriptor.java and Config.java. Please take a look.

          Oh, and a nit: Parameterized is spelled wrong.

          blambov Branimir Lambov added a comment -

          Fix-up and rename here, and diff.

          aboudreault Alan Boudreault added a comment -

          I did multiple tests locally and on cstar_perf. Everything looks good so far and the performance is great. I also tested the disk usage with the different compression algorithms, and the benefits were as good as expected. For the record, here are the links to the main cstar_perf tests:

          http://cstar.datastax.com/tests/id/a5f48f90-007c-11e5-adde-42010af0688f
          http://cstar.datastax.com/tests/id/8f0c4dfe-007c-11e5-aa79-42010af0688f
          http://cstar.datastax.com/tests/id/de7512ea-007c-11e5-adde-42010af0688f

          The DeflateCompressor is a bit slower but, IIRC, that is expected, and it is not widely used.


            People

            • Assignee: blambov Branimir Lambov
            • Reporter: benedict Benedict
            • Reviewer: Ariel Weisberg
            • Tester: Alan Boudreault