Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.8 beta 1
    • Component/s: None
    • Labels:
      None

      Description

      Compaction is currently relatively bursty: we compact as fast as we can, and then we wait for the next compaction to be possible ("hurry up and wait").

      Instead, to properly amortize compaction, you'd like to compact exactly as fast as necessary to keep the sstable count under control.

      For every new level of compaction, you need to increase the rate at which you compact: a rule of thumb we're testing on our clusters is to determine the maximum number of buckets a node can support (e.g., if the 15th bucket holds 750 GB, we're not going to have more than 15 buckets), and then multiply the flush throughput by the number of buckets to get the minimum compaction throughput needed to maintain your sstable count.

      Full explanation: for a min compaction threshold of T, the bucket at level N can contain S_N = T^N 'units' (a unit == one memtable's worth of data on disk). Every time a new unit is added, it has a 1/S_N chance of filling the bucket at level N. When the bucket at level N fills, it causes S_N units to be compacted. So each active level in your system contributes S_N * (1/S_N) = 1 amortized unit to compact every time a new unit is added.
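      The amortized-rate argument above can be checked numerically. A toy sketch (not Cassandra code; the threshold T, level count, and flush rate below are illustrative assumptions):

```java
public class CompactionRate {
    public static void main(String[] args) {
        int T = 4;             // assumed min compaction threshold: level N holds T^N units
        int levels = 15;       // assumed maximum number of buckets a node can support
        double flushMBs = 1.0; // assumed flush throughput in MB/s

        // Each new unit has a 1/T^N chance of filling level N, which then
        // compacts T^N units: the amortized cost is 1 unit per active level.
        double amortizedUnitsPerFlush = 0;
        for (int n = 1; n <= levels; n++) {
            double fillProbability = 1.0 / Math.pow(T, n);
            double unitsCompacted = Math.pow(T, n);
            amortizedUnitsPerFlush += fillProbability * unitsCompacted; // always 1.0
        }

        // Minimum compaction throughput needed to hold the sstable count steady:
        double minCompactionMBs = flushMBs * amortizedUnitsPerFlush;
        System.out.println(minCompactionMBs); // 15.0 with these assumptions
    }
}
```

      So the required rate is just the flush throughput times the number of active levels, matching the rule of thumb above.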

        Issue Links

          Activity

          Stu Hood added a comment -

          Attaching a patch for 0.6 that implements compaction throttling for a fixed value.

          Since it is relatively easy to automatically figure out the proper throughput, we might want to make throttling automatic rather than exposing a config option.

          Jonathan Ellis added a comment -

          Related: CASSANDRA-1882.

          I'm pretty uncomfortable committing changes to 0.6 compaction at this point.

          0.7 is (looks furtively over his shoulder) probably ok, if it defaults to off.

          Stu Hood added a comment -

          > I'm pretty uncomfortable committing changes to 0.6 compaction at this point.
          Oh yea... I mostly posted this particular version for rcoli's benefit: it should go into trunk, and could probably be slipped into 0.7, depending on what the final patch looks like.

          Stu Hood added a comment -

          Linking to multithreaded compaction, since multithreaded compaction without throttling is even more bursty.

          Stu Hood added a comment -

          Attaching a patch for trunk that applies atop CASSANDRA-2191, and shares the total compaction throughput between all active compactions.

          CASSANDRA-2171 is still valid, and we have a new guy on our team working on it, but it seemed appropriate to not hold back this patch too long. For now, a fixed rate can be specified via a config setting.

          Stu Hood added a comment -

          Rebased for trunk: still applies atop CASSANDRA-2191.

          amorton added a comment -

          A bunch of questions again, as I'm trying to understand more of what's going on.

          1. If compaction_throughput_kb_per_sec is always going to be in megabytes, should it change to MB?
          2. Not in your changes, but CompactionIterator.close() will stop closing files after the first one fails.
          3. I'm guessing most of the time the actual and target throughput will not match. How about moving the INFO message in throttle() to the DEBUG level? Or only logging at INFO if the thread will sleep?
          4. Should there be a config setting to turn throttling on and off? Could setting compaction_throughput_kb_per_sec to 0 disable it?
          5. For my understanding: is there a case for making the sampling interval in CompactionIterator.getReduce() configurable? Would we want different settings for fewer big rows vs. many small rows, e.g. two CFs where one is a secondary index for rows in the other, with millions of cols in one and a few in the other?

          I don't understand the approach to deciding what value compaction_throughput_kb_per_sec should have. Can you add some more info, and clarify whether you are talking about the per-CF buckets created during compaction?

          Final question: would it be better to have fewer parallel compactions where each compaction completes quickly, rather than more parallel compactions that take longer to complete, assuming that once a compaction has finished, read performance and disk usage may improve? If so, would limiting compaction by sizing the compaction thread pool be effective? (I guess the downside may be starvation for some CFs.)

          Stu Hood added a comment - edited

          1. Fixed
          2. Added and used FileUtils.close(Collection<Closeable>)
          3. targetBytesPerMS only changes when the number of active threads changes: it leads to nice (imo) periodic feedback of running compactions in the log when compactions start or finish
          4. Assuming compaction multithreading makes it in, throttling should never be disabled... for someone who really wants to disable it, setting it to a high enough value that it never kicks in should be sufficient?
          5. Maybe... but dynamically adjusting the frequency at which we throttle and update bytesRead would probably be better to do in another ticket?
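
          The fix for point 2 can be sketched as follows — a hypothetical helper mirroring what a FileUtils.close(Collection&lt;Closeable&gt;) might look like (illustrative, not the committed patch):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Collection;

public class CloseAll {
    // Close every Closeable, remembering the first failure instead of
    // aborting on it, so a bad file doesn't leak the ones after it.
    static void closeAll(Collection<? extends Closeable> closeables) throws IOException {
        IOException first = null;
        for (Closeable c : closeables) {
            try {
                c.close();
            } catch (IOException e) {
                if (first == null)
                    first = e; // keep the first error, still close the rest
            }
        }
        if (first != null)
            throw first;
    }
}
```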


          Regarding the approach to setting compaction_throughput_mb_per_sec: each bucket probably contains MIN_THRESHOLD times more data than the previous bucket, and needs to be compacted 1 / MIN_THRESHOLD times as often (see the math in the description). This means that the number of buckets influences how fast you need to compact, and that each additional bucket adds a linear amount of necessary throughput (+ 1x your flush rate). Therefore, if you have 15 bucket levels, and you are flushing 1 MB/s, you need to compact at 1 MB/s * 15.

          As an example: with MIN_THRESHOLD=2, each bucket is twice as large as the previous. Say that we have 4 levels (buckets of sizes 1, 2, 4, 8) and that we need a compaction in the largest bucket. The amount of data that needs to be compacted in that bucket will be equal to 1 more than the sum of the sizes of all the other buckets (1 + 2 + 4 == 8 - 1). So, ideally we would be able to compact those 8 units in exactly the time it takes for 1 more unit to be flushed, and for the compactions of the other buckets to trickle up and refill the largest bucket. Pheew?
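
          The sum property in that example is easy to verify directly (a toy illustration, not Cassandra code):

```java
public class BucketSum {
    public static void main(String[] args) {
        int minThreshold = 2; // each bucket is minThreshold times the previous
        int levels = 4;       // bucket sizes: 1, 2, 4, 8 units

        int largest = (int) Math.pow(minThreshold, levels - 1); // 8
        int sumOfSmaller = 0;
        for (int n = 0; n < levels - 1; n++)
            sumOfSmaller += (int) Math.pow(minThreshold, n);    // 1 + 2 + 4

        // One more unit than the sum of all smaller buckets fills the largest:
        System.out.println(sumOfSmaller + 1 == largest); // true
    }
}
```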

          CASSANDRA-2171 will allow us to calculate the flush rate, which we can then multiply by the count of buckets (note... one tiny missing piece is determining how many buckets are "empty": an empty bucket is not created in the current approach).


          > Final question. Would it be better to have fewer parallel compactions
          As a base case, with no parallelism at all, you will fall behind on compaction, because every new bucket is a chance to compact. It's a fundamental question, but I haven't thought about it... sorry.

          Ryan King added a comment -

          This has been a big improvement for us in production. It'd be nice to get more eyes on it for 0.8.

          Sylvain Lebresne added a comment -

          I think this will be quite a useful patch.

          Dividing the total compaction rate by the number of active compactions to determine each active compaction's rate may be a bit coarse-grained in some situations, but it's probably good enough, and I'm fine leaving that as a further improvement if it turns out it needs one.

          Also, we may ultimately want to throttle cleanup compaction too, and maybe have a specific rate for validation compaction. But I'm fine having that as another ticket.

          A few comments:

          • A MB is 1024 * 1024 bytes, and a second is 1000 ms. I think the definition of CompactionIterator.THROTTLE_BYTES_PER_MS takes liberties with standard units.
          • We should really allow 0 for the compaction rate to deactivate throttling (and that should really bypass throttle() completely), if only because bugs exist.
          • Having the compaction rate changeable live would be pretty cool, and it's super easy (an AtomicInteger for THROTTLE_BYTES_PER_MS with some JMX call in CompactionManager to change it should be enough), so let's do it now.
          • In theory, there is a risk of division by zero because targetBytesPerMs can be 0. Granted, this is more than unlikely given that the minimum value for THROTTLE is 1024, but nevertheless, let's be on the safe side.
          • In the same vein, excessBytes can be negative. Pretty sure sleep just treats any negative number as 0, but it would be better to actually check for all those limit cases.
          • I'd also be in favor of having the logging of changes to targetBytesPerMS at DEBUG level, because there'll be one message each time you start a compaction and n messages each time the number of active compactions changes, and we'll print them even when we aren't throttling anything, so it will be noise for most people. Anyway, really no big deal.
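
          Several of these points (the MB-to-bytes/ms conversion, 0 disabling throttling, the division-by-zero guard, and clamping a negative sleep) can be combined into one sketch. The names below are illustrative, not the actual CompactionIterator code:

```java
public class ThrottleSketch {
    // 1 MB = 1024 * 1024 bytes; 1 second = 1000 ms.
    static long targetBytesPerMs(int throughputMbPerSec, int activeCompactions) {
        if (throughputMbPerSec <= 0 || activeCompactions <= 0)
            return 0; // 0 disables throttling entirely
        return ((long) throughputMbPerSec * 1024 * 1024) / (1000L * activeCompactions);
    }

    static long sleepMs(long bytesRead, long elapsedMs, long targetBytesPerMs) {
        if (targetBytesPerMs <= 0)
            return 0; // disabled, and also guards against division by zero
        long expectedMs = bytesRead / targetBytesPerMs;
        // If we're already slower than the target, never "sleep" a negative time.
        return Math.max(0, expectedMs - elapsedMs);
    }

    public static void main(String[] args) {
        // 16 MB/s shared across 2 active compactions -> 8388 bytes/ms each.
        System.out.println(targetBytesPerMs(16, 2));        // 8388
        // Read ~1000 ms worth of bytes in only 500 ms: sleep off the difference.
        System.out.println(sleepMs(8_388_608L, 500, 8388)); // 500
        System.out.println(targetBytesPerMs(0, 2));         // 0: throttling off
    }
}
```

          Making the rate live-adjustable would then just mean reading the throughput from an AtomicInteger, as suggested above.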
          Stu Hood added a comment -
          • Fixed bytes-per-ms calculation
          • Allow disabling compaction throttling
          • Add JMX method to adjust throttling
          • Added div-by-zero protection for targetBytesPerMS
          • excessBytes being negative doesn't matter because we check for a positive value before sleeping

          Still applies atop #2191.

          amorton added a comment -

          From a discussion on the user list http://www.mail-archive.com/user@cassandra.apache.org/msg12027.html

          CompactionManager.submitSSTableBuild() and submitIndexBuild() are used when receiving streams from other nodes, but they do not use CompactionIterator, so they are not covered by this ticket.

          Want to create another ticket just for those tasks or reopen CASSANDRA-1882 and punt it to a future version?

          Stu Hood added a comment - edited

          I'd prefer to tackle those in a future ticket... for one thing, I don't think it is clear cut whether we should throttle them.

          EDIT: ... because I suspect that the actual file transfer causes more load.

          Stu Hood added a comment -

          Rebased: ready for review.

          Sylvain Lebresne added a comment -

          Committed as r1090979, thanks.

          Hudson added a comment -

          Integrated in Cassandra #848 (See https://hudson.apache.org/hudson/job/Cassandra/848/)
          Compaction throttling
          patch by stuhood; reviewed by slebresne for CASSANDRA-2156


            People

            • Assignee:
              Stu Hood
            • Reporter:
              Stu Hood
            • Reviewer:
              Sylvain Lebresne
            • Votes:
              1
            • Watchers:
              7
