Cassandra
CASSANDRA-5371

Perform size-tiered compactions in L0 ("hybrid compaction")

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.0 beta 1
    • Component/s: Core
    • Labels:
      None

      Description

      If LCS gets behind, read performance deteriorates as we have to check bloom filters on many sstables in L0. For wide rows, this can mean having to seek for each one, since the BF doesn't help us reject much.

      Performing size-tiered compaction in L0 will mitigate this until we can catch up on merging it into higher levels.
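      For illustration, a minimal sketch of that fallback decision with made-up names (this is not the committed Cassandra code; L0_FALLBACK_THRESHOLD, SSTable and the helper below are hypothetical):

        import java.util.*;
        import java.util.stream.*;

        // Illustrative sketch only, not the actual LeveledCompactionStrategy code.
        public class HybridL0Sketch
        {
            static final int L0_FALLBACK_THRESHOLD = 32; // hypothetical "L0 is behind" cutoff

            record SSTable(String name, long sizeBytes) {}

            /** Normally return the leveled candidate; if L0 is badly behind, size-tier L0 instead. */
            static List<SSTable> nextCompaction(List<SSTable> level0, List<SSTable> leveledCandidate)
            {
                if (level0.size() > L0_FALLBACK_THRESHOLD)
                    return biggestSizeTieredBucket(level0); // stop-gap STCS round in L0
                return leveledCandidate;                    // normal LCS behaviour
            }

            /** Crude size-tiering: bucket sstables by size order of magnitude, return the fullest bucket. */
            static List<SSTable> biggestSizeTieredBucket(List<SSTable> sstables)
            {
                Map<Long, List<SSTable>> buckets = sstables.stream()
                    .collect(Collectors.groupingBy(s -> Math.round(Math.log(Math.max(1, s.sizeBytes())) / Math.log(2))));
                return buckets.values().stream().max(Comparator.comparingInt(List::size)).orElse(List.of());
            }
        }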

          Activity

          Jeremiah Jordan added a comment -

          I could see this possibly helping if you left the non-L0 levels alone and went to STCS for L0 until the write load stopped. But you have to come up with a heuristic to know when to shut off LCS. So a cluster with a periodic write load, which was too high for LCS's increased IO needs, would revert to STCS of L0 only until the load dropped. You would then have to play catch-up shoving all that L0 data into the other levels. I could see use cases where this would be useful, such as periodic large data dumps into a cluster. You would have to be careful that there was enough downtime between dumps for LCS to catch up.

          T Jake Luciani added a comment -

          This initial version puts the newly flushed memtables into a queue and, when there are four, size-tiers them. So you get 1/4 as many sstables in L0.
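          As a rough illustration of that batching idea (not the actual patch; the queue, the BATCH threshold and compactTogether() below are hypothetical stand-ins):

            import java.util.*;

            // Sketch: batch newly flushed sstables in fours and size-tier each batch.
            class L0FlushBatcher
            {
                private static final int BATCH = 4;
                private final Deque<String> pendingFlushes = new ArrayDeque<>();

                /** Called for each newly flushed sstable; compacts every four of them into one. */
                synchronized void onFlush(String sstable)
                {
                    pendingFlushes.add(sstable);
                    if (pendingFlushes.size() >= BATCH)
                    {
                        List<String> batch = new ArrayList<>(pendingFlushes);
                        pendingFlushes.clear();
                        compactTogether(batch); // one L0 sstable instead of four
                    }
                }

                void compactTogether(List<String> sstables)
                {
                    // placeholder: the real version would submit a size-tiered compaction task
                    System.out.println("compacting " + sstables);
                }
            }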

          Jonathan Ellis added a comment -

          Alternate implementation pushed to http://github.com/jbellis/cassandra/commits/5371 with the following improvements:

          • Only applies STCS to L0 if L0 gets behind (defined as "accumulates more than MAX_COMPACTING_L0 sstables")
          • Performs true STCS, rather than "compact in sets of four and then never again"
          T Jake Luciani added a comment -

          Oh good, this is what I wanted the implementation to end up being.

          In LeveledManifest.getCompactionCandidates:

          I think there is a bug in the size-tier candidate checks. You seem to be size-tiering across all the non-compacting sstables and not just the level0 ones. I think you meant to intersect the level0 sstables with the non-compacting ones. You should also add a check after that to make sure the non-compacting level0 sstables are still > MAX_COMPACTING_L0.

          Also, the code only checks for STCS when a higher level is ready to be compacted. Maybe move this to the top, before the higher-level checks. We know the higher levels are seek-bound, but the code should try to keep up with level0 flushes as much as possible.
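          For reference, a sketch of the filter being suggested here, assuming Guava's Sets utility (which Cassandra ships); the String stand-ins and the MAX_COMPACTING_L0 value are illustrative, not the real types:

            import java.util.Set;
            import com.google.common.collect.Sets;

            // Only consider level0 sstables that are not already compacting, and only
            // fall back to STCS if that filtered set is still too big.
            class L0CandidateSketch
            {
                static final int MAX_COMPACTING_L0 = 32;

                static Set<String> stcsCandidates(Set<String> level0, Set<String> nonCompacting)
                {
                    Set<String> idleL0 = Sets.intersection(level0, nonCompacting); // L0 minus in-flight compactions
                    return idleL0.size() > MAX_COMPACTING_L0 ? idleL0 : Set.of();
                }
            }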

          Jonathan Ellis added a comment -

          I think you meant to intersect the level0 sstables with the non-compacting ones

          Right. Fix pushed.

          the code only checks for STCS when a higher level is ready to be compacted

          The idea is, we'd prefer to do normal LCS compaction on L0. So if the higher levels are okay, we'll treat L0 the same as before. But if we do need to compact a higher level, we'll first check and see if L0 is far enough behind that we should do an STCS round there as a stop-gap.

          You should also add a check after that to make sure the non-compacting level0 sstables are still > MAX_COMPACTING_L0

          I think it's more correct as written – basically, we're doing L0 out of turn, since for max throughput we'd do the higher level next. So, we'll do L0 STCS until it's under MAX_COMPACTING_L0, then we'll go back to the higher levels until we catch up and can actually apply leveling to L0.
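          A compact sketch of that ordering, with stand-in names (not the real LeveledManifest API):

            import java.util.Optional;

            // Prefer normal LCS; only run an STCS round in L0 when a higher level is
            // ready to compact AND L0 has fallen badly behind.
            class CandidateOrderSketch
            {
                static final int MAX_COMPACTING_L0 = 32;

                interface Manifest
                {
                    int levelNeedingCompaction(); // -1 if nothing is ready
                    int l0SstableCount();
                }

                enum TaskKind { LEVELED, STCS_IN_L0 }
                record Task(TaskKind kind, int level) {}

                static Optional<Task> nextTask(Manifest manifest)
                {
                    int level = manifest.levelNeedingCompaction();
                    if (level < 0)
                        return Optional.empty(); // fully caught up, nothing to do

                    // L0 is done "out of turn": the stop-gap only fires when we were about
                    // to spend I/O on a higher level anyway and L0 has too many sstables.
                    if (level > 0 && manifest.l0SstableCount() > MAX_COMPACTING_L0)
                        return Optional.of(new Task(TaskKind.STCS_IN_L0, 0));

                    return Optional.of(new Task(TaskKind.LEVELED, level));
                }
            }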

          Sylvain Lebresne added a comment -

          For my own curiosity, do we have performance numbers for this (including non-SSD tests, not only SSD)?

          A priori, I'm not fully sold on this being always a win (or even most of the time when L0 is behind). That is, I understand the reasoning that lots of SSTables in L0 is bad for reads, but at the same time, if you compact things with STCS in L0, a lot of the work you've done will be redone when you compact your now bigger L0 sstable against L1. I.e. those STCS compactions don't help you make progress as far as leveling is concerned, so it seems like wasted work overall. Besides, in theory, our LCS is supposed to be able to compact a large amount of L0 sstables into L1 to help with the "I'm behind on L0 but it's just a spike in load" case. Now I guess if you've pushed a lot of data into L1 and get behind again in L0, then it's not fun because all of L1 needs to be included in the L0 compaction. But if you are constantly behind, doesn't that mean you have bigger problems (and/or that you should just use STCS)?

          Basically I wonder if there won't be a number of scenarios where, because you get a bit behind on L0 once, the I/O you "waste" doing STCS in L0 puts you even further behind on your leveling and you end up doing mostly STCS, while letting LCS do its job would have been fine overall.

          That is, I'm happy with this if it makes things clearly better in practice more often than not; it's just that intellectually it's not obvious to me that that's the case (note that I'm not saying it's obviously a bad idea either).

          T Jake Luciani added a comment - edited

          Jonathan Ellis, let me run some tests, but the code looks good now.

          Sylvain Lebresne

          do we have performance numbers for this (including, not only on SSD tests)?

          We only have SSDs, and I think LCS only ever makes sense on SSD. If we want to support HDD then I agree this is definitely more IO overall. The performance numbers for our use case went from all reads timing out under LCS to reads rarely timing out with the original patch. The stress tool doesn't have a wide-row scenario, so it's hard to simulate out of the box.

          doesn't that mean you have bigger problems (and/or that you should just use STCS)?

          You are right, this does require that writes die down at some point, otherwise you end up with STCS. Jonathan Ellis mentions this in the comments.

          STCS isn't viable for the LCS use cases. I don't see how having this (on SSD) would not help all LCS use cases, since LCS is for wide rows or heavy updates. The point of this is to avoid the situation where all sstables in L0 contain a portion of the row, which requires reading them all. One thing to keep in mind: if you have a wide row and end up with an STCS-compacted row of 10MB while LCS has a 5MB limit, you still end up with a 10MB sstable with a single row in it, so the higher levels do benefit from STCS in this case.

          Sylvain Lebresne added a comment -

          The stress tool doesn't have a wide-row scenario, so it's hard to simulate out of the box

          Agreed, and that's definitely lacking. I believe there are a few knobs that allow wide-ish rows, but that's probably not very realistic.

          I think LCS only ever makes sense on SSD
          since LCS is for wide rows or heavy updates

          I'm not sure I agree. Maybe there is some truth to it in our current implementation, but that would then be more a quirk of the implementation than the goal. Typically, I'm not really sure why only wide rows would benefit from it. There is certainly nothing in theory that makes it so. As for "it's for heavy updates only", I think that LCS has a number of nice properties (like avoiding huge files that require half of your disk in free space) that are nice even if you have a moderate to low update rate (and in that case you can definitely afford LCS on HDD). More concretely, I'm pretty sure we have tons of users running LCS on HDD.

          Anyway, all this to say that I don't necessarily agree with optimizing LCS for heavy writes + wide rows + SSD if that's done at the expense of all other types of workload (and I'm not saying that's what this patch is doing, just that discarding other types of workload as unimportant is not ok imo).

          LCS has a 5MB limit you still end up with a 10MB sstable with a single row

          If having 10MB sstables (due to a row too wide to split) is a problem, then you should either not use LCS or pick a 10MB limit for LCS, not 5MB.

          Anyway, I'm not vetoing this or anything like this. Just trying to get a better understanding of why this is a good thing to do in general.

          Jonathan Ellis added a comment -

          Basically I wonder if there won't be a number of scenarios where, because you get a bit behind on L0 once, the I/O you "waste" doing STCS in L0 puts you even further behind on your leveling

          That's exactly the case, which is why we only apply STCS to L0 when it's fairly badly behind, i.e., when we can conclude two things:

          1. if the current workload continues, it's not going to magically catch up any time soon
          2. reads are starting to get into trouble

          Note that #2 will cause a vicious cycle, slowing down compaction in turn.

          So while I can hypothesize workloads that burst just long enough to cause STCS to kick in before stopping, thus "wasting" iops, I think for the vast majority of workloads this is a good "safety valve," and specifically not worth adding a config option to disable.

          However, I do think it's worth creating a ticket to allow STCS config options to be applied to the size-tiering done by LCS, and specifically to allow configuring MAX_COMPACTING_L0 via the max sstables threshold, which I think may adequately address your concern.

          T Jake Luciani added a comment -

          I'm going to test this out to show how it helps our workload.

          In the meantime I think this is fine to commit for 2.0 if you'd like to get it in now.

          Jonathan Ellis added a comment -

          All right. Rebased and committed, and created CASSANDRA-5439 for the options application.

          Rick Branson added a comment -

          Is this just waiting on T Jake Luciani's test to backport to 1.2?

          Yesterday we bootstrapped our first new node on our first LCS cluster, where each node only had ~50GB of data, and it took 6 hours to complete the bootstrap, even after running the CPUs hot by bumping compaction throughput up to 64MB/s. We probably could have stood to raise this to 128MB/s and pegged them, but I dread to think what this would be like if we moved some larger, read-heavy data sets to Cassandra under LCS. Jake seems to think this patch will help with that.

          http://i.imgur.com/LpdAKyc.png
          http://i.imgur.com/ZsgEB9G.png

          This is on an EC2 hi1.4xlarge, which is a 16-core box w/60GB RAM, 2TB of SSD storage, and 10GigE.

          We also have a cluster of m1.xlarges (4-core, 15GB, 2TB rust), each with ~300GB of relatively cold data under STCS. Considering the spinning-rust cluster with 1GigE and 16MB/s compaction throughput can bootstrap a new node in < 2 hours with 6x as much data, we will definitely be trying this hybrid compaction on the SSD cluster running LCS at the moment.

          Jonathan Ellis added a comment -

          I'm not backporting this to a stable release. It's a lot more involved than the 4KB proof of concept. You can probably collaborate w/ Jake on a 1.2-appropriate alternative though.

          That said, the main benefit of this is not that it magically makes LCS faster (it doesn't), but that when it does get behind your reads don't suffer so much.

          T Jake Luciani added a comment -

          Right, it takes a long time but it will keep reads happier.

          Why do you need LCS on your dataset, Rick? Is it a wide row?

          Bartłomiej Romański added a comment - edited

          Hi,

          We hit the same bug in production recently. We worked around it by switching to STCS for a few days, letting it stabilize, and then going back to LCS. Quite a long trip, but a fully successful one.

          In our case we have a lot of sstables at L0 as a result of a migration. Because of another bug in sstableloader (CASSANDRA-6527), we finally ended up simply copying all sstable files from the old cluster to the new one.

          After the migration we had over 10k sstables (160MB per file) on each node. Of course, STCS-fallback activates automatically in that case.

          I wonder if a similar situation will happen after a classic bootstrap. Will streaming during bootstrapping put sstables at L0 or at their original level?

          If it puts them all at L0, then I'm not sure that falling back to STCS is the best way to handle the situation. I've read the comment in the code and I'm aware of why it is a good thing to do if we have too many sstables at L0 as a result of too many random inserts. We have a lot of sstables, each of them covers the whole ring, and there's simply no better option.

          However, after a bootstrap the situation looks a bit different. The loaded sstables already have very small ranges! We just have to tidy up a bit and everything should be OK. STCS ignores that completely, and after a while we have a bit fewer sstables, but each of them covers the whole ring instead of just a small part. I believe that in this case letting LCS do the job is a better option than allowing STCS to mix everything up first.

          Is there a way to disable the STCS fallback? I'll be glad to test this option the next time we do a similar operation.

          Ravi Prasad added a comment -

          +1 on Bartłomiej Romański's comment.
          Even during a dead-node replace (using replace_address), streaming puts all sstables in L0. 2.0.x switches to STCS and, in doing so, also creates larger sstables, which means more free disk space has to be left for them to be compacted later into higher levels. LCS is known to lower the amount of free disk space (headroom) needed for compaction; this is no longer true with LCS in the above scenarios.
          Is there a way to disable the STCS fallback, please?

          Bartłomiej Romański added a comment -

          Ravi, could you confirm that streaming puts all sstables in L0? In that case I think we should open a separate issue instead of commenting on a closed one.

          Brandon Williams added a comment -

          They have to go to L0 since preserving the level across machines doesn't make any sense. Please do open a new issue.

          Bartłomiej Romański added a comment -

          I've just created CASSANDRA-6621 describing the issue from the last comments.


            People

            • Assignee: Jonathan Ellis
            • Reporter: Jonathan Ellis
            • Reviewer: T Jake Luciani
            • Votes: 0
            • Watchers: 11
