Details

      Description

      When there are no other compactions to do, we trigger a single-sstable compaction if there are more than X% droppable tombstones in the sstable.

      In this ticket we should try to include overlapping sstables in those compactions so that we can actually drop the tombstones. This might only be doable with LCS; with STCS we would probably end up including all sstables.
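A minimal sketch of the trigger described above, in Python (illustrative only - the 20% threshold and the statistic names are assumptions, not Cassandra's actual code):

```python
def worth_dropping_tombstones(droppable_tombstones, total_cells, threshold=0.20):
    """Trigger a single-sstable tombstone compaction when the estimated
    ratio of droppable tombstones in the sstable exceeds the threshold."""
    if total_cells == 0:
        return False
    return droppable_tombstones / total_cells > threshold

# 30% droppable tombstones is above the threshold; 10% is not.
print(worth_dropping_tombstones(30, 100), worth_dropping_tombstones(10, 100))
# prints: True False
```

The ticket's point is that even when this fires, the tombstones often cannot be dropped because the data they shadow lives in other, non-included sstables.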

      1. 7019-2-system.log (562 kB) - Philip Thompson
      2. 7019-debug.log (770 kB) - Philip Thompson
      3. cell.tar.gz (1.63 MB) - Philip Thompson
      4. control.tar.gz (2.51 MB) - Philip Thompson
      5. none.tar.gz (2.46 MB) - Philip Thompson
      6. row.tar.gz (2.25 MB) - Philip Thompson
      7. temp-plot.html (1.17 MB) - Philip Thompson

        Issue Links

          Activity

          Jeremiah Jordan added a comment -

          > once we detect that an sstable has more than x% (20%?) expired tombstones, we start one of these compactions, and include all overlapping sstables that contain older data.

          I like this (up to a max of course). We need a tombstone compaction that will pull in sstables that the tombstones mark deleted.

          sankalp kohli added a comment -

          I also like this idea. If you have IOPS to spare, why not compact across levels and get rid of extra data?
          I think we should call it "multilevel compaction". The number of tombstones is one way to trigger it.

          Alexey Plotnik added a comment -

          That's what we need. We have LCS and a lot of SSTables, and the compaction process always retains a lot of tombstones. We need something like a super-compaction, or tombstone-compaction, or whatever you want to call it. There must be a procedure similar to Cleanup, but for tombstone deletion. I like it.

          sankalp kohli added a comment -

          Marcus Eriksson Thanks for picking this up. I think we can do other optimizations, like putting all tombstones in the last level so that they can be dropped easily once they are past gc grace. Once we have repair-aware gc grace, this will not be required.

          Marcus Eriksson added a comment -

          sankalp kohli I'll post a proof-of-concept patch for option 1 in the description tomorrow. The idea is basically to run a major compaction, but have the compaction strategy decide on an 'optimal' sstable distribution for the strategy instead of just creating one big sstable. For LCS it simply fills levels from level 1 and up; for STCS it creates sstables where one has 50% of the data, one 25%, etc., until the sstables get too small.

          This is mostly for the "oh crap, we have a ton of tombstones and need to get rid of them" case, not for the day-to-day case; we need to figure out something more for that (like your idea, perhaps).

          Marcus Eriksson added a comment - edited

          branch here: https://github.com/krummas/cassandra/commits/marcuse/7019-2

          triggered with nodetool compact -o <ks> <cf>

          It writes fully compacted partitions - each partition will only be in one single sstable - my first idea was to put the cells back in the corresponding files where they were found (minus tombstones), but it felt wrong to not actually write the compacted partition out when we have it.

          LCS:

          • creates an 'optimal' leveling - it takes all existing files, compacts them, and starts filling each level from L0 up
            • note that (if we have token range 0 -> 1000) L1 will get tokens 0 -> 10, L2 11 -> 100 and L3 101 -> 1000. I haven't thought much about whether this is good or bad for future compactions.

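The fill-from-the-bottom leveling above can be sketched like this (a fanout of 10 and an L1 capacity of 10 token units are assumed constants for illustration, not the patch's actual values):

```python
def fill_levels(total, fanout=10, l1_capacity=10):
    """Greedily assign `total` units of data to levels L1, L2, ...,
    where each level holds `fanout` times as much as the previous one."""
    levels, capacity, level, remaining = [], l1_capacity, 1, total
    while remaining > 0:
        take = min(capacity, remaining)
        levels.append((level, take))
        remaining -= take
        capacity *= fanout
        level += 1
    return levels

print(fill_levels(1000))  # [(1, 10), (2, 100), (3, 890)]
```

Because the levels are filled in token order, the lowest levels end up holding the lowest tokens - the skew discussed in the note above.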
          STCS:

          • calculates an 'optimal' distribution of sstables, currently it makes them 50%, 25%, 12.5% ... of total data size until the smallest sstable would be sub 50MB, then puts all the rest in the last sstable. If anyone has a more optimal sstable distribution, please let me know
            • the sstables will be non-overlapping, it starts writing the biggest sstable first and continues with the rest once 50% is in that
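The 50%/25%/12.5% split can be sketched as follows (sizes in MB for readability; this is a hypothetical illustration of the scheme, not the patch's code):

```python
def stcs_optimal_sizes(total_mb, min_sstable_mb=50):
    """Split the data 50%, 25%, 12.5%, ... until the next sstable would
    drop below the minimum size, then put the remainder in the last one."""
    sizes, remaining = [], total_mb
    while remaining // 2 >= min_sstable_mb:
        sizes.append(remaining // 2)
        remaining -= remaining // 2
    sizes.append(remaining)
    return sizes

print(stcs_optimal_sizes(400))  # [200, 100, 50, 50]
```

Note the last two sstables end up the same size: once halving would go below the 50 MB floor, the remainder is written out whole.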
          Jeremiah Jordan added a comment -

          Since this is going in 3.0, maybe we should make this the "default" nodetool compact. I don't know of any case where the STCS "put everything in one file" is really what people want. And for LCS all we used to do is run the compaction task like normal. If we still want a way to "kick" compaction for LCS, we could add a new "nodetool checkcompaction" command or something that just schedules the compaction manager to run (and does that for STCS and LCS). Doing that is useful when someone changes compaction settings and there are not currently writes happening to the system, so making it an explicit command sounds right to me.

          Carl Yeksigian added a comment -

          For LCS, we might be artificially penalizing early tokens. What if we started at the highest level in which we are currently storing data, instead of at L1? It is a good proxy for the size of the data we are currently storing, and it avoids unnecessarily recompacting data just because we placed it in such a low level.

          I'm +1 to Jeremiah Jordan's proposal to change the default to this; I'd rather just add an option to compact to start minor compactions instead of adding a new command.

          sankalp kohli added a comment -

          Carl Yeksigian Can you explain your idea about placing sstables? If the application is using up to, say, L4, should we fill L4, then L3, and so on?

          Carl Yeksigian added a comment -

          I was thinking L4, then L5 (as in this patch, currently). Ideally, we would pick the level where all of the sstables would fit, but we don't know how many sstables will end up being produced by the compaction in the end, so this seems like a compromise. This would be similar to the thinking in CASSANDRA-6323.

          Marcus Eriksson added a comment -

          The problem with starting in high levels is that it will take a long time before that data gets included in a (minor) compaction. This is basically a major compaction (like in current STCS)

          An alternative to putting the low tokens in the lower levels is to write all levels at the same time and randomly distribute the tokens over the levels (putting 1% in L1, 10% in L2, and 89% in L3), but I can't really see any difference compared to having the low tokens in one sstable; the number of overlapping tokens between a newly flushed file in L0 and L1 should be the same (if tokens are evenly distributed over the flushed sstable).

          Carl Yeksigian added a comment -

          I have no problem with making it consistent but arbitrary which tokens go into L1/L2; I just thought it would be better to put all of them in the same level since they'll move there eventually. I think you're right, though: they would end up not being included in minor compactions, so it would continually require a major tombstone compaction.

          Jonathan Ellis added a comment - edited

          The problem with starting in high levels is that it will take a long time before that data gets included in a (minor) compaction.

          But you still have that problem with the "start at L1 and fill every level up as you go" approach, just with 90% of your data instead of 100%.

          IMO the two options that make the most sense are:

          1. Just rewrite the existing tables minus tombstones without merging or changing levels, as originally proposed
          2. Write all the sstables out, then pick a level for them when complete such that all the sstables fit in the level (and they don't overlap with anything flushed + compacted by other threads in the meantime)
          Marcus Eriksson added a comment -

          Just rewrite the existing tables minus tombstones without merging or changing levels, as originally proposed

          Yeah, this is probably the way to go; it just hurts to have the compacted partition in hand and throw it away.

          Aleksey Yeschenko added a comment -

          I dislike option 1, for personal reasons. Was really hoping to utilize major tombstone compaction for CASSANDRA-7975. Without that, I don't see a direct solution for LCS counter tables.

          Marcus Eriksson added a comment -

          Aleksey Yeschenko we could get rid of the shards in the Reducer the same way we would get rid of tombstones (i.e., merge all counter shards and put them in a single sstable).

          Though it would perhaps be nice to keep this as a major compaction for LCS, in addition to option 1.

          Jeremiah Jordan added a comment -

          I think we should definitely keep the "major compact LCS" option. And change STCS major compaction to this new way. If we also want to have the "delete/tombstone only" compaction that just rewrites existing files without expired tombstones and deleted data (kind of like what half the people out there think cleanup currently does), that works too.

          Though I think that mode is a little harder to implement if we want it to remove tombstones and dead data, as you need to rewrite multiple files at once for that; without doing that, you aren't going to be able to remove all the old tombstones - just the ones we were being too conservative about throwing out because we only do a bloom filter check, rather than a full read, to see whether the tombstone still shadows existing data.

          Marcus Eriksson added a comment -

          WDYT Jonathan Ellis should I finish up the patch as a major compaction for LCS?

          Jonathan Ellis added a comment -

          That probably fits how compaction thinks of its job, better than trying to do it 1:1. +1 from me.

          T Jake Luciani added a comment -

          Just a note that this should also work on repaired sstables. As mentioned in CASSANDRA-7272 we repair the entire partition so we will end up with N copies of a partition in the repaired sstables.

          Marcus Eriksson added a comment -

          moving this patch to CASSANDRA-7272

          Jonathan Ellis added a comment -

          Did you mark 7272 as duplicate by mistake instead of this one?

          Marcus Eriksson added a comment - edited

          Updated titles and reopened 7272 - this ticket is about improving the single-sstable tombstone compactions while 7272 is adding major compaction to LCS

          sankalp kohli added a comment -

          "7019 is adding major compaction to LCS"
          You mean 7272

          Marcus Eriksson added a comment -

          haha, thanks! updated

          Jonathan Ellis added a comment -

          Assigning to Branimir. Marcus, can you summarize what the scope is and where to start?

          Marcus Eriksson added a comment -

          What we want is to be able to drop more tombstones by doing a specific tombstone removal compaction.

          To be able to drop as many tombstones as possible, we want to include as many overlapping sstables as we can in this compaction.

          Currently we do this with a single sstable - we find one sstable, estimate how many droppable tombstones it has, and if more than X% (20, IIRC) of all keys in the sstable are droppable tombstones, we trigger a single-sstable compaction including just that sstable. This is often quite ineffective, as the tombstones can cover data in other sstables.

          Start by reading up on SizeTieredCompactionStrategy#worthDroppingTombstones()

          So, we need to

          1. Find a good candidate sstable
          2. Include all sstables that overlap that sstable and contain older data (a tombstone can only cover older data in other sstables)
          3. Start a compaction
          4. Figure out a good way to write out the data to disk (for STCS, for example, all sstables might overlap each other, which would cause a major compaction; for LCS we need to distribute the result in the leveled hierarchy somehow). This is the trickiest part of the ticket. One way I've thought about is to track which sstable the data is coming from, map each input sstable to an output sstable, and write all non-tombstone data to those. The result would be the same number of sstables as the input, minus tombstones (and any covered data).
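Steps 1-2 above might look roughly like this (an illustrative sketch; the `SSTable` record and using max timestamp as the "contains older data" test are simplifying assumptions):

```python
from collections import namedtuple

# Hypothetical per-sstable metadata for the sketch.
SSTable = namedtuple(
    "SSTable", "name first_token last_token max_timestamp droppable_ratio")

def tombstone_compaction_candidates(sstables, threshold=0.20):
    """Step 1: pick the sstable whose droppable-tombstone ratio is highest
    and above the threshold. Step 2: add every sstable that overlaps it in
    token range and holds older data (a tombstone can only cover older data)."""
    over = [s for s in sstables if s.droppable_ratio > threshold]
    if not over:
        return []
    candidate = max(over, key=lambda s: s.droppable_ratio)
    older_overlapping = [
        s for s in sstables
        if s is not candidate
        and s.first_token <= candidate.last_token
        and s.last_token >= candidate.first_token
        and s.max_timestamp < candidate.max_timestamp
    ]
    return [candidate] + older_overlapping
```

For example, an sstable that overlaps the candidate's token range but holds only newer data is excluded, since the candidate's tombstones cannot shadow it.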
          Björn Hegerfors added a comment - edited

          I posted a related ticket some time ago, CASSANDRA-8359. In particular, the side note at the end is essentially this ticket exactly, for DTCS. A solution to this ticket may or may not solve the main issue in that ticket, but that's a matter for that ticket.

          Since DTCS SSTables are (supposed to be) separated into time windows, we have the concept of an oldest SSTable in a way that we don't with STCS. To me it seems pretty clear that a multi-SSTable tombstone compaction on n SSTables should always target the n oldest ones. The oldest one alone is practically guaranteed to overlap with any other SSTable, in terms of tokens. So picking the right SSTables for multi-tombstone compaction should be as easy as sorting by age (min timestamp), taking the oldest one, and include the newer ones in succession, checking at which point the tombstone ratio is the highest. Or something close to that, anyway. Then we might as well write them back as a single SSTable, I don't see why not.
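The oldest-first selection described above could be sketched like this (hypothetical tuples standing in for sstable metadata):

```python
def dtcs_tombstone_bucket(sstables):
    """sstables: (min_timestamp, droppable_tombstones, total_cells) tuples.
    Sort oldest-first and grow the bucket one sstable at a time, keeping the
    prefix whose combined droppable-tombstone ratio is highest."""
    ordered = sorted(sstables, key=lambda s: s[0])
    best_prefix, best_ratio = [], -1.0
    dropped = total = 0
    for i, (_, d, t) in enumerate(ordered):
        dropped += d
        total += t
        if total and dropped / total > best_ratio:
            best_ratio = dropped / total
            best_prefix = ordered[:i + 1]
    return best_prefix
```

Because the oldest sstable anchors every bucket, each selected set overlaps in the way the comment describes, and the whole bucket can then be written back as a single sstable.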

          EDIT: moved the following to CASSANDRA-7272, where it belongs.

          As for the STCS case, I don't understand why major compaction for STCS isn't already optimal. I do see why one might want to compact some but not all SSTables in a multi-tombstone compaction (though DTCS should be a better fit for anyone wanting this). But if every single SSTable is being rewritten to disk, why not write them into one file? As far as I understand, the ultimate goal of STCS is to be one SSTable. STCS only gets there, the natural way, once in a blue moon. But that's the most optimal state that it can be in. Am I wrong?

          The only explanation I can see for splitting the result of compacting all SSTables into fragments is if those fragments are:
          1. Partitioned smartly - for example into separate token ranges (à la LCS), timestamp ranges (à la DTCS), or clustering column ranges (which would be interesting), or a combination of these.
          2. Structured so that the running compaction strategy does not subsequently demolish that structure while going about its usual business.

          Branimir Lambov added a comment -

          My understanding of the problem is that since we only remove deleted content when we compact the data and its tombstone together, there is not much chance tombstones or data can actually be removed without a global (major) compaction. This is far from ideal for levelled compaction, and is also a problem for any other compaction method due to the size of the operation.

          To enable smaller-sized transformations that do delete data (which would later enable the removal of tombstones as well), I made a work-in-progress patch of an option to use all overlapping sstables as sources of tombstones when doing compaction. These tombstones are used only to filter out deleted content, and will not appear in the resulting tables. The process will slow the compactions to which it is applied down as it does have to read from a lot of sstables, but hopefully the slowdown will be acceptable with the 8099 improvements.
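A toy model of that approach (assumed data shapes; the point is only that tombstones read from the overlapping sstables shadow rows without themselves being written to the output):

```python
def purge_with_external_tombstones(source_rows, tombstone_sources):
    """source_rows: {key: (write_timestamp, value)} for the sstable being
    compacted. tombstone_sources: [{key: deletion_timestamp}, ...] read from
    overlapping sstables. A row shadowed by a same-or-newer deletion anywhere
    is dropped; the tombstones themselves never reach the output."""
    deletions = {}
    for source in tombstone_sources:
        for key, ts in source.items():
            deletions[key] = max(ts, deletions.get(key, ts))
    return {
        key: (ts, value)
        for key, (ts, value) in source_rows.items()
        if deletions.get(key, -1) < ts
    }

rows = {"a": (5, "x"), "b": (10, "y")}
print(purge_with_external_tombstones(rows, [{"a": 7}, {"b": 3}]))
# {'b': (10, 'y')}: "a" is shadowed by the newer deletion at timestamp 7
```

The extra reads against the tombstone sources are exactly the slowdown the comment anticipates.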

          The patch is uploaded with that option turned permanently on (for testing). I am not yet sure how best to use it, but I can see several possibilities:

          • as an option that can be generally turned on (via -D and yaml option, best via JMX as well) for a node and applies to all compactions. This will trade compaction performance for improved dataset size and read performance and may not be suitable for all scenarios.
          • as an option that is turned on by strategy, e.g. specifically for the levelled compaction strategy.
          • as a flag that can be turned on when applying specific compactions (e.g. cleanup).
          • as a new compaction type that is applied to one sstable at a time to remove deleted content from that table.
          • as above, but triggered automatically when a tombstone compaction identifies tables with deleted content.

          I personally think the first option with a default value of "on" for the levelled strategy makes most sense, but I want to hear other opinions.

          Marcus Eriksson, could you take a look and tell me what you think of the approach and code, and which of the above makes most sense to you?

          Marcus Eriksson added a comment -

          Perhaps we could apply it to the higher (L3+?) levels in leveled compaction? Or perhaps the highest and highest - 1 levels. I see two reasons for this:

          • we would reduce the number of sstables we have to read when doing these compactions - since most sstables in L0 overlap all other sstables we would read everything on disk if we did this for all levels. An sstable in L3 will only overlap 1 (or 2) sstables in each of L2 and L1.
          • data in the higher levels should be older, meaning we most often can drop more covered data/tombstones when doing the high level compactions
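The level-overlap reasoning above comes down to a token-range intersection check. A minimal sketch (illustrative names only, not Cassandra's actual API): in L1 and above, tables within a level are non-overlapping, so a table in L3 intersects only one or two tables in each lower level, while L0 ranges are arbitrary and nearly everything intersects.

```java
public class RangeOverlap {
    // Two sstables overlap when their [first, last] token ranges intersect.
    static boolean overlaps(long firstA, long lastA, long firstB, long lastB) {
        return firstA <= lastB && firstB <= lastA;
    }
}
```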

          Question: (I only quickly skimmed the code) do we drop overwritten data as well?

          This approach will not be 'perfect' like the 'compact all sstables 1:1'-approach outlined above, but I like it - it is quite simple to reason about and should give good results (if the performance does not turn out to be horrible).

          Hide
          blambov Branimir Lambov added a comment -

          Perhaps we could apply it to the higher (L3+?) levels in leveled compaction?

          This makes sense. Better still, we can probably exclude older tables altogether (as they can't contain newer tombstones), which should avoid bringing higher-level tables into a lower-level compaction.

          do we drop overwritten data as well?

          No. I was hoping the table iterators could deserialize just a row header, which, after reading the code more closely, doesn't appear to be the case. I'll do some performance evaluations next and experiment with different variations (e.g. slicing to reach the row within the partition instead of iterating, as well as looking within the row).

          Hide
          blambov Branimir Lambov added a comment -

          Updated the patch, it is now ready for review.

          Adds a compaction option called provide_overlapping_tombstones which enables the use of tombstones from overlapping sstables in all compactions. Adds tests and a benchmark of compaction performance with and without the flag. This version will not use sstables whose max timestamp means they cannot contain relevant tombstones (which solves the higher-levels problem in leveled compaction).
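The timestamp filter described above can be sketched as follows (hypothetical names, not the patch's actual code): a tombstone only shadows data with an older-or-equal timestamp, so an sstable whose newest write is older than the oldest data being compacted cannot supply a relevant tombstone.

```java
public class TombstoneSourceFilter {
    // A candidate sstable can only provide relevant tombstones if it may
    // contain a tombstone at least as new as the oldest data being
    // compacted; otherwise none of its tombstones can shadow anything.
    static boolean canProvideTombstones(long candidateMaxTimestamp,
                                        long compactingMinTimestamp) {
        return candidateMaxTimestamp >= compactingMinTimestamp;
    }
}
```

This is also why higher LCS levels drop out naturally: their data, and hence their max timestamps, are older.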

          Performance appears to be very acceptable:

          {"class":"org.apache.cassandra.db.compaction.LeveledCompactionStrategy"}
          Copy compactions completed in 3.302s
          Operations completed in 288.778s, out of which 37.088 for ongoing background compactions
          At start:            7 tables    455928060 bytes       432373 rows       165634 deleted rows        65872 tombstone markers
          At end:              7 tables    455932906 bytes       432373 rows       165634 deleted rows        65872 tombstone markers
          
          {"class":"org.apache.cassandra.db.compaction.LeveledCompactionStrategy"}
          GC compactions completed in 3.106s
          Operations completed in 314.313s, out of which 38.003 for ongoing GC compactions
          At start:            7 tables    456105855 bytes       432373 rows       165634 deleted rows        65872 tombstone markers
          At end:              7 tables    362819985 bytes       343661 rows       142714 deleted rows        63222 tombstone markers
          
          {"max_threshold":"32","min_threshold":"4","class":"org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy"}
          Copy compactions completed in 7.924s
          Operations completed in 481.763s, out of which 16.680 for ongoing background compactions
          At start:           11 tables   1387890116 bytes      1318812 rows       358724 deleted rows       190539 tombstone markers
          At end:             11 tables   1387887287 bytes      1318812 rows       358724 deleted rows       190539 tombstone markers
          
          {"max_threshold":"32","min_threshold":"4","class":"org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy"}
          GC compactions completed in 5.551s
          Operations completed in 416.796s, out of which 19.210 for ongoing GC background compactions
          At start:           11 tables   1176568240 bytes      1117556 rows       323352 deleted rows       178182 tombstone markers
          At end:             11 tables    362865780 bytes       343728 rows       142714 deleted rows        88709 tombstone markers
          

          The benchmark adds and deletes data in a table, compacts at predefined times, and finally does a one-by-one compaction on all sstables to remove deleted data. Turning the flag on does not have a visible effect on the performance of the ongoing compactions, and even the GC at the end is not much slower than just copying.

          Because it removes deleted data, it can cause compaction to run faster with the flag enabled in workloads that include deletions.

          My plan is to create several sub-tickets next:

          • To further improve performance by reshuffling BigTableScanner code to implement tombstone iteration using a shared skipping mechanism with slicing.
          • To add a GC compaction operation as done by the benchmark to improve the state of existing tables.
          Hide
          krummas Marcus Eriksson added a comment -

          Ok, a few comments (pushed here)

          • tombstone source iterator is not closed
          • we must never bring data from the tombstone sources into the new compacted sstable (it currently brings in the partition-level deletion) - this could mess up the timestamps for DTCS, for example
          • if we are running with "only_purge_repaired_tombstones" and are compacting unrepaired data we can't drop any tombstones
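The "iterator is not closed" point above is the classic resource-leak pattern; a hedged illustration (stand-in class, not the patch's code) of the usual fix, try-with-resources, which guarantees the source is closed even when the compaction aborts mid-merge:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TombstoneSourceClosing {
    // Hypothetical stand-in for a per-sstable tombstone source iterator.
    static final class TombstoneSource implements AutoCloseable {
        final AtomicBoolean closed;
        TombstoneSource(AtomicBoolean closed) { this.closed = closed; }
        @Override public void close() { closed.set(true); }
    }

    // try-with-resources runs close() whether or not the merge throws.
    static boolean sourceClosedAfterUse(boolean abortMidway) {
        AtomicBoolean closed = new AtomicBoolean(false);
        try (TombstoneSource src = new TombstoneSource(closed)) {
            if (abortMidway)
                throw new RuntimeException("compaction aborted");
        } catch (RuntimeException expected) {
            // swallowed for the demo; close() has already run
        }
        return closed.get();
    }
}
```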

          thoughts/questions:

          • did you compare performance when dropping overwritten data?
          • we should probably run this for a while on a large data set (unless you did that already?)
          • cell level tombstones are not handled, is that for performance reasons? (do we avoid deserializing the row?)
          • should we always enable this for the single-sstable tombstone compactions?
          Hide
          blambov Branimir Lambov added a comment -

          Thanks for the updates.

          I haven't looked at dropping overwritten data or cell-level tombstones yet – the idea is still to avoid deserializing (though we currently don't). I intend to look at that after I have a non-deserializing option to compare with.

          should we always enable this for the single-sstable tombstone compactions?

          It shouldn't hurt. I'll make the change.

          we should probably run this for a while on a large data set

          Any ideas how to try that? Stress does not appear to support deletions. I'll try adding that now.

          Hide
          blambov Branimir Lambov added a comment -

          Uploaded a new version here:

          code utest dtest

          Changes:

          • The option is changed to an enum with three values: NONE, ROW and CELL, controlling what level of deletions and overwrites to examine.
          • ROW option is as in the previous version, but now uses an implementation of the simple sstable iterator that only decodes tombstones and row deletions, skipping over row content (for 3.0+ tables only). It also skips sstables that do not have tombstones.
          • CELL option also examines row content to find overwritten or deleted cells.
          • The partition level deletion is now handled properly – partially undoing your change – it is removed if superseded by the one from the tombstone source. The latter is also used to filter the partition content.
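The enum values match the description above; the parsing code below is an illustrative sketch of how such a compaction option might be read from the options map, not the actual patch (Cassandra's real option handling lives elsewhere):

```java
import java.util.Map;

public class OverlappingTombstonesOption {
    // Granularity of deletion/overwrite detection, as described above.
    enum TombstoneOption { NONE, ROW, CELL }

    // Parse the compaction option, defaulting to NONE when absent.
    static TombstoneOption parse(Map<String, String> options) {
        String v = options.get("provide_overlapping_tombstones");
        return v == null ? TombstoneOption.NONE
                         : TombstoneOption.valueOf(v.toUpperCase());
    }
}
```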

          Minor additional changes:

          • Data file references of the tombstone tables are now explicitly opened and closed, only once.
          • Fixes bug in hashCode calculation for BTreeRow, which was always producing a different value.
          • Fixes unnecessary sorting in finding table for tombstone compaction.
          • Adds more tests and fixes test failures.

          Performance run results:

          {"provide_overlapping_tombstones":"CELL","class":"org.apache.cassandra.db.compaction.LeveledCompactionStrategy"}
          CELL compactions completed in 6.364s
          Operations completed in 394.591s, out of which 52.562 for ongoing NONE background compactions
          At start:            9 tables    922541625 bytes       876088 rows       423530 deleted rows        42867 tombstone markers
          At end:              9 tables    853445991 bytes       810249 rows       407096 deleted rows        41779 tombstone markers
          
          {"provide_overlapping_tombstones":"ROW","class":"org.apache.cassandra.db.compaction.LeveledCompactionStrategy"}
          ROW compactions completed in 6.577s
          Operations completed in 408.181s, out of which 54.373 for ongoing NONE background compactions
          At start:            9 tables    922539568 bytes       876088 rows       423530 deleted rows        42867 tombstone markers
          At end:              9 tables    853446320 bytes       810249 rows       407096 deleted rows        41779 tombstone markers
          
          {"provide_overlapping_tombstones":"NONE","class":"org.apache.cassandra.db.compaction.LeveledCompactionStrategy"}
          NONE compactions completed in 6.415s
          Operations completed in 402.645s, out of which 53.084 for ongoing NONE background compactions
          At start:            9 tables    922534683 bytes       876088 rows       423530 deleted rows        42867 tombstone markers
          At end:              9 tables    922531607 bytes       876088 rows       423530 deleted rows        42867 tombstone markers
          
          {"max_threshold":"32","min_threshold":"4","provide_overlapping_tombstones":"CELL","class":"org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy"}
          CELL compactions completed in 10.119s
          Operations completed in 527.998s, out of which 18.164 for ongoing NONE background compactions
          At start:           12 tables   1627719240 bytes      1549035 rows       551694 deleted rows        68948 tombstone markers
          At end:             12 tables    853460582 bytes       835123 rows       407096 deleted rows        51964 tombstone markers
          
          {"max_threshold":"32","min_threshold":"4","provide_overlapping_tombstones":"ROW","class":"org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy"}
          ROW compactions completed in 8.299s
          Operations completed in 519.072s, out of which 18.572 for ongoing NONE background compactions
          At start:           12 tables   1627702075 bytes      1549035 rows       551694 deleted rows        68948 tombstone markers
          At end:             12 tables    879153760 bytes       835123 rows       407096 deleted rows        51964 tombstone markers
          
          {"max_threshold":"32","min_threshold":"4","provide_overlapping_tombstones":"NONE","class":"org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy"}
          NONE compactions completed in 9.465s
          Operations completed in 509.603s, out of which 18.052 for ongoing NONE background compactions
          At start:           12 tables   1627710033 bytes      1549035 rows       551694 deleted rows        68948 tombstone markers
          At end:             12 tables   1627706918 bytes      1549035 rows       551694 deleted rows        68948 tombstone markers
          

          For size-tiered, ROW does most of the work in much less time, but there are certain to be scenarios where CELL helps more. The run doesn't appear to be long enough to show the effects for leveled; I'll add validation and start a longer one this evening.

          Some of your points still remain:

          • I haven't been able to do a cstar_perf test yet. Working on it.
          • Single-table compactions still don't have this turned on by default – need to test and choose CELL/ROW, and also figure out whether scrub/upgrade/cleanup etc. should be doing it.
          Hide
          blambov Branimir Lambov added a comment -

          Results of the longer run last evening:

          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 17.772s
          {"provide_overlapping_tombstones":"CELL","class":"org.apache.cassandra.db.compaction.LeveledCompactionStrategy","sstable_size_in_mb":"16"}
          CELL compactions completed in 13.662s
          Operations completed in 3147.977s, out of which 88.668 for ongoing CELL background compactions
          At start:           53 tables   1015895699 bytes       985074 rows       450179 deleted rows        49132 tombstone markers
          At end:             53 tables    522294953 bytes       526978 rows       274600 deleted rows        35211 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 16.924s
          
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 18.580s
          {"provide_overlapping_tombstones":"ROW","class":"org.apache.cassandra.db.compaction.LeveledCompactionStrategy","sstable_size_in_mb":"16"}
          ROW compactions completed in 11.817s
          Operations completed in 3370.982s, out of which 87.959 for ongoing ROW background compactions
          At start:           52 tables   1034287909 bytes       983095 rows       444773 deleted rows        48188 tombstone markers
          At end:             52 tables    555226285 bytes       527258 rows       274600 deleted rows        35055 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 15.539s
          
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 17.840s
          {"provide_overlapping_tombstones":"NONE","class":"org.apache.cassandra.db.compaction.LeveledCompactionStrategy","sstable_size_in_mb":"16"}
          NONE compactions completed in 17.979s
          Operations completed in 4319.732s, out of which 83.286 for ongoing NONE background compactions
          At start:           57 tables   1333377280 bytes      1267545 rows       556267 deleted rows        49755 tombstone markers
          At end:             57 tables   1333372458 bytes      1267545 rows       556267 deleted rows        49755 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 17.709s
          
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 19.713s
          {"max_threshold":"32","min_threshold":"4","provide_overlapping_tombstones":"CELL","min_sstable_size":"0","class":"org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy"}
          CELL compactions completed in 24.677s
          Operations completed in 4421.714s, out of which 91.985 for ongoing CELL background compactions
          At start:           23 tables   1842748717 bytes      1780330 rows       465835 deleted rows        52040 tombstone markers
          At end:             23 tables    522233965 bytes       542078 rows       274600 deleted rows        34771 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 17.628s
          
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 19.095s
          {"max_threshold":"32","min_threshold":"4","provide_overlapping_tombstones":"ROW","min_sstable_size":"0","class":"org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy"}
          ROW compactions completed in 12.763s
          Operations completed in 4223.531s, out of which 59.436 for ongoing ROW background compactions
          At start:           25 tables   1744662122 bytes      1662199 rows       526525 deleted rows        63241 tombstone markers
          At end:             25 tables    566431602 bytes       538314 rows       274600 deleted rows        37604 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 15.674s
          
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 20.745s
          {"max_threshold":"32","min_threshold":"4","provide_overlapping_tombstones":"NONE","min_sstable_size":"0","class":"org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy"}
          NONE compactions completed in 18.571s
          Operations completed in 8617.601s, out of which 79.785 for ongoing NONE background compactions
          At start:           20 tables   2939870504 bytes      2800985 rows       756021 deleted rows        69308 tombstone markers
          At end:             20 tables   2939870070 bytes      2800985 rows       756021 deleted rows        69308 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 21.108s
          

          There are two differences here:

          • The test now runs hash UDF/UDAs before and after the GC compaction at the end and compares the results to verify no data has changed.
          • The compactions applied during the construction of the test data ("background compactions" above) are of the tested type.
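The verification step above boils down to a simple invariant, sketched here with illustrative names: a GC compaction may only drop shadowed or deleted data, so content hashes computed over the live data before and after it must match.

```java
import java.util.Arrays;

public class GcVerification {
    // The hash lists printed before and after the GC compaction must be
    // identical, proving only already-deleted/overwritten data was removed.
    static boolean contentUnchanged(int[] hashesBefore, int[] hashesAfter) {
        return Arrays.equals(hashesBefore, hashesAfter);
    }
}
```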

          The branch is updated with the new version of the test.

          Show
          blambov Branimir Lambov added a comment -

          Results of the longer run last evening:

          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 17.772s

          { "provide_overlapping_tombstones" : "CELL" , "class" : "org.apache.cassandra.db.compaction.LeveledCompactionStrategy" , "sstable_size_in_mb" : "16" }
          CELL compactions completed in 13.662s
          Operations completed in 3147.977s, out of which 88.668 for ongoing CELL background compactions
          At start: 53 tables, 1015895699 bytes, 985074 rows, 450179 deleted rows, 49132 tombstone markers
          At end: 53 tables, 522294953 bytes, 526978 rows, 274600 deleted rows, 35211 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 16.924s

          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 18.580s

          { "provide_overlapping_tombstones" : "ROW" , "class" : "org.apache.cassandra.db.compaction.LeveledCompactionStrategy" , "sstable_size_in_mb" : "16" }
          ROW compactions completed in 11.817s
          Operations completed in 3370.982s, out of which 87.959 for ongoing ROW background compactions
          At start: 52 tables, 1034287909 bytes, 983095 rows, 444773 deleted rows, 48188 tombstone markers
          At end: 52 tables, 555226285 bytes, 527258 rows, 274600 deleted rows, 35055 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 15.539s

          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 17.840s

          { "provide_overlapping_tombstones" : "NONE" , "class" : "org.apache.cassandra.db.compaction.LeveledCompactionStrategy" , "sstable_size_in_mb" : "16" }
          NONE compactions completed in 17.979s
          Operations completed in 4319.732s, out of which 83.286 for ongoing NONE background compactions
          At start: 57 tables, 1333377280 bytes, 1267545 rows, 556267 deleted rows, 49755 tombstone markers
          At end: 57 tables, 1333372458 bytes, 1267545 rows, 556267 deleted rows, 49755 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 17.709s

          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 19.713s

          { "max_threshold" : "32" , "min_threshold" : "4" , "provide_overlapping_tombstones" : "CELL" , "min_sstable_size" : "0" , "class" : "org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy" }
          CELL compactions completed in 24.677s
          Operations completed in 4421.714s, out of which 91.985 for ongoing CELL background compactions
          At start: 23 tables, 1842748717 bytes, 1780330 rows, 465835 deleted rows, 52040 tombstone markers
          At end: 23 tables, 522233965 bytes, 542078 rows, 274600 deleted rows, 34771 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 17.628s

          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 19.095s

          { "max_threshold" : "32" , "min_threshold" : "4" , "provide_overlapping_tombstones" : "ROW" , "min_sstable_size" : "0" , "class" : "org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy" }
          ROW compactions completed in 12.763s
          Operations completed in 4223.531s, out of which 59.436 for ongoing ROW background compactions
          At start: 25 tables, 1744662122 bytes, 1662199 rows, 526525 deleted rows, 63241 tombstone markers
          At end: 25 tables, 566431602 bytes, 538314 rows, 274600 deleted rows, 37604 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 15.674s

          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 20.745s

          { "max_threshold" : "32" , "min_threshold" : "4" , "provide_overlapping_tombstones" : "NONE" , "min_sstable_size" : "0" , "class" : "org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy" }
          NONE compactions completed in 18.571s
          Operations completed in 8617.601s, out of which 79.785 for ongoing NONE background compactions
          At start: 20 tables, 2939870504 bytes, 2800985 rows, 756021 deleted rows, 69308 tombstone markers
          At end: 20 tables, 2939870070 bytes, 2800985 rows, 756021 deleted rows, 69308 tombstone markers
          Hashes: [495354, 748895267, -2068352751, 1145481084, -1976889921, 9, 2539, -34], retrieved in 21.108s

          There are two differences here:
          • The test now runs hash UDF/UDAs before and after the GC compaction at the end and compares the results to verify no data has changed.
          • The compactions applied during the construction of the test data ("background compactions" above) are of the tested type.

          The branch is updated with the new version of the test.
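          To make the space savings above easier to read, the LCS before/after byte counts convert to percentage reductions. This is a throwaway sketch using only the numbers reported in the run:

```python
# Space reduction per provide_overlapping_tombstones option, using the
# LCS (before, after) byte counts reported in the run above.
results = {
    "CELL": (1015895699, 522294953),
    "ROW":  (1034287909, 555226285),
    "NONE": (1333377280, 1333372458),
}

reduction = {opt: round(100.0 * (before - after) / before, 1)
             for opt, (before, after) in results.items()}

for opt, pct in reduction.items():
    print(f"{opt}: {pct}% smaller after garbage collection")
```

          So CELL and ROW reclaim close to half the data, while NONE (as expected) leaves the sstables essentially untouched.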
          Hide
          blambov Branimir Lambov added a comment -

          Implemented a "GarbageCollect" compaction operation that compacts each table separately and takes a TombstoneOption argument to specify the granularity of the collection. It is exposed as a JMX method and a nodetool command. The code is uploaded here:

          branch diff to prev version utest dtest cstar_perf

          The perf run tries to see a difference between the three provide_overlapping_tombstones options, as well as the effect of running nodetool garbagecollect on later reads. It does appear to show slightly faster reads during the second stage for the GC-enabled runs, as well as a significant improvement for reads after both garbage collection passes.

          Note: Please disregard delete.yaml, logger.info in CompactionController and the "Garbage collecting..." printout in GarbageCollect. They are temporary, to make sure cstar_perf is doing the right thing.

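          For intuition, the per-sstable collection described above can be sketched roughly as follows. All names and data shapes here are illustrative, not the actual Cassandra API, and only ROW granularity is modeled (CELL would apply the same shadowing check per cell):

```python
# Illustrative sketch of a per-sstable "garbage collect" pass: each
# sstable is rewritten on its own, dropping rows that are shadowed by
# newer tombstones found in the *other*, overlapping sstables.
def garbage_collect(sstables):
    cleaned = []
    for table in sstables:
        # Newest deletion timestamp per key across the other sstables.
        shadows = {}
        for other in sstables:
            if other is table:
                continue
            for key, del_ts in other["tombstones"].items():
                shadows[key] = max(shadows.get(key, -1), del_ts)
        # Keep a row only if no newer tombstone shadows it.
        kept = {key: write_ts
                for key, write_ts in table["rows"].items()
                if shadows.get(key, -1) < write_ts}
        cleaned.append({"rows": kept, "tombstones": table["tombstones"]})
    return cleaned

# One sstable holds old rows; another holds a newer tombstone for k1.
tables = [
    {"rows": {"k1": 1, "k2": 3}, "tombstones": {}},
    {"rows": {}, "tombstones": {"k1": 5}},
]
cleaned = garbage_collect(tables)
```

          The key point, matching the description above, is that each output sstable is produced from a single input sstable, so the operation never merges sstables; it only consults the others for tombstone information.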
          Hide
          krummas Marcus Eriksson added a comment - - edited
          • comment on Rows#removeShadowedCells needs updating
          • in bde4c02cc858a900d43028b9a930e805ab232c27 there seems to be a few unrelated fixes (like the AbstractRow hashCode fix for example), should we break them out in a separate ticket? (so that we get them in 3.0 as well)
          • why the FBUtilities.closeAll change? (going from Iterable<..> to List<..>)

          I pushed a few small fixes here as well

          And I think we need to test these scenarios:

          • how does nodetool garbagecollect work if there are 1000+ sstables?
          • run a repair on a vnode cluster with 100+GB (that usually creates a lot of sstables)
          Hide
          blambov Branimir Lambov added a comment -

          Latest version here, with the extra logging removed and the comment updated. Let me know if you are OK with it, and I'll squash and rebase.

          in bde4c02cc858a900d43028b9a930e805ab232c27 there seems to be a few unrelated fixes (like the AbstractRow hashCode fix for example), should we break them out in a separate ticket? (so that we get them in 3.0 as well)

          The hashCode is the only fix there, extracted as CASSANDRA-11316. Another commit has a slight improvement in SizeTieredCompactionStrategy.getNextBackgroundSSTables which I don't think is worth a separate ticket.

          why the FBUtilities.closeAll change? (going from Iterable<..> to List<..>)

          I changed it from List to Iterable, but then realized I don't need it (FileUtils has the right method) and reverted the change. Patch includes no changes to FBUtilities.

          Hide
          philipthompson Philip Thompson added a comment -

          While running a long test with 7019, I'm seeing this OOM:

          ERROR [CompactionExecutor:7] 2016-04-26 07:34:46,758 CassandraDaemon.java:195 - Exception in thread Thread[CompactionExecutor:7,1,main]
          java.lang.OutOfMemoryError: null
          	at org.apache.cassandra.io.util.Memory.<init>(Memory.java:78) ~[main/:na]
          	at org.apache.cassandra.io.util.SafeMemory.<init>(SafeMemory.java:32) ~[main/:na]
          	at org.apache.cassandra.io.util.SafeMemory.copy(SafeMemory.java:72) ~[main/:na]
          	at org.apache.cassandra.io.util.SafeMemoryWriter.reallocate(SafeMemoryWriter.java:55) ~[main/:na]
          	at org.apache.cassandra.io.util.SafeMemoryWriter.setCapacity(SafeMemoryWriter.java:68) ~[main/:na]
          	at org.apache.cassandra.io.sstable.IndexSummaryBuilder.prepareToCommit(IndexSummaryBuilder.java:210) ~[main/:na]
          	at org.apache.cassandra.io.sstable.format.big.BigTableWriter$IndexWriter.doPrepare(BigTableWriter.java:473) ~[main/:na]
          	at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173) ~[main/:na]
          	at org.apache.cassandra.io.sstable.format.big.BigTableWriter$TransactionalProxy.doPrepare(BigTableWriter.java:307) ~[main/:na]
          	at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173) ~[main/:na]
          	at org.apache.cassandra.io.sstable.format.SSTableWriter.prepareToCommit(SSTableWriter.java:275) ~[main/:na]
          	at org.apache.cassandra.io.sstable.SSTableRewriter.doPrepare(SSTableRewriter.java:367) ~[main/:na]
          	at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173) ~[main/:na]
          	at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.doPrepare(CompactionAwareWriter.java:110) ~[main/:na]
          	at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173) ~[main/:na]
          	at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish(Transactional.java:184) ~[main/:na]
          	at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.finish(CompactionAwareWriter.java:120) ~[main/:na]
          	at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:197) ~[main/:na]
          	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[main/:na]
          	at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:82) ~[main/:na]
          	at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60) ~[main/:na]
          	at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:265) ~[main/:na]
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_40]
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_40]
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_40]
          	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
          

          This is with provide_overlapping_tombstones: NONE, however. I am also running my test on trunk, and have not seen the issue there yet, nor on any of my other runs. https://gist.github.com/ptnapoleon/ce59cc5a2a42ecfd3d27657f311e4615

          I've attached the debug.log of the node, as 7019-debug.log

          Hide
          philipthompson Philip Thompson added a comment -

          Failure from another run. These are all on c3.4xlarge nodes with 30 GB of RAM, which should be sufficient for the type of stress runs we're doing, no?

          ERROR [CompactionExecutor:6] 2016-04-26 02:23:42,296 CassandraDaemon.java:195 - Exception in thread Thread[CompactionExecutor:6,1,main]
          java.lang.RuntimeException: Out of native memory occured, You can avoid it by increasing the system ram space or by increasing bloom_filter_fp_chance.
          	at org.apache.cassandra.utils.obs.OffHeapBitSet.<init>(OffHeapBitSet.java:49) ~[main/:na]
          	at org.apache.cassandra.utils.FilterFactory.createFilter(FilterFactory.java:85) ~[main/:na]
          	at org.apache.cassandra.utils.FilterFactory.getFilter(FilterFactory.java:78) ~[main/:na]
          	at org.apache.cassandra.io.sstable.format.big.BigTableWriter$IndexWriter.<init>(BigTableWriter.java:382) ~[main/:na]
          	at org.apache.cassandra.io.sstable.format.big.BigTableWriter.<init>(BigTableWriter.java:87) ~[main/:na]
          	at org.apache.cassandra.io.sstable.format.big.BigFormat$WriterFactory.open(BigFormat.java:92) ~[main/:na]
          	at org.apache.cassandra.io.sstable.format.SSTableWriter.create(SSTableWriter.java:101) ~[main/:na]
          	at org.apache.cassandra.db.compaction.writers.MaxSSTableSizeWriter.switchCompactionLocation(MaxSSTableSizeWriter.java:96) ~[main/:na]
          	at org.apache.cassandra.db.compaction.writers.MaxSSTableSizeWriter.realAppend(MaxSSTableSizeWriter.java:86) ~[main/:na]
          	at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.append(CompactionAwareWriter.java:140) ~[main/:na]
          	at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:186) ~[main/:na]
          	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[main/:na]
          	at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:82) ~[main/:na]
          	at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60) ~[main/:na]
          	at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:265) ~[main/:na]
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_40]
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_40]
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_40]
          	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
          
          ERROR [Reference-Reaper:1] 2016-04-26 02:23:42,835 Ref.java:203 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@5c307504) to class org.apache.cassandra.io.util.SafeMemory$MemoryTidy@1391956210:Memory@[7fa81dcab3e0..7fa81e639a60) was not released before the reference was garbage collected
          ERROR [Reference-Reaper:1] 2016-04-26 02:23:42,836 Ref.java:203 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@56c81e74) to class org.apache.cassandra.io.util.SafeMemory$MemoryTidy@332045001:Memory@[7f92e23ce010..7f92e835e110) was not released before the reference was garbage collected
          

          Attached as 7019-2-system.log
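          The second failure points at the bloom filter's off-heap allocation (OffHeapBitSet), and the error message itself suggests raising bloom_filter_fp_chance. The standard bloom filter sizing formula gives a feel for the numbers involved; this is a back-of-envelope sketch, not Cassandra's exact sizing code:

```python
import math

def bloom_filter_bytes(n_keys, fp_chance):
    """Classic bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits
    for n keys at false-positive chance p."""
    bits = -n_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return int(bits) // 8

mb = 1024 * 1024
size_1pct = bloom_filter_bytes(100_000_000, 0.01)   # ~114 MiB off-heap
size_10pct = bloom_filter_bytes(100_000_000, 0.10)  # ~57 MiB off-heap
```

          Every sstable writer allocates a filter like this up front, so many concurrent compactions over large sstables can add up; relaxing the fp chance roughly halves the allocation per order of magnitude, at the cost of more false-positive disk reads.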

          Hide
          philipthompson Philip Thompson added a comment -

          So I ran some ~day-long tests, using the three new options from this ticket (none, row, and cell) and the parent commit on trunk, which I labeled control. I then grabbed all of the compaction throughput results from the debug.log files and attached a very basic graph here, as temp-plot.html.

          The 20% delete workload had a mixture of partition, row, and cell deletions. We saw improved compaction throughput with 7019 when compared to trunk, and then expected but acceptable drops in compaction performance with the more aggressive tombstone removal options.

          I will begin dumping all the raw data I have, as well as the stress graphs so you can see how read/write perf is affected as well. I should note, at this time I have found nothing that would prevent me from being +1 on this ticket.

          Hide
          philipthompson Philip Thompson added a comment -

          I have attached tarballs with the compaction.log, debug.log, and stress log files for each test. I am not able to produce a functioning stress graph, probably due to their size.

          Hide
          philipthompson Philip Thompson added a comment - - edited

          Here is the relevant portion of the stress outputs:

          CONTROL:

          Results:
          Op rate                   :   21,050 op/s  [columndelete: 1,403 op/s, delete: 702 op/s, insert: 10,524 op/s, read: 7,017 op/s, rowdelete: 1,404 op/s]
          Partition rate            :   17,540 pk/s  [columndelete: 0 pk/s, delete: 0 pk/s, insert: 10,524 pk/s, read: 7,016 pk/s, rowdelete: 0 pk/s]
          Row rate                  :   43,872 row/s [columndelete: 0 row/s, delete: 0 row/s, insert: 26,324 row/s, read: 17,548 row/s, rowdelete: 0 row/s]
          Latency mean              :    2.4 ms [columndelete: 2.2 ms, delete: 2.1 ms, insert: 2.2 ms, read: 2.7 ms, rowdelete: 2.1 ms]
          Latency median            :    2.0 ms [columndelete: 1.8 ms, delete: 1.8 ms, insert: 1.9 ms, read: 2.3 ms, rowdelete: 1.8 ms]
          Latency 95th percentile   :    3.6 ms [columndelete: 3.3 ms, delete: 3.3 ms, insert: 3.4 ms, read: 4.1 ms, rowdelete: 3.3 ms]
          Latency 99th percentile   :    5.5 ms [columndelete: 4.8 ms, delete: 4.8 ms, insert: 4.9 ms, read: 7.2 ms, rowdelete: 4.7 ms]
          Latency 99.9th percentile :   75.2 ms [columndelete: 69.9 ms, delete: 69.9 ms, insert: 72.3 ms, read: 78.6 ms, rowdelete: 69.3 ms]
          Latency max               : 1032.3 ms [columndelete: 1,004.5 ms, delete: 1,004.5 ms, insert: 1,031.8 ms, read: 1,032.3 ms, rowdelete: 1,003.5 ms]
          Total partitions          : 378,840,394 [columndelete: 0, delete: 0, insert: 227,304,928, read: 151,535,466, rowdelete: 0]
          Total errors              :          0 [columndelete: 0, delete: 0, insert: 0, read: 0, rowdelete: 0]
          Total GC count            : 13,090
          Total GC memory           : 15900.717 GiB
          Total GC time             :  998.3 seconds
          Avg GC time               :   76.3 ms
          StdDev GC time            :   16.1 ms
          Total operation time      : 05:59:58
          

          NONE:

          Results:
          Op rate                   :   20,729 op/s  [columndelete: 1,382 op/s, delete: 690 op/s, insert: 10,366 op/s, read: 6,909 op/s, rowdelete: 1,382 op/s]
          Partition rate            :   17,274 pk/s  [columndelete: 0 pk/s, delete: 0 pk/s, insert: 10,366 pk/s, read: 6,908 pk/s, rowdelete: 0 pk/s]
          Row rate                  :   43,206 row/s [columndelete: 0 row/s, delete: 0 row/s, insert: 25,929 row/s, read: 17,277 row/s, rowdelete: 0 row/s]
          Latency mean              :    2.4 ms [columndelete: 2.1 ms, delete: 2.1 ms, insert: 2.2 ms, read: 2.9 ms, rowdelete: 2.1 ms]
          Latency median            :    1.9 ms [columndelete: 1.7 ms, delete: 1.7 ms, insert: 1.8 ms, read: 2.2 ms, rowdelete: 1.7 ms]
          Latency 95th percentile   :    3.3 ms [columndelete: 2.9 ms, delete: 2.9 ms, insert: 3.0 ms, read: 3.7 ms, rowdelete: 2.9 ms]
          Latency 99th percentile   :    4.7 ms [columndelete: 4.1 ms, delete: 4.1 ms, insert: 4.2 ms, read: 6.0 ms, rowdelete: 4.1 ms]
          Latency 99.9th percentile :   47.6 ms [columndelete: 12.0 ms, delete: 13.0 ms, insert: 14.7 ms, read: 67.9 ms, rowdelete: 13.4 ms]
          Latency max               : 1055.6 ms [columndelete: 1,006.0 ms, delete: 1,004.4 ms, insert: 1,055.6 ms, read: 1,055.6 ms, rowdelete: 1,031.9 ms]
          Total partitions          : 373,111,699 [columndelete: 0, delete: 0, insert: 223,905,059, read: 149,206,640, rowdelete: 0]
          Total errors              :          0 [columndelete: 0, delete: 0, insert: 0, read: 0, rowdelete: 0]
          Total GC count            : 14,082
          Total GC memory           : 17120.316 GiB
          Total GC time             : 1,005.7 seconds
          Avg GC time               :   71.4 ms
          StdDev GC time            :   13.5 ms
          Total operation time      : 06:00:00
          

          ROW:

          Results:
          Op rate                   :   16,121 op/s  [columndelete: 1,075 op/s, delete: 538 op/s, insert: 8,061 op/s, read: 5,372 op/s, rowdelete: 1,074 op/s]
          Partition rate            :   13,432 pk/s  [columndelete: 0 pk/s, delete: 0 pk/s, insert: 8,061 pk/s, read: 5,371 pk/s, rowdelete: 0 pk/s]
          Row rate                  :   33,597 row/s [columndelete: 0 row/s, delete: 0 row/s, insert: 20,165 row/s, read: 13,433 row/s, rowdelete: 0 row/s]
          Latency mean              :    3.1 ms [columndelete: 2.3 ms, delete: 2.3 ms, insert: 2.4 ms, read: 4.5 ms, rowdelete: 2.3 ms]
          Latency median            :    2.3 ms [columndelete: 1.8 ms, delete: 1.7 ms, insert: 1.8 ms, read: 3.5 ms, rowdelete: 1.8 ms]
          Latency 95th percentile   :    5.3 ms [columndelete: 3.8 ms, delete: 3.8 ms, insert: 3.8 ms, read: 6.8 ms, rowdelete: 3.8 ms]
          Latency 99th percentile   :    8.4 ms [columndelete: 6.1 ms, delete: 6.2 ms, insert: 6.2 ms, read: 10.9 ms, rowdelete: 6.3 ms]
          Latency 99.9th percentile :   51.4 ms [columndelete: 16.4 ms, delete: 15.7 ms, insert: 14.7 ms, read: 58.7 ms, rowdelete: 41.4 ms]
          Latency max               : 1053.8 ms [columndelete: 1,003.5 ms, delete: 1,053.7 ms, insert: 1,053.7 ms, read: 1,053.8 ms, rowdelete: 1,006.9 ms]
          Total partitions          : 290,123,117 [columndelete: 0, delete: 0, insert: 174,120,120, read: 116,002,997, rowdelete: 0]
          Total errors              :          0 [columndelete: 0, delete: 0, insert: 0, read: 0, rowdelete: 0]
          Total GC count            : 16,954
          Total GC memory           : 20848.789 GiB
          Total GC time             :  819.7 seconds
          Avg GC time               :   48.3 ms
          StdDev GC time            :   10.9 ms
          Total operation time      : 05:59:58
          

          CELL:

          Results:
          Op rate                   :    5,876 op/s  [columndelete: 392 op/s, delete: 196 op/s, insert: 2,937 op/s, read: 1,959 op/s, rowdelete: 392 op/s]
          Partition rate            :    4,896 pk/s  [columndelete: 0 pk/s, delete: 0 pk/s, insert: 2,937 pk/s, read: 1,959 pk/s, rowdelete: 0 pk/s]
          Row rate                  :   12,247 row/s [columndelete: 0 row/s, delete: 0 row/s, insert: 7,348 row/s, read: 4,899 row/s, rowdelete: 0 row/s]
          Latency mean              :    8.5 ms [columndelete: 3.3 ms, delete: 3.3 ms, insert: 3.4 ms, read: 18.8 ms, rowdelete: 3.3 ms]
          Latency median            :    3.2 ms [columndelete: 1.7 ms, delete: 1.7 ms, insert: 1.7 ms, read: 14.7 ms, rowdelete: 1.7 ms]
          Latency 95th percentile   :   25.6 ms [columndelete: 8.0 ms, delete: 7.9 ms, insert: 8.2 ms, read: 34.5 ms, rowdelete: 8.0 ms]
          Latency 99th percentile   :   39.6 ms [columndelete: 13.5 ms, delete: 13.6 ms, insert: 14.2 ms, read: 49.2 ms, rowdelete: 13.6 ms]
          Latency 99.9th percentile :   65.2 ms [columndelete: 35.2 ms, delete: 34.4 ms, insert: 34.5 ms, read: 76.2 ms, rowdelete: 31.3 ms]
          Latency max               : 1037.8 ms [columndelete: 208.5 ms, delete: 1,002.7 ms, insert: 1,006.2 ms, read: 1,037.8 ms, rowdelete: 231.9 ms]
          Total partitions          : 105,729,389 [columndelete: 0, delete: 0, insert: 63,434,249, read: 42,295,140, rowdelete: 0]
          Total errors              :          0 [columndelete: 0, delete: 0, insert: 0, read: 0, rowdelete: 0]
          Total GC count            : 14,437
          Total GC memory           : 17937.624 GiB
          Total GC time             :  348.4 seconds
          Avg GC time               :   24.1 ms
          StdDev GC time            :    7.1 ms
          Total operation time      : 05:59:54
          

          All were generated with the following stress command:

          cassandra-stress user duration=6h cl=LOCAL_ONE  profile=https://gist.githubusercontent.com/ptnapoleon/ce59cc5a2a42ecfd3d27657f311e4615/raw/fc85d9d715eb6ed87dba10796e505e2b5ef2c7f8/7019-none.yaml ops\(insert=15,read=10,delete=1,rowdelete=2,columndelete=2\) -rate threads=50  -pop  seq=1..750000000 -node 172.31.9.18 -graph file=/home/automaton/fallout_artifacts/delete_heavy-graph.html | tee /home/automaton/fallout_artifacts/stress-delete_heavy.log "

          with the one difference that the linked yaml file changes the relevant compaction option in the schema.

          As you can see, the none and control options are very similar. row sees a ~20% dip in performance, while cell sees a far larger one: roughly 75% of throughput is lost.
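          Quick arithmetic on the op rates above puts numbers on those dips (a throwaway sketch using only the reported figures):

```python
# Relative throughput drop versus the control run, from the op rates above.
control = 21050  # op/s
op_rate = {"NONE": 20729, "ROW": 16121, "CELL": 5876}

drop = {opt: round(100.0 * (control - ops) / control, 1)
        for opt, ops in op_rate.items()}

for opt, pct in drop.items():
    print(f"{opt}: {pct}% below control")
```

          So none is within noise of control, row loses a bit under a quarter of throughput, and cell loses a bit over 72%.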

ms StdDev GC time : 7.1 ms Total operation time : 05:59:54 All were generated with the following stress command: cassandra-stress user duration=6h cl=LOCAL_ONE profile=https://gist.githubusercontent.com/ptnapoleon/ce59cc5a2a42ecfd3d27657f311e4615/raw/fc85d9d715eb6ed87dba10796e505e2b5ef2c7f8/7019-none.yaml ops\(insert=15,read=10,delete=1,rowdelete=2,columndelete=2\) -rate threads=50 -pop seq=1..750000000 -node 172.31.9.18 -graph file=/home/automaton/fallout_artifacts/delete_heavy-graph.html | tee /home/automaton/fallout_artifacts/stress-delete_heavy.log " with the obvious disparity that the linked yaml file changes the relevant compaction option in the schema. As you can see, the none and control options are very similar. row sees a ~20% dip in perf, and cell sees an incredibly large one, roughly 75% of throughput is lost.
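          As a sanity check on those percentages, here is a quick back-of-the-envelope script over the overall Op rate figures (numbers copied from the results above; illustrative only, not part of the patch):

          ```python
          # Throughput loss of each provide-overlapping-tombstones setting
          # relative to the control run, using the overall "Op rate" figures
          # from the four stress runs quoted in this comment.

          op_rates = {
              "control": 21_050,
              "none":    20_729,
              "row":     16_121,
              "cell":     5_876,
          }

          baseline = op_rates["control"]
          for name, rate in op_rates.items():
              loss_pct = 100.0 * (baseline - rate) / baseline
              print(f"{name:8s} {rate:6,d} op/s  ({loss_pct:5.1f}% below control)")
          ```

          This works out to about a 1.5% loss for none, 23% for row, and 72% for cell, consistent with the rough figures above.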
          philipthompson Philip Thompson added a comment - edited

          It does appear this needs to be rebased onto trunk, especially to run dtest, since CCM expects 3.8+ to have CDC. I've attempted to run dtest on this branch with an older CCM to compensate:

          http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-dtest/7/

          Also linking unit tests:
          http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-testall/lastCompletedBuild/testReport/

          blambov Branimir Lambov added a comment -

          Rebased and uploaded again to the same branch:

          code utests dtests
          krummas Marcus Eriksson added a comment -

          I suspect CASSANDRA-12071 hurts the performance here

          I'll do a final review of this today

          krummas Marcus Eriksson added a comment -

          code LGTM, just a few comments:

          • I don't see any unit tests using static rows or complex columns
          • We need more code comments, especially in GarbageSkippingUnfilteredRowIterator and the new methods in Cells and Rows
          • Could you add a dtest that runs nodetool garbagecollect?
          • Add an explanation of the syntax in the CompactionIteratorTest ("5<=[140] 10[150] [140]<40 60[150]" for example) - it is quite confusing otherwise

          nit: some trailing whitespace in the patch (removed here)

          Philip Thompson could we run all dtests with -Ddefault.provide.overlapping.tombstones=ROW and -Ddefault.provide.overlapping.tombstones=CELL once? (there might be some tests where we assert that data is not removed during standard compaction, but we can manually check those)
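          For anyone reading along: the setting those dtest flags default is the compaction sub-option this ticket introduces, set per table in the schema. A sketch of what that looks like (keyspace/table names here are made up; the sub-option name matches the -Ddefault.provide.overlapping.tombstones property above, with values NONE, ROW, or CELL):

          ```sql
          -- Hypothetical table; enables row-level garbage collection during
          -- compaction by consulting overlapping sstables for shadowed data.
          ALTER TABLE ks.tbl WITH compaction = {
              'class': 'LeveledCompactionStrategy',
              'provide_overlapping_tombstones': 'ROW'  -- NONE (default) / ROW / CELL
          };
          ```

          The same collection can be triggered on demand with the nodetool garbagecollect command that the dtest above exercises.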
          philipthompson Philip Thompson added a comment -

          http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-CELL-dtest/
          http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-ROW-dtest/
          blambov Branimir Lambov added a comment -

          Rebased (which required a change in Scrubber.RowMergingSSTableIterator) and added a commit to apply your comments. Uploaded at the same location, plus a pull request for the dtest:

          code new dtest utests dtests
          krummas Marcus Eriksson added a comment -

          +1, looks good to me

          could you add a NEWS.txt entry describing the feature and I'll get it committed?

          blambov Branimir Lambov added a comment -

          Rebased, squashed and added CHANGES.txt and NEWS.txt entries. Pushed here:

          code new dtest utests dtests
          krummas Marcus Eriksson added a comment -

          and committed!


            People

            • Assignee:
              blambov Branimir Lambov
              Reporter:
              krummas Marcus Eriksson
              Reviewer:
              Marcus Eriksson
            • Votes: 9
              Watchers: 40
