Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 2.1 beta2
    • Component/s: None
    • Labels:

      Description

      Related to CASSANDRA-6812, but a little simpler: when compacting, we mess quite badly with the page cache. One thing we can do to mitigate this problem is to use the sstable we're writing before we've finished writing it, and to drop the regions from the old sstables from the page cache as soon as the new sstables have them (even if they're only written to the page cache). This should minimise any page cache churn, as the old sstables must be larger than the new sstable, and since both will be in memory, dropping the old sstables is at least as good as dropping the new.

      The approach is quite straightforward. Every X MB written:

      1. grab the flushed length of the index file;
      2. grab the second-to-last index summary record, after excluding those that point to positions beyond the flushed length;
      3. open the index file, and check that our last record doesn't point outside the flushed length of the data file (pretty unlikely);
      4. open the sstable with the calculated upper bound.
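
      A minimal sketch of that loop, with hypothetical stand-ins (IndexSummary, openReaderWithUpperBound and friends are illustrative, not the real internals):

          import java.io.IOException;
          import java.io.RandomAccessFile;

          class EarlyOpenSketch {
              interface IndexSummary {
                  // data-file position of the second-to-last summary entry whose
                  // index-file position falls within flushedIndexLength
                  long secondToLastBoundBefore(long flushedIndexLength);
              }

              private final RandomAccessFile indexFile, dataFile;
              private final IndexSummary summary;
              private final long intervalBytes; // "X MB"
              private long lastOpenedAt;

              EarlyOpenSketch(RandomAccessFile indexFile, RandomAccessFile dataFile,
                              IndexSummary summary, long intervalBytes) {
                  this.indexFile = indexFile;
                  this.dataFile = dataFile;
                  this.summary = summary;
                  this.intervalBytes = intervalBytes;
              }

              void maybeReopenEarly(long bytesWritten) throws IOException {
                  if (bytesWritten - lastOpenedAt < intervalBytes)
                      return;
                  long flushedIndexLength = indexFile.length();                          // step 1
                  long upperBound = summary.secondToLastBoundBefore(flushedIndexLength); // step 2
                  if (upperBound > dataFile.length())                                    // step 3: unlikely; retry later
                      return;
                  openReaderWithUpperBound(upperBound);                                  // step 4
                  lastOpenedAt = bytesWritten;
              }

              void openReaderWithUpperBound(long upperBound) {
                  // in the real patch: build a reader restricted to keys below the bound
                  // and swap it into the tracker in place of the old sstables' regions
              }
          }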

      Some complications:

      1. we must keep a running copy of the compression metadata to reopen with;
      2. we need to be able to replace an sstable with itself but a different lower bound;
      3. we need to drop the old page cache only when readers have finished.

      Attachments

      1. 6916.fix.txt
        1 kB
        Benedict
      2. 6916.fix2.txt
        8 kB
        Benedict
      3. 6916.fixup.txt
        4 kB
        Benedict
      4. 6916-stock2_1.mixed.cache_tweaks.tar.gz
        45 kB
        Ryan McGuire
      5. 6916-stock2_1.mixed.logs.tar.gz
        44 kB
        Ryan McGuire
      6. 6916v3-preempive-open-compact.logs.gz
        40 kB
        Ryan McGuire
      7. 6916v3-preempive-open-compact.mixed.2.logs.tar.gz
        42 kB
        Ryan McGuire
      8. 6916v3-premptive-open-compact.mixed.cache_tweaks.2.tar.gz
        47 kB
        Ryan McGuire

          Activity

          Benedict added a comment -

          Here is a patch that adds this functionality, and also drops support for preheating page cache or dropping the page cache on writes, since they are no longer likely to provide any benefit above the standard behaviour with this patch.

          Marcus Eriksson added a comment -

          In general the patch looks good; a few comments:

          • Can we get some benchmarks done to see if this is worth it? Intuitively it feels like it would be a big win, not only for the page cache but also because we only have to read one sstable for the compacted keys; still, it would be nice to have some evidence. We could also use this to find a good default value for sstable_preemptive_open_interval_in_mb (see the yaml sketch after this list).
          • There's a problem with fully expired sstables in CompactionTask: we need to replace "toCompact" instead of "actuallyCompact".
          • KeyCacheTest fails (I haven't had time to look deeper into it).
          • A few comments around replaceReader/replaceCompacted... and related code in CompactionTask would be helpful; it's a bit harder to grasp now that we start using the compacted sstables earlier and then call DataTracker.replace(...) with an empty list.
          • In SSTR.cloneWithNewStart(..), CLibrary.trySkipCache(dfile.path, 0, 0); is called when newStart > last; is that on purpose? I guess we could drop the page cache for the whole file in this case?
          • In MMappedSegmentedFile, line 199, it looks like that ternary expression is not really needed, since we add raf.length() in boundaries above?
          • A preheat comment is left over in CompactionTask.
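
          A sketch of how that knob might look in cassandra.yaml (the value shown is illustrative, not a tuned default):

              # how often to open the sstable being written, before it is complete;
              # a sketch only - the value below is illustrative
              sstable_preemptive_open_interval_in_mb: 50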

          We could probably do the same thing for the rest of the compactions (upgradesstables, scrub, etc.)? Let's do that in a follow-up ticket (if at all).

          Show
          krummas Marcus Eriksson added a comment - In general the patch looks good, a few comments; Can we get some benchmarks done if this is worth it? I know intuitively it feels like it would be a big win, not only for the page cache, but also since we get to only read one sstable for the compacted keys, but still would be nice to have some evidence. I guess we could also do this to find a good default value for sstable_preemptive_open_interval_in_mb Problem with fully expired sstables in CompactionTask, we need to replace "toCompact" instead of "actuallyCompact". KeyCacheTest fails (not had time to look deeper into it) A few comments around the replaceReader/replaceCompacted... and related in CompactionTask would be helpful, a bit harder to grasp now that we start using the compacted sstables earlier and then call DataTracker.replace(...) with an empty list. In SSTR.cloneWithNewStart(..), CLibrary.trySkipCache(dfile.path, 0, 0); is called when newStart > last, is that on purpose? I guess we could drop the page cache for the whole file in this case? In MMappedSegmentedFile, line 199, it looks like that ternary expression is not really needed since we add raf.length() in boundaries above? Preheat comment left in CompactionTask We could probably do the same thing for the rest of the compactions (upgradesstables, scrub, etc)? Lets do that in a followup ticket (if at all)
          Hide
          benedict Benedict added a comment - - edited

          I know intuitively it feels like it would be a big win, not only for the page cache, but also since we get to only read one sstable for the compacted keys, but still would be nice to have some evidence.

          It would also be interesting to test this patch out against a constrained memory scenario, and see how well it behaves, as it should occupy much less memory at any given moment during a compaction. However, I think for the moment confirming we get the expected outcome of fixing CASSANDRA-6746 is enough in my book. I'll ask Ryan to run that test once I've addressed your points below.

          1. Problem with fully expired sstables in CompactionTask, we need to replace "toCompact" instead of "actuallyCompact".
          2. KeyCacheTest fails (not had time to look deeper into it)
          3. A few comments around the replaceReader/replaceCompacted... and related in CompactionTask would be helpful, a bit harder to grasp now that we start using the compacted sstables earlier and then call DataTracker.replace(...) with an empty list.
          4. In MMappedSegmentedFile, line 199, it looks like that ternary expression is not really needed since we add raf.length() in boundaries above?
          5. Preheat comment left in CompactionTask

          Good points, thanks. I'll check the failing test.

          In SSTR.cloneWithNewStart(..), CLibrary.trySkipCache(dfile.path, 0, 0); is called when newStart > last, is that on purpose? I guess we could drop the page cache for the whole file in this case?

          That's the idea, yes. A parameter of 0 indicates to drop the whole file.
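
          For background: as far as I can tell, trySkipCache is backed by posix_fadvise(2) with POSIX_FADV_DONTNEED, and POSIX defines len == 0 as "from offset to the end of the file", hence 0 dropping the whole file. A rough JNA-style sketch of such a binding (details illustrative):

              import com.sun.jna.Native;

              final class PageCacheSketch {
                  static { Native.register("c"); } // bind the native method below to libc

                  static final int POSIX_FADV_DONTNEED = 4; // Linux value

                  // per POSIX, len == 0 means "from offset to the end of the file"
                  static native int posix_fadvise(int fd, long offset, long len, int advice);

                  static void dropWholeFile(int fd) {
                      posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
                  }
              }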

          Jonathan Ellis added a comment -

          We could probably do the same thing for the rest of the compactions (upgradesstables, scrub, etc)? Lets do that in a followup ticket (if at all)

          What stops us from generalizing to an API of new SSTableWriter(List<SSTR> sstablesToReplace)?
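
          Something like the following hypothetical shape (names illustrative, not the committed API):

              import java.util.List;

              // The writer is told up front which readers it will replace, so
              // compaction, scrub, upgradesstables etc. would all get early
              // opening through one code path.
              class GeneralisedWriterSketch {
                  interface SSTableReader {}

                  private final List<SSTableReader> sstablesToReplace;

                  GeneralisedWriterSketch(List<SSTableReader> sstablesToReplace) {
                      this.sstablesToReplace = sstablesToReplace;
                  }

                  // every X MB written: open the partial file early and shrink or
                  // replace the overlapping regions of sstablesToReplace in the tracker
              }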

          Benedict added a comment -

          Probably nothing - it looks like an eminently sensible idea. I'll give it a go for my next version and let you know if it goes pear-shaped.

          FTR, if possible, I'll probably extend SSTableWriter to an SSTableRewriter, to keep the functionality separate.

          Marcus Eriksson added a comment -

          What stops us from generalizing

          Sounds good; I just figured we should spend some time on benchmarking before doing that, to make sure the effort isn't wasted if the benchmarks don't show any improvement.

          Marcus Eriksson added a comment -

          That's the idea, yes. A parameter of 0 indicates to drop the whole file.

          trySkipCache with a string path does "while (len > 0)" - and with len = 0 it won't do anything (unless my caffeine levels are too low and I'm missing something obvious).

          I also realized that we need to close the RandomAccessFile in CLibrary.trySkipCache(String path, long offset, long len).
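
          A sketch of both fixes - treat len == 0 as "whole file" before entering the chunked loop, and close the RandomAccessFile via try-with-resources (CHUNK and the fd-based overload are illustrative stand-ins):

              import java.io.FileDescriptor;
              import java.io.IOException;
              import java.io.RandomAccessFile;

              class SkipCacheSketch {
                  // stand-in for the fd-based variant, e.g. backed by posix_fadvise
                  static void trySkipCache(FileDescriptor fd, long offset, long len) { /* ... */ }

                  static void trySkipCache(String path, long offset, long len) {
                      final long CHUNK = 1L << 30; // drop at most 1 GB per native call
                      try (RandomAccessFile raf = new RandomAccessFile(path, "r")) { // closed on exit
                          if (len == 0)
                              len = raf.length() - offset; // 0 used to fall straight through the loop
                          while (len > 0) {
                              long n = Math.min(len, CHUNK);
                              trySkipCache(raf.getFD(), offset, n);
                              offset += n;
                              len -= n;
                          }
                      } catch (IOException e) {
                          // best-effort only: page-cache hints are advisory
                      }
                  }
              }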

          Benedict added a comment -

          Doh. Yes, I should be passing the 0 straight through! Good spots.

          Benedict added a comment -

          Uploaded a new version here. Introduced a new SSTableRewriter that we now use in scrub, upgrade, clean, compaction and anti-compaction. On the whole I think this actually makes the code in these classes a little neater. Also added code to smoothly transfer key cache items over at the same time.

          Ryan McGuire, could you try this patch out against CASSANDRA-6746? I'd be interested to see a mixed workload as well...

          Ryan McGuire added a comment -

          Benedict: Fantastic! http://riptano.github.io/cassandra_performance/graph/graph.html?stats=stats.6916v3-preempive-open-compact.json&metric=op_rate&operation=read

          Working on a mixed load now.
          Hide
          benedict Benedict added a comment -

          Great! Any chance you could upload the logs for the last run? I'm pretty happy it's faster for writes, but I'd like to see what explains the bigger drop in throughput.

          Ryan McGuire added a comment - edited

          Benedict logs attached: 6916v3-preempive-open-compact.logs.gz

          T Jake Luciani added a comment -

          Just for historical purposes: we tried a different approach in 0.7, in CASSANDRA-1902.

          Ryan McGuire added a comment -

          Benedict, yeah, this is killer. Here's a mixed read/write workload:

          http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.6916v3-preempive-open-compact.mixed.2.json&metric=op_rate&operation=mixed

          Logs are attached as: 6916v3-preempive-open-compact.mixed.2.logs.tar.gz

          Benedict added a comment -

          Thanks Ryan McGuire. Could I get the logs for the no-patch run for writes as well to compare against?

          Also, could we do one last test comparing the patch against a version with both populate-page-cache properties set to true? Since this is effectively the "regression" we're competing with, I just want to check we still perform as well.

          Benedict added a comment -

          It's worth noting for other viewers of this ticket that this patch seems to reduce worst write latency quite noticeably, which is a nice side effect. I'm guessing this is because the OS has more buffer cache to play with to keep IO scheduled effectively.

          Marcus Eriksson, what do you think of the new patch?

          Ryan McGuire added a comment -

          Logs for the write,mixed run on stock 2.1: 6916-stock2_1.mixed.logs.tar.gz

          Benedict added a comment -

          Thanks, everything looks good to me. Once we double-check against stock 2.1 with populate page cache enabled to confirm there are no surprises, I'm happy to commit.

          Ryan McGuire added a comment - edited

          Benedict here's the last test you wanted:

          Test plan:

          • Set preheat_kernel_page_cache:true in yaml.
          • Start up C*, create the keyspace and CFs that stress would create, modifying populate_io_cache_on_flush to true.
          • Regular stress write (same as other tests above)
          • mixed stress write (same as other tests above)

          You had told me offline that those settings were not applicable to your patch, but I went ahead and tried them anyway, apples to apples. Results with those settings for both stock 2.1 and your patched version:

          http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.6916v3-preempive-open-compact.mixed.cache_tweaks.2.json&metric=op_rate&operation=mixed

          I noticed that on your branch I got several timeouts during the mixed operation. The second trial I tried (green line) timed out completely midway through the mixed operation. Check out the max latencies metric on that graph.

          However, doing the original test you wanted me to do, which is setting populate_io_cache_on_flush:true and preheat_kernel_page_cache:true ONLY on stock 2.1, comparing to your branch without those settings, I get:

          http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.6916v3-preempive-open-compact.mixed.cache_tweaks.3.json&metric=op_rate&operation=mixed

          This still shows your branch performing slightly better than stock. So it does appear that those settings still affect something.

          New logs:

          • stock C* 2.1 with cache settings on: 6916-stock2_1.mixed.cache_tweaks.tar.gz
          • 6916v3-premptive-open-compact with cache settings on: 6916v3-premptive-open-compact.mixed.cache_tweaks.2.tar.gz

          There are some interesting errors in that second set of logs, worth checking out.

          Furthermore, when testing CASSANDRA-6746 I hadn't realized that we could actually fare well with the existing options like this; they just weren't on by default. It would be great, though, if the default settings made this test pass, which your branch achieves.

          Benedict added a comment -

          Ryan McGuire: 6916v3 should crash if you set the preheat_kernel_page_cache property to true (just tested this locally). Setting the populate_io_cache_on_flush property of the CF would probably work but simply have no effect. Are you sure you were running the correct branch?

          Furthermore, I hadn't realized when testing CASSANDRA-6746 that we could actually fare well with the existing options like this

          The problem is that we consider the default to be better at preventing dramatic page cache churn during compaction, which this should continue to deliver but without the downsides.

          The errors look plausible, but could we confirm we're running the correct (v3) branch, given it didn't crash with the preheat setting in the yaml?

          Marcus Eriksson added a comment -

          New patch comments:

          • Cleanup is broken: first, the log message causes an NPE since newSstable is null; secondly, an ImmutableSet is passed as "rewriting" to SSTableRewriter, causing moveStarts to fail since it calls removeAll on that set. Both are fixed here: https://github.com/krummas/cassandra/commits/bes/6916-2
          • A bit of javadoc around moveStarts would be helpful; just looking at the API now makes it feel like we are moving the start of newReader.
          • When doing anticompaction, should we not clean up old readers? (repairedSSTableWriter.finish(false, repairedAt))
          • Do we need to move starts back in SSTableRewriter.resetAndTruncate()? If we resetAndTruncate right after early opening, I think we could create a gap between the start in the compacting file and the end in the written one.

          Benedict added a comment -

          When doing anticompaction, should we not clean up old readers? (repairedSSTableWriter.finish(false, repairedAt))

          We have two writers open with the same readers here, so we only close the readers when we finish the second writer. It's a bit clunky, I know, but it's not a common occurrence to be rewriting into two places. I'll add comments.

          do we need to move starts back in SSTableRewriter.resetAndTruncate()? If we resetAndTruncate right after doing early opening, i think we could create a gap between the start in the compacting file and the end in the written one

          We always open before performing any append, and open with exclusive upper bounds, i.e. we should only ever truncate back to a position we're still safe at. That said, this is definitely worthy of a comment or two.

          I've copied your tweaks and also added another fix: we weren't handling the marking of compacted sstables correctly everywhere. We need to mark the compaction finished before deleting the old files, and mark the preemptively opened reader as compacting before adding it to the live set, to avoid it being compacted before it's fully written (this was in the earlier patches but dropped out somewhere along the way).

          Also, I left the new HashSet() inside the Rewriter out, as I want the provided set to be mutable, so that users of the rewriter can have access to the currently extant versions. It's not actually necessary anywhere, but I think it will prevent future surprises.

          I've uploaded to the repository, but intend to add some comments to the places we mentioned. The functional stuff should be static now, if you're happy with it.

          Ryan McGuire added a comment -

          Doh! I am using the right branch, but I was wrong about using the preheat_kernel_page_cache setting on it. My test harness was trying to be smart by introspecting Config.java to pre-validate config settings, and it simply tossed the setting out when it saw that it wasn't going to work. I either need to remove that bit of code, or remember why I thought it was a good idea in the first place, so I don't confuse myself like this again!

          The timeouts I was seeing appear to be entirely related to the populate_io_cache_on_flush setting on the CF, which appears to be gone in the latest version of your branch. I've retested and saw no timeouts, so I think we're good.

          Benedict added a comment -

          I think the timeouts were actually caused by my not updating FileCacheService (and there being a race condition or two in there that were exacerbated by this change). I've fixed these in my branch now, so that's probably why everything looks good.

          Marcus Eriksson added a comment -

          OK, I'm happy with the functionality now; a few last comments:

          • Seems we are caching the keys twice, both in SSTableRewriter.maybeReopenEarly and moveStarts.
          • We should perhaps invalidate the newly cached keys when aborting the SSTRW.

          Benedict added a comment -

          We should perhaps invalidate the newly cached keys when aborting the SSTRW.

          I'm not convinced the complexity is really worth it; aborting only happens when something goes wrong, and I want to simply clean up my resources as quickly and safely as possible... walking through the readers, looking up their cache keys and swapping them back seems to me to be quite an expensive cleanup operation that might leave us failing to clean up resources properly. I don't feel too strongly either way though.

          Seems we are caching the keys twice, both in SSTableRewriter.maybeReopenEarly and moveStarts

          Good spot!

          Okay, I've just uploaded a new version with that fixed, more comments, and better encapsulation of mark() and resetAndTruncate() within a new tryAppend method, which more easily encapsulates/demonstrates the safety of the operation.
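
          For readers of the ticket, a sketch of the pattern tryAppend wraps; only mark/append/resetAndTruncate come from the discussion, the Writer interface and other names are stand-ins:

              class RewriterSketch {
                  interface Writer {
                      void mark();                 // record the current safe file position
                      long append(Object row);     // may fail part-way through a row
                      void resetAndTruncate();     // rewind file and state to the mark
                  }

                  private final Writer writer;

                  RewriterSketch(Writer writer) { this.writer = writer; }

                  long tryAppend(Object row) {
                      writer.mark();
                      try {
                          return writer.append(row);
                      } catch (RuntimeException e) {
                          // we only ever truncate back to a position already opened with
                          // an exclusive upper bound, so early readers stay valid
                          writer.resetAndTruncate();
                          throw e;
                      }
                  }
              }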

          Benedict added a comment -

          Rebased, squashed and pushed to 6916-final

          Marcus Eriksson added a comment -

          committed!

          T Jake Luciani added a comment -

          This is a really interesting ticket! I'm just looking at the final patch and wondering about how this works with leveled compaction. Aren't the sstables being merged from the previous level updated with the new level at the end of compaction?

          What level does the partial sstable read use? Are these considered part of level 0 till they are fully written?

          More generally, is there a stress test we have that uses wide rows? I think it's important we stress both, as leveled compaction is becoming more widely used.

          Benedict added a comment -

          What level does the partial sstable read use? Are these considered part of level 0 till they are fully written?

          It doesn't - it just relies on being present in the interval tree. It doesn't need a level except for purposes of compaction, which is a bit premature when it's not even finished yet.

          More generally, is there a stress test we have that uses wide rows?

          No, if by wide you mean CQL composite keys - this is something I want to address very soon. It's a pretty simple addition to the current stress tool, and there have been a number of tickets recently where it would have been very helpful. You're welcome to take a look at it yourself. In this instance I don't think it would make a real difference though.

          Aleksey Yeschenko added a comment -

          if by wide you mean CQL composite keys

          By wide we usually mean just partitions with a metric ton of cells, composite or not.

          Benedict added a comment -

          Missed an if (offline) test in CompactionTask needed to ensure offline split works. Not totally sure why it didn't break before this patch, but either way it was an oversight. Attaching a fix that also adds class-level javadoc to SSTableRewriter.

          Marcus Eriksson added a comment -

          committed

          Benedict added a comment -

          Spotted a bug in abort() - we close the bloom filter when aborting an SSTableWriter, but now we may already have it open for read. Attaching a simple fix to ensure we only close the BF if we haven't handed a reference elsewhere.
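
          A sketch of that guard (field and type names are illustrative):

              class AbortSketch {
                  private AutoCloseable bloomFilter;  // stand-in for the real filter type
                  private boolean bfReferenceShared;  // set once an early-opened reader borrows it

                  void abort() throws Exception {
                      if (!bfReferenceShared)         // closing a shared filter would break live readers
                          bloomFilter.close();
                      // ... delete temporary files, release remaining resources ...
                  }
              }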

          Marcus Eriksson added a comment -

          committed

          Benedict added a comment -

          One more bug: this breaks snapshots, as they assume each SSTableReader is backed by an immutable file on disk. Attaching a simple patch that skips readers that are "isOpenEarly".
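
          A sketch of that skip; only the isOpenEarly flag comes from the comment, the loop shape and helper are illustrative:

              import java.io.File;

              class SnapshotSketch {
                  static class SSTableReader { boolean isOpenEarly; }

                  void snapshot(Iterable<SSTableReader> liveReaders, File snapshotDir) {
                      for (SSTableReader reader : liveReaders) {
                          if (reader.isOpenEarly)
                              continue;                  // backed by a still-growing file; skip
                          hardLink(reader, snapshotDir); // illustrative helper
                      }
                  }

                  void hardLink(SSTableReader reader, File dir) { /* link each component file */ }
              }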

          Marcus Eriksson added a comment -

          committed


            People

            • Assignee: Benedict
            • Reporter: Benedict
            • Reviewer: Marcus Eriksson
            • Votes: 0
            • Watchers: 6
