Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 4.x
    • Component/s: None

      Description

      Broken out from CASSANDRA-6696, we should split sstables based on ranges during compaction.

      Requirements:

      • don't create tiny sstables - keep them bunched together until a single vnode's data is big enough (how big is configurable)
      • make it possible to run existing compaction strategies on the per-range sstables

      We should probably add a global compaction strategy parameter that states whether this should be enabled or not.

        Issue Links

          Activity

          krummas Marcus Eriksson added a comment -

          Pushed a new branch here for early feedback, still needs cleanup and tests

          Enable like this:

          ALTER TABLE x.y WITH compaction={'class':'LeveledCompactionStrategy', 'range_aware_compaction':'true', 'min_range_sstable_size_in_mb':'15'}
          • Run a compaction strategy instance per owned range (with num_tokens=256 and rf=3, we will have 768 * 2 instances (repaired/unrepaired data)).
          • To avoid getting very many tiny sstables in the per-range strategies, we keep them outside the strategy until the estimated size of a range-sstable is larger than 'min_range_sstable_size_in_mb'. (estimation usually gets within a few % of the actual value).
          • We do STCS among the many-range sstables (called "L0", though that name might not be ideal since LCS already uses L0)
          • We currently prioritize compaction in L0 to get sstables out of there as quickly as possible
          • If an sstable fits within a range, it is added to the corresponding range-compaction strategy - this should avoid getting a lot of L0 sstables after streaming, for example (a rough sketch of this routing appears right after the nodetool output below)
          • Adds a describecompactionstrategy nodetool command which displays information about the configured compaction strategy (like sstables per range etc). Example with only unrepaired data and 2 data directories - we first split the owned ranges over those 2 directories, and then we split on a per range basis, so the first RangeAwareCompactionStrategy is responsible for half the data and the second one is responsible for the rest:
             $ bin/nodetool describecompactionstrategy keyspace1 standard1
            
            -------------------------------------------------- keyspace1.standard1 --------------------------------------------------
            Strategy=class org.apache.cassandra.db.compaction.RangeAwareCompactionStrategy, for 167 unrepaired sstables, boundary tokens=min(-9223372036854775808) -> max(-4095785201827646), location=/home/marcuse/c/d1
            Inner strategy: class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy (257 instances, 162 total sstables)
              sstable counts: 
                        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
                      ------------------------------------------------------------------------------------------
              0.. 29 |  1  3  0  0  2  3  0  3  3  0  3  0  2  1  0  1  0  1  0  3  3  4  1  0  3  1  0  0  0  0
             30.. 59 |  0  0  0  3  0  2  2  0  3  0  3  3  0  1  3  3  3  0  2  0  1  2  0  0  0  1  0  3  0  0
             60.. 89 |  1  0  0  1  1  1  1  0  1  0  2  3  1  0  3  1  2  3  2  0  0  3  2  1  1  0  0  2  3  1
             90..119 |  0  1  2  0  0  3  0  3  3  1  0  0  3  0  2  0  2  0  2  1  3  0  2  1  1  3  1  0  3  0
            120..149 |  2  0  3  1  3  0  0  3  3  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            150..179 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            180..209 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            210..239 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            240..257 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            Strategy=class org.apache.cassandra.db.compaction.RangeAwareCompactionStrategy, for 221 unrepaired sstables, boundary tokens=max(-4095785201827646) -> max(9223372036854775807), location=/var/lib/c1
            Inner strategy: class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy (257 instances, 215 total sstables)
              sstable counts: 
                        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
                      ------------------------------------------------------------------------------------------
              0.. 29 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
             30.. 59 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
             60.. 89 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
             90..119 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            120..149 |  0  0  0  0  0  0  0  0  0  0  1  6  0  0  3  0  3  0  3  3  3  3  1  0  1  0  2  0  3  2
            150..179 |  3  3  3  0  0  3  3  0  3  2  3  1  3  3  3  3  0  0  0  3  0  1  1  0  6  3  3  0  3  3
            180..209 |  0  1  1  3  1  3  1  3  3  2  3  3  0  3  0  3  1  0  0  1  2  3  0  0  1  1  0  0  3  3
            210..239 |  3  3  3  2  0  6  1  3  0  0  3  3  3  1  3  4  3  3  3  0  3  0  3  1  2  2  0  2  0  0
            240..257 |  1  0  3  1  0  3  3  0  0  0  0  0  0  3  3  0  0
            Strategy=class org.apache.cassandra.db.compaction.RangeAwareCompactionStrategy, for 0 repaired sstables, boundary tokens=min(-9223372036854775808) -> max(-4095785201827646), location=/home/marcuse/c/d1
            Inner strategy: class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy (257 instances, 0 total sstables)
              sstable counts: 
                        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
                      ------------------------------------------------------------------------------------------
              0.. 29 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
             30.. 59 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
             60.. 89 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
             90..119 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            120..149 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            150..179 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            180..209 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            210..239 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            240..257 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            Strategy=class org.apache.cassandra.db.compaction.RangeAwareCompactionStrategy, for 0 repaired sstables, boundary tokens=max(-4095785201827646) -> max(9223372036854775807), location=/var/lib/c1
            Inner strategy: class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy (257 instances, 0 total sstables)
              sstable counts: 
                        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
                      ------------------------------------------------------------------------------------------
              0.. 29 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
             30.. 59 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
             60.. 89 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
             90..119 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            120..149 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            150..179 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            180..209 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            210..239 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
            240..257 |  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
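
          To make the bullet points above a bit more concrete, here is a minimal, heavily
          simplified sketch of the routing and the min_range_sstable_size_in_mb gating. All
          class and method names here (TokenRange, SSTable, shouldSplitL0PerRange) are
          hypothetical and purely for illustration - this is not the API of the actual branch.

            import java.util.ArrayList;
            import java.util.Arrays;
            import java.util.HashMap;
            import java.util.List;
            import java.util.Map;

            public class RangeRoutingSketch
            {
                static class TokenRange
                {
                    final long left, right;                       // (left, right]
                    TokenRange(long left, long right) { this.left = left; this.right = right; }
                    boolean contains(long first, long last) { return first > left && last <= right; }
                }

                static class SSTable
                {
                    final long firstToken, lastToken, sizeBytes;
                    SSTable(long firstToken, long lastToken, long sizeBytes)
                    { this.firstToken = firstToken; this.lastToken = lastToken; this.sizeBytes = sizeBytes; }
                }

                final List<TokenRange> ownedRanges;
                final long minRangeSSTableSizeBytes;
                final Map<TokenRange, List<SSTable>> perRangeStrategies = new HashMap<>();
                final List<SSTable> l0 = new ArrayList<>();       // multi-range sstables, size tiered

                RangeRoutingSketch(List<TokenRange> ownedRanges, long minRangeSSTableSizeMb)
                {
                    this.ownedRanges = ownedRanges;
                    this.minRangeSSTableSizeBytes = minRangeSSTableSizeMb * 1024L * 1024L;
                    for (TokenRange range : ownedRanges)
                        perRangeStrategies.put(range, new ArrayList<>());
                }

                // An sstable that fits entirely inside one owned range goes straight to that
                // range's inner strategy; everything else stays in the "L0" pool.
                void addSSTable(SSTable sstable)
                {
                    for (TokenRange range : ownedRanges)
                    {
                        if (range.contains(sstable.firstToken, sstable.lastToken))
                        {
                            perRangeStrategies.get(range).add(sstable);
                            return;
                        }
                    }
                    l0.add(sstable);
                }

                // Crude stand-in for the size estimation: only start writing per-range sstables
                // out of L0 once the estimated data per range exceeds min_range_sstable_size_in_mb.
                boolean shouldSplitL0PerRange()
                {
                    long l0Bytes = 0;
                    for (SSTable sstable : l0)
                        l0Bytes += sstable.sizeBytes;
                    return l0Bytes / ownedRanges.size() >= minRangeSSTableSizeBytes;
                }

                public static void main(String[] args)
                {
                    TokenRange r1 = new TokenRange(-100, 0), r2 = new TokenRange(0, 100);
                    RangeRoutingSketch sketch = new RangeRoutingSketch(Arrays.asList(r1, r2), 1);
                    sketch.addSSTable(new SSTable(-50, -10, 4 << 20));   // fits in r1 -> per-range strategy
                    sketch.addSSTable(new SSTable(-50, 50, 8 << 20));    // spans both ranges -> "L0"
                    System.out.println("split L0 per range yet? " + sketch.shouldSplitL0PerRange());
                }
            }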
            

          Comments/ideas/worries? Yuki Morishita, sankalp kohli, Aleksey Yeschenko, Jonathan Ellis, anyone?

          krummas Marcus Eriksson added a comment -

          setting to patch available as I think it is ready to get some feedback
          https://github.com/krummas/cassandra/commits/marcuse/rangeawarecompaction
          dtests: https://github.com/krummas/cassandra-dtest/commits/marcuse/10540
          dtests need the multi datadir ccm: https://github.com/krummas/ccm/commits/multi-data-dirs

          both the dtest and cassandra patches are on top of the patches in CASSANDRA-6696

          Also setting Philip Thompson as tester (feel free to reassign) - I need help with some big-dataset tests here, just to see that switching a big dataset to use this does not explode. Testing general operational stuff like bootstrapping, decommission etc. with at least 100+ GB nodes is also needed.

          krummas Marcus Eriksson added a comment -

          pushed a new commit with a fix and some timing output to see how long it takes getting new compaction tasks

          carlyeks Carl Yeksigian added a comment -

          Marcus Eriksson: can you rebase this and rerun the tests now that CASSANDRA-6696 is in?

          krummas Marcus Eriksson added a comment -

          pushed a new branch

          dtest: http://cassci.datastax.com/job/krummas-marcuse-10540-dtest/
          utest: http://cassci.datastax.com/job/krummas-marcuse-10540-testall/

          I see that there are a few tests that seem to break, I'll look into those

          krummas Marcus Eriksson added a comment -

          rebased and test runs look clean:
          http://cassci.datastax.com/job/krummas-marcuse-10540-dtest/
          http://cassci.datastax.com/job/krummas-marcuse-10540-testall/

          jbellis Jonathan Ellis added a comment -

          While waiting for review, what kind of write amplification improvement (i.e. total bytes compacted given constant bytes loaded) are you seeing with STCS?

          krummas Marcus Eriksson added a comment -

          Have not measured, will try to do that this week

          krummas Marcus Eriksson added a comment -

          Planning to use CASSANDRA-11844 to write a bunch of stress tests for this; they should be finished before we consider committing this

          krummas Marcus Eriksson added a comment -

          These "benchmarks" were run using compaction-stress with this yaml profile (only the compaction configuration was modified per run): it generates 40GB of data and then compacts those sstables using 8 threads. All tests were run with 256 tokens on my machine (2 SSDs, 32GB RAM):

          ./tools/bin/compaction-stress write -d /var/lib/cassandra -d /home/marcuse/cassandra -g 40 -p blogpost-range.yaml -t 4 -v 256
          ./tools/bin/compaction-stress compact -d /var/lib/cassandra -d /home/marcuse/cassandra -p blogpost-range.yaml -t 8 -v 256
          

          First a baseline - it takes about 7 minutes to compact 40GB of data with STCS, and we get a write amplification (compaction bytes written / size before) of about 1.46.

          • 40GB + STCS
            size before      size after       compaction bytes written    time     number of compactions
            42986704571      31305948786      62268272752                 7:44     26
            43017694284      31717603488      62800073327                 7:04     26
            42863193047      31244649872      64673778727                 6:44     26
            42962733336      31842455113      62985984309                 6:14     26
            43107421526      32526047125      61657717328                 6:04     26
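
          As a quick cross-check of the write amplification figure (compaction bytes written /
          size before), here is a small throwaway calculation over the five baseline runs above;
          it averages out to roughly 1.46:

            public class StcsWriteAmplification
            {
                public static void main(String[] args)
                {
                    // values copied from the five 40GB + STCS runs in the table above
                    long[] sizeBefore   = { 42986704571L, 43017694284L, 42863193047L, 42962733336L, 43107421526L };
                    long[] bytesWritten = { 62268272752L, 62800073327L, 64673778727L, 62985984309L, 61657717328L };

                    double sum = 0;
                    for (int i = 0; i < sizeBefore.length; i++)
                        sum += (double) bytesWritten[i] / sizeBefore[i];   // per-run write amplification

                    System.out.printf("average write amplification: %.2f%n", sum / sizeBefore.length);   // ~1.46
                }
            }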

          With range aware compaction and a small min_range_sstable_size_in_mb we compact slower, about 2x the time, but the end result is smaller, with a slightly lower write amplification (1.44). The reason for the longer time is that we need to do a lot more tiny compactions for each vnode. The reason for the smaller size after the compactions is that we are much more likely to compact overlapping sstables together, since we compact within each vnode.

          • 40GB + STCS + range_aware + min_range_sstable_size_in_mb: 1
            size before      size after       compaction bytes written    time     number of compactions
            42944940703      25352795435      61734295478                 13:18    286
            42896304174      25830662102      62049066195                 15:45    287
            43091495756      24811367911      61448601743                 12:25    287
            42961529234      26275106863      63118850488                 13:17    284
            42902111497      25749453764      61529524300                 13:54    280

          As we increase min_range_sstable_size_in_mb the time spent is reduced, the size after compaction increases, and the number of compactions is reduced, since we don't promote sstables to the per-vnode strategies as quickly. With a large enough min_range_sstable_size_in_mb the behaviour will be the same as STCS (plus a small overhead for estimating the size of the next vnode range during compaction).

          • 40GB + STCS + range_aware + min_range_sstable_size_in_mb: 5
            size before      size after       compaction bytes written    time     number of compactions
            43071111106      27586259306      62855258024                 10:35    172
          • 40GB + STCS + range_aware + min_range_sstable_size_in_mb: 10
            size before      size after       compaction bytes written    time     number of compactions
            42998501805      28281735688      65469323764                 9:45     109
          • 40GB + STCS + range_aware + min_range_sstable_size_in_mb: 20
            size before      size after       compaction bytes written    time     number of compactions
            42801860659      28934194973      66554340039                 10:05    48
          • 40GB + STCS + range_aware + min_range_sstable_size_in_mb: 50
            size before      size after       compaction bytes written    time     number of compactions
            42881416448      30352758950      61223610818                 7:25     27

          With LCS and a small sstable_size_in_mb we see a huge difference with range aware compaction, because of the number of compactions needed to achieve the leveling without it. With range aware, we get fewer levels in each vnode range, which is much quicker to compact. Write amplification is about 2.0 with range aware and 3.4 without.

          • 40GB + LCS + sstable_size_in_mb: 10 + range_aware + min_range_sstable_size_in_mb: 10
            size before      size after       compaction bytes written    time     number of compactions
            43170254812      26511935628      87637370434                 19:55    903
            43015904097      26100197485      83125478305                 14:45    854
            43188886684      25651102691      87520409116                 19:55    920
          • 40GB + LCS + sstable_size_in_mb: 10
            size before      size after       compaction bytes written    time     number of compactions
            43099495889      23876144309      139000531662                28:25    3751
            42811000078      24620085107      147722973544                30:35    3909
            42879141849      24479485292      146194679395                30:46    3882
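
          For reference, applying the same write amplification calculation (compaction bytes
          written / size before) to the rows above gives roughly 2.03, 1.93 and 2.03 with range
          aware compaction and roughly 3.23, 3.45 and 3.41 without, consistent with the ~2.0 vs
          ~3.4 figures quoted above.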

          If we bump the LCS sstable_size_in_mb to the default we get more similar results. Write amplification is smaller with range aware compaction, but the size after compaction is also bigger. The reason for the bigger size once compaction has settled is that we run with a bigger min_range_sstable_size_in_mb, which means more data stays out of the per-range compaction strategies and is therefore only size tiered. This probably also explains the reduced write amplification - 2.0 with range aware and 2.3 without.

          • 40GB + LCS + sstable_size_in_mb: 160 + range_aware + min_range_sstable_size_in_mb: 20
            size before      size after       compaction bytes written    time     number of compactions
            42970784099      27044941599      85933586287                 12:55    180
            42953512565      26229232777      82158863291                 11:36    155
            43028281629      26025950993      86704157660                 11:25    177
          • 40GB + LCS + sstable_size_in_mb: 160
            size before      size after       compaction bytes written    time     number of compactions
            43120992697      24487560567      100347633105                12:25    151
            42854926611      24466503628      102492898148                10:55    155
            42919253642      24831918330      100902215961                12:15    161

          jjirsa Jeff Jirsa added a comment -

          Carl Yeksigian - this still on your plate to review?

          carlyeks Carl Yeksigian added a comment -

          I'm +1 on the code here, I'm just waiting on some more testing from Philip Thompson.

          Thanks for the ping Jeff Jirsa.

          cam1982 Cameron Zemek added a comment -

          I have found one issue with the code. It states "To avoid getting very many tiny sstables in the per-range strategies, we keep them outside the strategy until the estimated size of a range-sstable is larger than 'min_range_sstable_size_in_mb'. (estimation usually gets within a few % of the actual value)."

          However, RangeAwareCompactionStrategy::addSSTable does not check that the sstable meets the minimum size. This is potentially an issue with repairs that stream sections of sstables, or if a memtable only includes a single token range on flush.

          On a different note, I notice the performance testing so far has looked at write amplification. I suspect RangeAwareCompaction could also improve read performance, since partitions are more likely to exist in fewer sstables (i.e. it reduces the sstables per read). It would be interesting to see SSTable leaders for partitions with STCS vs RangeAwareCompaction + STCS. You can get a list of sstable leaders with the ic-pstats tool we have open sourced here: https://github.com/instaclustr/cassandra-sstable-tools

          krummas Marcus Eriksson added a comment -

          removing patch available - making a few changes (we should flush to separate sstables if we don't use vnodes - skip 'L0')


            People

            • Assignee: krummas Marcus Eriksson
            • Reporter: krummas Marcus Eriksson
            • Reviewer: Carl Yeksigian
            • Votes: 6
            • Watchers: 26
