CASSANDRA-5519: Reduce index summary memory use for cold sstables

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 2.1 beta1
    • Component/s: Core
    • Labels: None

    Attachments

    1. downsample.py (3 kB, Tyler Hobbs)
    2. 5519-v1.txt (107 kB, Tyler Hobbs)


        Activity

        Jonathan Ellis added a comment -

        WFM. Committed!

        Tyler Hobbs added a comment -

        releaseReference, which can be reverted back to trunk form since isReplaced == !isCompacted

        True

        close, which is only called by snapshot repair (and releaseReference) which will never do any index summary replacements

        We still need to have different behavior for the close() call by snapshot repair, as it needs to perform the full close even though isCompacted will be false. While we could add a parameter to close() or define a closeReplacedReader() method, it seems clearer and more future-proof to keep the isReplaced flag.

        Jonathan Ellis added a comment -

        Hmm. I see two uses of isReplaced:

        1. releaseReference, which can be reverted back to trunk form since isReplaced == !isCompacted
        2. close, which is only called by snapshot repair (and releaseReference) which will never do any index summary replacements
        Tyler Hobbs added a comment -

        I'm pretty sure we can get rid of the isReplaced flag now.

        We still need it in order to do the proper cleanup on the replaced SSTR once all references are released, unless I'm missing something.

        Jonathan Ellis added a comment -

        I'm pretty sure we can get rid of the isReplaced flag now.

        Tyler Hobbs added a comment -

        Jonathan Ellis Attached patch 5519-v1.txt includes your suggested changes. (My CASSANDRA-5519 branch is still good, as well.)

        Jonathan Ellis added a comment -

        the on-disk Summary is replaced whenever it is resampled

        Good call; startup time is a big pain point for some people and we don't want to make that worse.

        Tyler Hobbs added a comment -

        I should also mention that if you want to test it out, I suggest setting logging to TRACE for o.a.c.io.sstable.IndexSummaryManager, setting index_summary_capacity_in_mb to 1, and index_summary_resize_interval_in_minutes to 1. That should give you a good picture of what's going on.
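
        For example, a throwaway test configuration along those lines might look like the following in cassandra.yaml (deliberately aggressive values for exercising the resampling logic, not sensible defaults):

        index_summary_capacity_in_mb: 1
        index_summary_resize_interval_in_minutes: 1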

        Tyler Hobbs added a comment -

        This should be good for a second round of reviewing. I opened a pull request against my own repo so that you can comment inline, if you'd like: https://github.com/thobbs/cassandra/pull/1

        Changes since the last review:

        • The entire SSTableReader is replaced instead of just the IndexSummary.
        • Space used by compacting SSTables is accounted for.
        • Enough extra space is reserved to cover rebuilding the largest summary.
        • In order to stay within the memory usage limit on startup, the on-disk Summary is replaced whenever it is resampled. I increased the threshold for downsampling to make this less frequent. The alternative would be to always keep the full summary on disk and have a somewhat more complicated startup procedure. I would appreciate your thoughts on this.
        Tyler Hobbs added a comment -

        Created CASSANDRA-6379 for the index_interval changes.

        Tyler Hobbs added a comment -

        WFM. What does that leave for this one?

        I still need to account for space used by compacting SSTables, and I'm putting it through some more thorough testing.

        Jonathan Ellis added a comment -

        WFM. What does that leave for this one?

        Tyler Hobbs added a comment -

        Would it be alright to split the replacement of index_interval by max_index_interval and min_index_interval into another ticket just for sanity's sake? It looks like a lot of changes need to be done for that, and they're independent of the changes for this ticket.

        Tyler Hobbs added a comment -

        The cleanup looks good overall, thanks.

        Rather than expose MSL directly as a config option, how about changing index_interval to max_index_interval and adding a min_index_interval? We could compute (as close as possible) MSL from min_index_interval.

        That sounds good to me.

        (I don't think users will need to tune BSL. 128 lets us be accurate to within 1%, which seems totally reasonable to me.)

        Agreed.

        Jonathan Ellis added a comment -

        Rather than expose MSL directly as a config option, how about changing index_interval to max_index_interval and adding a min_index_interval? We could compute (as close as possible) MSL from min_index_interval.

        (I don't think users will need to tune BSL. 128 lets us be accurate to within 1%, which seems totally reasonable to me.)
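
        For illustration, one way the minimum sampling level could fall out of the two intervals (just a sketch, with example constants):

        BASE_SAMPLING_LEVEL = 128

        def min_sampling_level(min_index_interval, max_index_interval):
            # at sampling level L the effective interval is min_index_interval * BASE / L,
            # so the smallest level that still honors max_index_interval is the ceiling of
            # BASE * min_index_interval / max_index_interval
            return (BASE_SAMPLING_LEVEL * min_index_interval + max_index_interval - 1) // max_index_interval

        # e.g. min_index_interval=128, max_index_interval=2048 -> minimum sampling level of 8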

        Jonathan Ellis added a comment -

        Pushed my cleanup to https://github.com/jbellis/cassandra/commits/5519.

        (Moved the ISM init to StorageService, where we have some existing examples of similar initialization.)

        Tyler Hobbs added a comment -

        What is the relationship between BASE_SAMPLING_LEVEL and MIN_SAMPLING_LEVEL with indexInterval?

        BASE/MIN_SAMPLING_LEVEL are orthogonal to indexInterval. BASE_SAMPLING_LEVEL essentially sets the granularity at which you can down/upsample. MIN_SAMPLING_LEVEL sets a limit on how low you can downsample. (I'll note that we could potentially raise indexInterval alongside these changes in order to have more summary entries for hot sstables.)

        How many rows do we get for 5% of an 8GB heap?

        That gives us ~410 MiB to work with. If we assume the average key length is 8 bytes, each summary entry uses 20 bytes of space, giving us ~21 million summary entries.

        At full sampling, that's 21MM * 128 = 2.7 billion rows, assuming no overlap across sstables. At minimum sampling, that's ~11 billion rows.

        If the avg key size is 16 bytes, that drops to ~2 and ~8 billion rows.
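
        For reference, the rough arithmetic behind those figures (assuming the entry sizes above and that minimum sampling keeps a quarter of the entries):

        pool_bytes = 0.05 * 8 * 1024**3                           # 5% of an 8 GiB heap ~= 410 MiB
        entries_8b = pool_bytes / 20                              # 8-byte keys -> ~21 million entries
        full_8b, min_8b = entries_8b * 128, entries_8b * 128 * 4  # ~2.7 billion and ~11 billion rows
        entries_16b = pool_bytes / 28                             # 16-byte keys -> ~15 million entries
        full_16b, min_16b = entries_16b * 128, entries_16b * 128 * 4  # ~2 billion and ~8 billion rows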

        Isn't it a minor bug to just ignore compacting sstables? Suggest reducing the memory pool we allocate to the uncompacting ones by the amount allocated to the compacting ones.

        Good point, I agree.

        Could we just resample at compaction time instead of dealing with refcounting or locking? That probably gives up too much of the potential benefits.

        Yeah, that would probably be okay for small sstables that are compacted frequently, but the large sstables would be tuned poorly, and those make up the majority of the memory use.

        I think we could make it almost as elegant by using the datatracker replace mechanism originally for compaction, to build a new SSTR and swap it in w/o extra concurrency controls.

        That's a good idea; I think it would be fairly clean. I'll give that a shot.

        Is the idea behind touching it in DD to force the mbean to be loaded, or is there a circular dependency that breaks w/o that?

        Neither the IndexSummaryManager singleton nor the mbean is loaded without that. No other classes use the IndexSummaryManager, so the static fields are never initialized. (Just importing the classes doesn't seem to trigger the class loader.)

        Jonathan Ellis added a comment -

        What is the relationship between BASE_SAMPLING_LEVEL and MIN_SAMPLING_LEVEL with indexInterval?

        How many rows do we get for 5% of an 8GB heap?

        Isn't it a minor bug to just ignore compacting sstables? Suggest reducing the memory pool we allocate to the uncompacting ones by the amount allocated to the compacting ones.

        Could we just resample at compaction time instead of dealing with refcounting or locking? That probably gives up too much of the potential benefits. But I think we could make it almost as elegant by using the datatracker replace mechanism originally for compaction, to build a new SSTR and swap it in w/o extra concurrency controls.

        Is the idea behind touching it in DD to force the mbean to be loaded, or is there a circular dependency that breaks w/o that?

        Tyler Hobbs added a comment -

        I need to put this through more thorough testing and benchmarking, but I think it's at a good point for a preliminary review: https://github.com/thobbs/cassandra/compare/CASSANDRA-5519

        A few comments/questions:

        • I went with best-effort for the memory pool (if all summaries don't fit in the allotted space even at the minimum sampling level, there's nothing we can do about it). The amount of memory used may also temporarily exceed the limit while building new summaries.
        • There are two new cassandra.yaml options: one for controlling the memory pool size and one for regulating how frequently summaries are resized. These can also be set through JMX. We could conceivably also make the down/upsample thresholds and the minimum sampling level configurable. All of these default values are just guesses.
        • I went with a reference counting strategy for freeing the IndexSummary's Memory. This makes the API a bit unpleasant (mostly in SSTR), but it should have low overhead. A ReadWriteLock might also work well instead of this with a cleaner API; let me know if I should benchmark the two for comparison.
        • I'm triggering the IndexSummaryManager singleton's initialization in DatabaseDescriptor; this feels wrong, so I'm open to suggestions.
        Tyler Hobbs added a comment -

        How do we want to handle the memory pool not being large enough to accommodate all of the index summaries (even after downsampling)? Just make it a best-effort?

        Jonathan Ellis added a comment -

        LGTM

        Tyler Hobbs added a comment -

        The attached downsample.py script demonstrates the downsampling algorithm. It's a touch complex, but it would be easy to precompute or cache the downsampling patterns if needed.

        Here's an example run with an original index summary size of 16 and a "resolution" of 8, meaning each minimal downsample run removes 1/8th of the original points. The top row is the original index summary and each row below it represents one downsampling run:

        ~ $ ./downsample.py 16 8
          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
              1   2   3   4   5   6   7       9  10  11  12  13  14  15
              1   2   3       5   6   7       9  10  11      13  14  15
              1       3       5   6   7       9      11      13  14  15
              1       3       5       7       9      11      13      15
                      3       5       7              11      13      15
                      3               7              11              15
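
        For illustration, here's a minimal sketch (not the attached script itself) that reproduces the rows above; it removes each run's points at a stride equal to the resolution over the original indexes, starting from a bit-reversed offset so the removals stay evenly spaced:

        def bit_reverse(x, bits):
            result = 0
            for _ in range(bits):
                result = (result << 1) | (x & 1)
                x >>= 1
            return result

        def downsample_rows(size, resolution, runs):
            # each run drops size/resolution of the *original* points
            bits = resolution.bit_length() - 1   # resolution is assumed to be a power of two
            remaining = set(range(size))
            yield sorted(remaining)
            for run in range(runs):
                offset = bit_reverse(run, bits)
                remaining -= set(range(offset, size, resolution))
                yield sorted(remaining)

        for row in downsample_rows(16, 8, 6):
            print("".join("%4d" % i if i in row else " " * 4 for i in range(16)))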
        
        Jonathan Ellis added a comment -

        Sounds reasonable.

        Tyler Hobbs added a comment -

        We could define a fixed-size memory pool, similar to what we do for memtables or cache, and allocate it to the sstables proportional to their hotness.

        It would be hard to describe this in text, so here's my Pythonic pseudocode for distributing the fixed-size memory pool:

        total_reads_per_sec = sum(sstable.reads_per_sec for sstable in sstables)
        sstables_to_downsample = set()
        leftover_entries = 0
        for sstable in sstables:
            allocated_space = total_space * (sstable.reads_per_sec / total_reads_per_sec)
            num_entries = allocated_space / SPACE_PER_ENTRY  # space per entry = token + position + overhead
            if num_entries > sstable.max_index_summary_entries:
                sstable.num_index_summary_entries = sstable.max_index_summary_entries
                leftover_entries += num_entries - sstable.max_index_summary_entries
            else:
                sstable.num_index_summary_entries = num_entries
                sstables_to_downsample.add(sstable)
        
        # distribute leftover_entries among sstables_to_downsample based on read rates
        # (this probably ends up looking like a recursive or iterative function)
        

        Maybe we only rebuild the ones that are X% off of where they should be to make it lighter-weight.

        That's a good idea. (I was thinking of using a step function.) Instead of "X% off of where they should be", I would more precisely phrase that as "X% away from their previous proportion".

        Or if we're downsampling by more than 2x then we can just resample what we already have in memory instead of rebuilding "correctly."

        If you down-sample with a particular pattern, you can always down-sample using just the in-memory points; only up-samples need to read from disk.

        I'm trying to generalize the down-sampling pattern, but the two main points are (assuming 1% granularity):

        • For every 1% you down-sample, the number of points to remove from the in-memory summary is equal to 1% of the original (on-disk) count
        • Each 1% down-sampling run starts at a different offset to evenly space the down-sampling

        For example, to down-sample from 100% to 99%, you would remove every hundredth point, starting from index 0. To down-sample from 99% to 98%, you would remove every 99th point, starting from index 50. To down-sample from 98% to 97%, you would remove every 98th point, starting from index 24 or 74, and so on.

        Jonathan Ellis added a comment -

        Good start, but it seems a little fragile to me if a bunch of sstables are suddenly warmed up.

        What about this?

        We could define a fixed-size memory pool, similar to what we do for memtables or cache, and allocate it to the sstables proportional to their hotness. Every 15 minutes (which seems like a lot, maybe hourly?) we recalculate and rebuild the summaries. Maybe we only rebuild the ones that are X% off of where they should be to make it lighter-weight. Or if we're downsampling by more than 2x then we can just resample what we already have in memory instead of rebuilding "correctly."

        Tyler Hobbs added a comment -

        An initial idea for the implementation:

        Based on the recent (last 15m?) read rate (reads/sec), periodically down-sample the summary for SSTables which fall below the mean rate. The down-sampling rate could use a sliding scale based on the ratio of the mean to that SSTable's rate. As a basic example implementation, keep X% of the samples, where X = max(25, min(100, 100 * (rate / mean_rate))), so the coldest SSTables keep only 25% of the samples in memory.

        Presenting a way for the user to tune this (other than a simple on/off) is a little trickier. Perhaps make the min (default 25%) adjustable? Or start down-sampling at a configurable point (the default is the mean)? Those could also be automatically adjusted based on memory pressure.
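
        As a quick sketch of that basic rule (the names here are just for illustration):

        def keep_percent(reads_per_sec, mean_reads_per_sec):
            return max(25, min(100, 100 * reads_per_sec / mean_reads_per_sec))

        # an sstable at 10% of the mean read rate keeps the 25% floor, one at half the
        # mean keeps 50% of its samples, and anything at or above the mean keeps 100%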


          People

          • Assignee: Tyler Hobbs
          • Reporter: Jonathan Ellis
          • Reviewer: Jonathan Ellis
          • Votes: 0
          • Watchers: 6
