Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.0.3
    • Component/s: Core
    • Labels:
      None

      Description

      I see two options:

      1. Don't compact cold sstables at all
      2. Compact cold sstables only if there is nothing more important to compact

       The latter is better if you have cold data that may become hot again... but it's confusing if you have a workload such that you can't keep up with all compaction, but you can keep up with the hot sstables. (The compaction backlog stat becomes useless since we fall increasingly behind.)

       Attachments

       1. 6109-v3.patch
        26 kB
        Tyler Hobbs
      2. 6109-v2.patch
        21 kB
        Tyler Hobbs
      3. 6109-v1.patch
        7 kB
        Tyler Hobbs

        Issue Links

          Activity

          Jonathan Ellis added a comment -

          Committed, disabled in 2.0.x and at 5% in 2.1.

          Bikeshedded option name to cold_reads_to_omit.

          Also added back generation to the first sstable sort to make it 100% deterministic.

          Tyler Hobbs added a comment -

          By the way, it might be a good idea to push this to 2.1 or to disable it by default in 2.0.x. It's a bit of a major change for a bugfix release this far into 2.0.x.

          Tyler Hobbs added a comment -

          6109-v3.patch (and branch) uses the new strategy for filtering cold sstables.

          Feel free to bikeshed on the config option name.

          Jonathan Ellis added a comment -

          Makes sense.

          Tyler Hobbs added a comment -

          When they make up more than X% do we stop discriminating or merge them only with other cold sstables?

          I was thinking we would stop discriminating. The logic would basically be this:

           total_reads = sum(sstable.reads_per_sec for sstable in sstables)
           total_cold_reads = 0
           cold_sstables = set()
           # walk sstables coldest-first (lowest reads per key per second)
           for sstable in sorted(sstables, key=lambda sstable: sstable.reads_per_key_per_sec):
               # keep marking sstables cold while the cold set stays below the
               # configurable fraction of the table's total reads/sec
               if (sstable.reads_per_sec + total_cold_reads) / total_reads < configurable_threshold:
                   cold_sstables.add(sstable)
                   total_cold_reads += sstable.reads_per_sec
               else:
                   break

           # bucket only the sstables that were not marked cold
           getBuckets(sstable for sstable in sstables if sstable not in cold_sstables)
          
          Jonathan Ellis added a comment -

          That sounds pretty straightforward.

          When they make up more than X% do we stop discriminating or merge them only with other cold sstables?

           Tyler Hobbs added a comment - edited

          On second thought, filtering buckets (my misinterpretation of your statement) would be very conservative and would generally have little effect since there is no control over what sstables form buckets.

          Here's my latest proposal: ignore the coldest SSTables until they (collectively) make up more than X% of the total reads/sec.

          This is pretty easy to tune and it handles all of the corner cases we've considered. A conservative default might be 1 to 5%.

           Tyler Hobbs added a comment - edited

           No, I'm suggesting instead of getBuckets(sstables), getBuckets(sstable for sstable in sstables if recent_reads_from(sstable) > X)

          Ah, well that scheme has some problematic cases:

          • Many cold sstables that collectively make up a large percentage of reads may be ignored (like your 10, 1, 1, 1... case above)
          • It's possible to have no sstables that cross the threshold when they are equally hot
          Jonathan Ellis added a comment -

          you're suggesting ignoring buckets?

           No, I'm suggesting instead of getBuckets(sstables), getBuckets(sstable for sstable in sstables if recent_reads_from(sstable) > X)

           Tyler Hobbs added a comment - edited

          What if we just added a bucket filter that said, SSTables representing less than X% of the reads will not be bucketed?

          To be clear, you're suggesting ignoring buckets whose reads make up less than X% of the total reads/sec for the table, correct?

          Straightforward to tune and I can't think of any really pathological cases, other than where size-tiering just doesn't put hot overlapping sstables in the same bucket.

          This is definitely easier to tune.

          One case I'm concerned about is where the max compaction threshold prevents a bucket from ever being above X% of the total reads/sec, especially with new, small SSTables. If we compare reads per key per second instead of just reads/sec, that case goes away. Additionally, while comparing reads/sec would focus compactions on the largest SSTables, comparing reads per key per second would focus on compacting the hottest SSTables, which is an improvement. With that change, I really like this strategy.

           As far as the default threshold goes, I'll suggest a conservative 2 to 5%. Here's my thought process: there are usually roughly 5 tiers, so each tier should get about 20% of the total reads per key per second if all SSTables were equally hot. Cold sstables should see less than 10 to 25% of a normal tier's read rate, which gives a 2 to 5% threshold.
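
           A quick back-of-the-envelope check of that range (the tier count and percentages come straight from the estimate above; nothing here is measured):

               tiers = 5
               share_per_tier = 1.0 / tiers                 # ~20% of reads per key per sec per tier
               cold_low, cold_high = 0.10, 0.25             # "cold" = 10-25% of a normal tier's rate
               print(round(share_per_tier * cold_low, 3),   # 0.02 -> 2%
                     round(share_per_tier * cold_high, 3))  # 0.05 -> 5%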

          Jonathan Ellis added a comment -

          What if we just added a bucket filter that said, SSTables representing less than X% of the reads will not be bucketed? Straightforward to tune and I can't think of any really pathological cases, other than where size-tiering just doesn't put hot overlapping sstables in the same bucket. (Which I think is out of scope to solve here – we need cardinality estimation to fix that.)

          Tyler Hobbs added a comment -

          I've spent some more time thinking about this and it seems like we either need a more sophisticated approach in order to handle the various corner cases or we need to disable this feature by default.

          If we disable the feature by default, then using a hotness percentile or something similar might be okay.

          If we want to enable the feature by default, I've got a couple of more sophisticated approaches:

          The first approach is fairly simple and uses two parameters:

          • SSTables which receive less than X% of the reads/sec per key of the hottest sstable (for the whole CF) will be considered cold.
           • If the cold sstables make up more than Y% of the total reads/sec, don't consider the warmest of the cold sstables cold. (In other words, go through the "cold" bucket and remove the warmest sstables until the cold bucket makes up less than Y% of the total reads/sec.)

          This solves one problem of basing coldness on the mean rate, which is that if you have almost all cold sstables, the mean will be very low. Comparing against the max deals well with this. The second parameter acts as a hedge for the case you brought up where a large number of cold sstables can collectively account for a high percentage of the total reads.
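
           A minimal sketch of that two-parameter filter (attribute names follow the pseudocode elsewhere in this ticket; x and y are fractions such as 0.05, and this is illustrative rather than the committed code):

               def filter_cold(sstables, x, y):
                   hottest = max(s.reads_per_key_per_sec for s in sstables)
                   total_reads = sum(s.reads_per_sec for s in sstables)
                   # parameter 1: cold = per-key hotness below X% of the hottest sstable's rate
                   cold = [s for s in sstables if s.reads_per_key_per_sec < x * hottest]
                   # parameter 2: if the cold set exceeds Y% of the total reads/sec, un-mark the
                   # warmest of the cold sstables until it no longer does
                   cold.sort(key=lambda s: s.reads_per_key_per_sec)
                   while cold and sum(s.reads_per_sec for s in cold) > y * total_reads:
                       cold.pop()
                   cold_set = set(cold)
                   return [s for s in sstables if s not in cold_set]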

           The second approach is less hacky but more difficult to explain or tune; it's a bucket optimization measure that covers these concerns. Ideally, we would optimize two things:

          • Average sstable hotness of the bucket
          • The percentage of the total CF reads that are included in the bucket

           These two items are somewhat in opposition. Optimizing only for the first measure would mean just compacting the two hottest sstables. Optimizing only for the second would mean compacting all sstables. We can combine the two measures with different weightings to get a pretty good bucket optimization measure. I've played around with some different measures in Python and have a script that makes approximately the same bucket choices I would. However, as I mentioned, this would be pretty hard for operators to understand and tune intelligently, somewhat like phi_convict_threshold. If you're still open to that, I can attach my script with some example runs.
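
           One possible shape for such a combined measure (purely illustrative; this is not the script mentioned above, and the weight w is an arbitrary knob):

               def bucket_score(bucket, all_sstables, w=0.5):
                   # normalize average per-key hotness against the hottest sstable so both
                   # terms land in [0, 1] before weighting
                   max_hotness = max(s.reads_per_key_per_sec for s in all_sstables)
                   avg_hotness = sum(s.reads_per_key_per_sec for s in bucket) / len(bucket) / max_hotness
                   read_coverage = sum(s.reads_per_sec for s in bucket) / sum(s.reads_per_sec for s in all_sstables)
                   return w * avg_hotness + (1 - w) * read_coverage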

          Tyler Hobbs added a comment -

          Suppose for instance that I have 11 sstables, one of which has 10M reads recently and 10 of which have 1M reads. If I set my threshold to 25% then nothing gets compacted which is probably not what we want, since the 10 "cold" sstables collectively represent 50% of the read activity.

          Actually, in this case none of the sstables would be considered cold (assuming they all have similar key estimates). The mean reads would be 1.8M, and 0.25 * 1.8M = 0.45M.

          I agree that it might be difficult to tune intelligently, though.

          analyze hotness globally (per-CF) rather than per-bucket

          That seems reasonable to me.

          configure the threshold based on hotness percentile (compact me if I am hotter than N% of my peers)

          This has the problem of always ignoring the coldest sstable even when there is little variation between them. So if you have four SSTables with 1M, 1M, 1M, and 0.999M reads, the last will be considered cold and never compacted.

          Jonathan Ellis added a comment -

          I'm thinking about how I tune this as an operator. If we're going by coldness-relative-to-mean, I'm not really sure where to set that to achieve my read performance goals other than trial and error.

          Suppose for instance that I have 11 sstables, one of which has 10M reads recently and 10 of which have 1M reads. If I set my threshold to 25% then nothing gets compacted which is probably not what we want, since the 10 "cold" sstables collectively represent 50% of the read activity.

          What if instead we

          1. analyze hotness globally (per-CF) rather than per-bucket, and
          2. configure the threshold based on hotness percentile (compact me if I am hotter than N% of my peers)
          Tyler Hobbs added a comment -

          +1 on your changes

          Jonathan Ellis added a comment -

           Sorry, I had to bikeshed just a little more. It looks like it's cleaner to limit prepBuckets (renamed) to restricting by max sstables + coldness, and let mostInteresting take care of dropping too-small buckets. This means we'll do a small amount of unnecessary sorting-by-coldness, but that's negligible compared to the actual compaction we're setting up.

          Pushed to https://github.com/jbellis/cassandra/commits/6109

          Tyler Hobbs added a comment -

          6109-v2.patch (and branch) adds a 'coldness_threshold' option for STCS and adds unit tests (fixing a couple of bugs that the tests exposed).

          Jonathan Ellis added a comment -

          Looks reasonable.

          • Can you make the actual drop-from-bucket a strategy option?
          • Can you add a test that exercises prepBuckets to STCSTest, a la the getBuckets test?
          Tyler Hobbs added a comment -

          6109-v1.patch (and branch) picks the hottest sstables for buckets, ignores relatively cold sstables, and then picks the hottest overall bucket. I agree about scope creep, so no changes were made to the compaction manager queue.

          Jonathan Ellis added a comment -

          Ah, right.

          (Since earlier tasks in the queue, or subsequent flushes, may cause us to re-evaluate the sstables we'd like to compact.)

          Feels to me like we're feature creeping a bit. I'd say let's focus this one on what to do within a CF, and we can open another to prioritize across CFs instead of FIFO.

          Tyler Hobbs added a comment -

          We're mostly talking about prioritizing across different CFs, right?

          Correct.

          What if we just made the compaction manager queue a priority queue instead of FIFO?

           Yeah, that's what I'm trying to do; it's just that, with the current behavior, we can't determine the priority when inserting tasks into the queue because the sstables aren't picked until tasks are removed from the queue.

          Jonathan Ellis added a comment -

          We're mostly talking about prioritizing across different CFs, right?

          What if we just made the compaction manager queue a priority queue instead of FIFO?

          Tyler Hobbs added a comment -

          There's a problem with prioritizing the compaction manager queue: for normal compactions, we enqueue a task that will actually pick the sstables to compact at the last moment, right before the task is run. I think we have three options:

          1. Don't try to prioritize the compaction manager queue
          2. Pick the sstables upfront (maybe only for STCS and not LCS? This behavior was added for CASSANDRA-4310, which is primarily concerned with LCS) and potentially compact a less-than-optimal set of sstables
          3. Prioritize the task when it's submitted by picking an initial bucket of sstables; finalize the bucket, adding sstables if necessary, just before the task is executed

          I would lean towards #3, although it's the most complex. I just wanted to hear your thoughts before writing that up.
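
           For illustration only, a minimal sketch of what option #3 could look like with a priority queue; submit(), pick_bucket(), finalize_bucket(), and compact() are hypothetical names, not the actual CompactionManager API:

               import heapq, itertools

               queue, counter = [], itertools.count()   # min-heap; the counter breaks priority ties

               def submit(strategy):
                   provisional = strategy.pick_bucket()  # initial bucket, used only for ordering
                   priority = -sum(s.reads_per_key_per_sec for s in provisional)
                   heapq.heappush(queue, (priority, next(counter), strategy, provisional))

               def run_next():
                   _, _, strategy, provisional = heapq.heappop(queue)
                   final_bucket = strategy.finalize_bucket(provisional)  # add sstables if necessary,
                   compact(final_bucket)                                 # just before execution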

          Jonathan Ellis added a comment -

          SGTM.

          Tyler Hobbs added a comment -

          I think I have some clearer ideas about how to do this now. We should be able to combine hotness and overlap concerns at the different levels.

          At level (1), avoid compacting comparatively cold data by dropping sstables from buckets when their hotness is less than, say, 25% of the bucket average (this avoids the low-variance problem of using the stddev). If the bucket falls below the min compaction threshold, ignore it (to make sure we're compacting enough sstables at once).

          At level (2), submit the hottest bucket to the executor for compaction.

           The average number of sstables hit per read is actually a decent measure for prioritizing compactions at the executor level. At level (3), we can combine that with the bucket hotness to get a rough idea of how many individual sstable reads per second we could save by compacting a given bucket (hotness * avg_sstables_per_read). Prioritize compaction tasks in the queue based on this measure.

          That should give us a nice balance of not compacting cold data and prioritizing compaction of the most read and most fragmented sstables.
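
           A hedged sketch of levels (1) and (3) as described here (function and attribute names are assumptions for illustration, not the actual implementation):

               def prune_bucket(bucket, min_threshold, drop_ratio=0.25):
                   # level (1): drop sstables colder than 25% of the bucket's average hotness
                   avg_hotness = sum(s.reads_per_key_per_sec for s in bucket) / len(bucket)
                   kept = [s for s in bucket if s.reads_per_key_per_sec >= drop_ratio * avg_hotness]
                   # ignore the bucket entirely if it falls below the min compaction threshold
                   return kept if len(kept) >= min_threshold else None

               def task_priority(bucket, avg_sstables_per_read):
                   # level (3): rough estimate of individual sstable reads/sec saved by
                   # compacting this bucket (hotness * avg_sstables_per_read)
                   return sum(s.reads_per_key_per_sec for s in bucket) * avg_sstables_per_read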

           Jonathan Ellis added a comment - edited

          I guess whether hotness or overlap is a more important criterion depends on your goal:

          1. prioritizing by hotness helps speed reads up more, especially when you have a lot of cold data sitting around
          2. prioritizing by overlap ratio reduces disk space and helps throw away obsolete cells faster

          I was hoping to tackle #1 here, but maybe that needs a separate strategy a la CASSANDRA-5561.

          For #2, CASSANDRA-5906 adds a HyperLogLog component that does a fantastic job of letting us estimate overlap ratios.

          Tyler Hobbs added a comment -

          The latter is better if you have cold data that may become hot again... but it's confusing if you have a workload such that you can't keep up with all compaction, but you can keep up with hot sstable. (Compaction backlog stat becomes useless since we fall increasingly behind.)

          The pending compactions stat is already pretty wonky, so I'm not sure we should prioritize keeping that sane.

          Option 1 (don't compact cold sstables) seems dangerous as a first step compared to option 2, especially because it's hard to decide what is "cold". Prioritizing compaction of hotter sstables seems like the better first step.

          When comparing hotness of sstables, I think a good measure is avg_reads_per_sec / number_of_keys rather than just avg_reads_per_sec so that large sstables aren't over-weighted. When I mention the hotness of a bucket of sstables below, I'm talking about the sum of the hotness measure across the individual sstables.
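
           A minimal sketch of that measure, assuming sstable objects expose avg_reads_per_sec and a key-count estimate (attribute names are illustrative, not Cassandra's actual fields):

               def sstable_hotness(sstable):
                   # reads/sec normalized by key count so large sstables aren't over-weighted
                   return sstable.avg_reads_per_sec / sstable.number_of_keys

               def bucket_hotness(bucket):
                   # the hotness of a bucket is the sum of its members' hotness
                   return sum(sstable_hotness(s) for s in bucket)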

          For prioritizing compaction of hotter sstables, it seems like there are a few levels this can operate at:

          1. Picking sstable members for compaction buckets
          2. Picking the most "interesting" bucket to submit to the compaction executor (currently the smallest sstables are considered the most interesting)
          3. At the compaction executor level, prioritizing tasks in the queue (the queue is not currently prioritized)

           (1) seems like the most difficult point to make good decisions at. I can imagine a scheme like dropping members whose hotness is more than 2 * stdev below the bucket's mean hotness working decently, but some of the efficiency of compacting many sstables at once is lost, and some of the drops would be poor when there is little variance among the sstables.

          (2) would probably work well by itself, although, as discussed below, sstable overlap is a better measure than hotness for this.

          (3) requires (2) to be somewhat fair. Each table submits its hottest buckets for compaction, and the executor prioritizes the hottest buckets in the queue (regardless of which table they came from). There is a potential for starvation among colder tables when compaction falls behind, but that may be mitigated by a few things:

          • If the compaction of the hotter sstables is very effective at merging rows, the hotness of future buckets for that table should be lower. Since the hotness of a bucket is the sum of its members, if four totally overlapping sstables are merged into one sstable, the hotness of the new sstable should be 1/4 of the hotness of the previous bucket. I'll point out that tracking how much overlap there is among sstables would be a much better measure than hotness for picking which compactions to prioritize; in the worst case here (no overlap), the hotness of the newly compacted sstable could be the same as the bucket it came from.
          • If we were willing to discard cold items in the queue when hotter items came in and the queue was full, colder tables would eventually submit new tasks with more sstables in them (thus having greater hotness).

          While I'm thinking about it, do we have any tickets or features in place to track sstable overlap (beyond average number of sstables hit per read at the table level)?


            People

            • Assignee: Tyler Hobbs
            • Reporter: Jonathan Ellis
            • Reviewer: Jonathan Ellis
            • Votes: 0
            • Watchers: 3
