The latter is better if you have cold data that may become hot again, but it's confusing if your workload is such that you can't keep up with all compaction but can keep up with the hot sstables. (The compaction backlog stat becomes useless, since we fall increasingly behind.)
The pending compactions stat is already pretty wonky, so I'm not sure we should prioritize keeping that sane.
Option 1 (don't compact cold sstables) seems dangerous as a first step compared to option 2, especially because it's hard to decide what is "cold". Prioritizing compaction of hotter sstables seems like the better first step.
When comparing hotness of sstables, I think a good measure is avg_reads_per_sec / number_of_keys rather than just avg_reads_per_sec so that large sstables aren't over-weighted. When I mention the hotness of a bucket of sstables below, I'm talking about the sum of the hotness measure across the individual sstables.
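To make the measure concrete, here is a minimal sketch. The `SSTableStats` type and its field names are hypothetical stand-ins for whatever per-sstable read metrics are actually available; they are not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTableStats:
    """Hypothetical per-sstable stats; names are assumptions for illustration."""
    name: str
    avg_reads_per_sec: float
    number_of_keys: int


def hotness(sstable):
    """Reads per second per key, so large sstables aren't over-weighted."""
    if sstable.number_of_keys == 0:
        return 0.0
    return sstable.avg_reads_per_sec / sstable.number_of_keys


def bucket_hotness(bucket):
    """Hotness of a bucket is the sum of the hotness of its members."""
    return sum(hotness(s) for s in bucket)


# The large sstable serves more reads in total, but the small one is hotter per key.
small = SSTableStats("small", avg_reads_per_sec=50.0, number_of_keys=1_000)
large = SSTableStats("large", avg_reads_per_sec=200.0, number_of_keys=100_000)
```

With these made-up numbers, ranking by raw avg_reads_per_sec would put the large sstable first, while the per-key measure puts the small one first.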
For prioritizing compaction of hotter sstables, it seems like there are a few levels this can operate at:
1. Picking sstable members for compaction buckets
2. Picking the most "interesting" bucket to submit to the compaction executor (currently the smallest sstables are considered the most interesting)
3. At the compaction executor level, prioritizing tasks in the queue (the queue is not currently prioritized)
(1) seems like the most difficult point to make good decisions at. I can imagine a scheme like dropping members that are more than 2 standard deviations below the mean hotness for the bucket working decently, but some of the efficiency of compacting many sstables at once is lost, and some of the drops would be poor when there is little variance among the sstables.
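A rough sketch of that dropping scheme, assuming the per-sstable hotness values have already been computed. The function name and the choice of population standard deviation are my own assumptions, not anything that exists today.

```python
from statistics import mean, pstdev


def drop_cold_members(hotness_by_sstable, num_stdevs=2.0):
    """Keep only members whose hotness is within num_stdevs below the bucket mean.

    hotness_by_sstable maps an sstable name to its hotness value.
    """
    values = list(hotness_by_sstable.values())
    if len(values) < 2:
        return dict(hotness_by_sstable)
    cutoff = mean(values) - num_stdevs * pstdev(values)
    return {name: h for name, h in hotness_by_sstable.items() if h >= cutoff}


# One clearly cold member among hot ones: dropping it looks reasonable.
clear_case = {f"hot{i}": 1.0 for i in range(7)}
clear_case["cold"] = 0.0

# Little variance: the member at 0.99 is also dropped, which is a poor decision.
low_variance = {f"s{i}": 1.0 for i in range(7)}
low_variance["slightly_colder"] = 0.99
```

One property of this cutoff worth noting: with a population standard deviation, no single value can be more than sqrt(n - 1) standard deviations from the mean of n values, so a bucket needs at least six members before the 2-stdev rule can drop anything at all.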
(2) would probably work well by itself, although, as discussed below, sstable overlap is a better measure than hotness for this.
(3) requires (2) in order to be somewhat fair: each table submits its hottest buckets for compaction, and the executor prioritizes the hottest buckets in the queue (regardless of which table they came from). There is a potential for starvation among colder tables when compaction falls behind, but that may be mitigated by a few things:
- If the compaction of the hotter sstables is very effective at merging rows, the hotness of future buckets for that table should be lower. Since the hotness of a bucket is the sum of its members, if four totally overlapping sstables are merged into one sstable, the hotness of the new sstable should be 1/4 of the hotness of the previous bucket. I'll point out that tracking how much overlap there is among sstables would be a much better measure than hotness for picking which compactions to prioritize; in the worst case here (no overlap), the hotness of the newly compacted sstable could be the same as the bucket it came from.
- If we were willing to discard cold items in the queue when hotter items came in and the queue was full, colder tables would eventually submit new tasks with more sstables in them (thus having greater hotness).
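To illustrate the executor-level idea, here is a minimal sketch of a bounded queue that hands out the hottest task first and discards the coldest task when a hotter one arrives and the queue is full. `HotnessQueue` and its methods are hypothetical; this is not how the compaction executor currently works, and a task is reduced to a table name plus a bucket hotness for brevity.

```python
import bisect
import itertools


class HotnessQueue:
    """Bounded queue of (hotness, table) tasks; hypothetical illustration only."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._tasks = []  # kept sorted ascending by (hotness, sequence number)
        self._seq = itertools.count()  # unique tie-breaker so tuples always compare

    def submit(self, table, hotness):
        """Queue a task. Returns False if it was rejected as the coldest."""
        entry = (hotness, next(self._seq), table)
        if len(self._tasks) >= self.capacity:
            if hotness <= self._tasks[0][0]:
                return False  # nothing colder to discard, so reject the new task
            self._tasks.pop(0)  # discard the coldest queued task
        bisect.insort(self._tasks, entry)
        return True

    def take(self):
        """Remove and return the hottest queued task as (table, hotness)."""
        hotness, _, table = self._tasks.pop()
        return table, hotness

    def __len__(self):
        return len(self._tasks)
```

A table whose task was discarded or rejected would resubmit later with more sstables in its bucket, and therefore a higher summed hotness, which is what eventually lets it compete with the hotter tables.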
While I'm thinking about it, do we have any tickets or features in place to track sstable overlap (beyond average number of sstables hit per read at the table level)?