At first I really liked it, but then I realized a problem that takes away some of the appeal, and now I'm not sure. Anyway, first what I like: couple this with posix_fadvise()/DONTNEED on the sstables being switched away from, and one would not even need memory for both sets of sstables in order to remain hot, in cases where you rely on a cf being mostly or completely in memory.
The posix_fadvise() (and munlock(), if mlock()'ed sstables come into the picture in the future) would presumably be done at some granularity coarser than individual rows, or the calls would be far too frequent for performance purposes. But doing so every few tens of MB or so should be fine.
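For illustration, a minimal sketch (my own, not anything that exists in the codebase) of what the chunked eviction could look like; the 32 MB chunk size and the helper name are assumptions:

```c
/* Sketch: drop the page cache for an obsolete sstable file in coarse chunks,
 * so the fadvise calls stay infrequent. CHUNK_BYTES is an arbitrary example. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK_BYTES (32L * 1024 * 1024)

static int evict_obsolete_sstable(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == 0) {
        for (off_t off = 0; off < st.st_size; off += CHUNK_BYTES) {
            /* Tell the kernel this range will not be needed again. */
            posix_fadvise(fd, off, CHUNK_BYTES, POSIX_FADV_DONTNEED);
        }
    }
    close(fd);
    return 0;
}
```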
In addition, on the topic of rate limiting: fsync()'ing would still be required under some circumstances to avoid affecting read latencies too much (e.g. to avoid the OS pushing out more dirty data in a burst than fits in the battery-backed cache of a RAID controller).
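To make that concrete, a sketch of the idea (the 16 MB interval and the helper are made-up assumptions): force writes out periodically while the new sstable is being written, so the OS never accumulates a dirty-data burst larger than the controller's write-back cache.

```c
/* Sketch: write to the new sstable, fsync()'ing every SYNC_INTERVAL bytes
 * so dirty data is flushed in small, regular increments rather than bursts. */
#include <unistd.h>

#define SYNC_INTERVAL (16L * 1024 * 1024)   /* e.g. kept below the BBU cache size */

static ssize_t write_with_periodic_fsync(int fd, const void *buf,
                                         size_t len, size_t *since_sync)
{
    ssize_t n = write(fd, buf, len);
    if (n > 0) {
        *since_sync += (size_t)n;
        if (*since_sync >= SYNC_INTERVAL) {
            fsync(fd);              /* bound the size of any dirty-data burst */
            *since_sync = 0;
        }
    }
    return n;
}
```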
A big downside though is this: for workloads where performance depends on the warmness of the cache with respect to the active set, this way of doing it would still imply most of the negative effects of mass cache eviction. Any large cf with a significant warm-up period would be highly affected.
A possible way to categorize a cf might be:
(1) Very small cf; fits in RAM with lots of margin.
(2) Smallish, just barely fits in RAM.
(3) Large; a lot larger than RAM.
On the premise that we're discussing situations where cache warmth is relevant, the above cfs behave as follows with respect to an incremental switch-over:
(1) Works, but doesn't matter much since it fits in RAM anyway (except with multiple such sstables, but then see (2)).
(2) Here we improve significantly, since it lets us lower the constant factor of RAM required relative to domain data size.
(3) Doesn't work anyway due to eviction on writes.
So really, it seems to me that for situations where you need a reasonably high rate of compaction, it would only work very well in case (2), which is sort of a special case sitting in the middle of a spectrum.
You do point out that slow compaction is a potential helper here, and I agree. Provided that compaction is slow enough that the warm-up period of the node is similar to or less than the time spent compacting, this would indeed work well even in case (3).
I would further suggest that if you are IOPS sensitive you probably have a strong desire to limit compaction rate to something reasonable anyway.
It's not clear to me whether the trade-offs would tend to land on the side of it working well in practice or not.
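Concretely, by limiting the compaction rate I just mean something along these lines (a sketch with made-up numbers, not a claim about how Cassandra's throttling actually works or should work):

```c
#include <stddef.h>
#include <time.h>

#define TARGET_BYTES_PER_SEC (10L * 1024 * 1024)   /* e.g. 10 MB/s, made up */

/* Sleep for the time this chunk "should" have taken at the target rate,
 * on top of however long the write itself took, so the long-run compaction
 * rate stays at or below TARGET_BYTES_PER_SEC. */
static void throttle_compaction_chunk(size_t chunk_bytes)
{
    double secs = (double)chunk_bytes / TARGET_BYTES_PER_SEC;
    struct timespec ts = {
        .tv_sec  = (time_t)secs,
        .tv_nsec = (long)((secs - (double)(time_t)secs) * 1e9)
    };
    nanosleep(&ts, NULL);
}
```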
A reasonably realistic example of type (3) with concrete numbers (let me know if I'm taking a mis-step in the calculations):
(I am about to engage in some pretty speculative stuff that terminates with insufficient math skills on my part; you may want to just skip the remainder of this comment.)
Say you have a 500 GB CF, and 16 GB of page cache in the OS. Say you have a warm-up period of 30 minutes on a completely cold start before you're comfortable taking the load. Assume that you don't want more than a ~25% increase in IOPS (from cache misses) during a compaction, relative to the level of warmness you reach after your 30-minute warm-up on node start.
Eviction will tend to be random relative to the frequency/recency of access. So an instant eviction of some percentage of the page cache should result in a proportional (by some factor) percentage increase in IOPS.
Assume that your workload is such that 90% of reads are served from cache. Evicting some fraction of the cache turns that same fraction of the formerly cached reads into misses, on top of the 10% that already miss, so the factor in question is roughly 10: a 10% eviction should result in roughly a 100% increase in IOPS.
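Spelling that step out (my own back-of-envelope model, so please check it), with steady-state hit rate h and an evicted fraction p of the cache, assuming eviction is uniform with respect to access frequency:

```latex
\text{miss rate after eviction} \approx (1 - h) + p\,h
\qquad\Longrightarrow\qquad
\frac{\text{IOPS}_{\text{after}}}{\text{IOPS}_{\text{before}}}
  = \frac{(1 - h) + p\,h}{1 - h}
  = 1 + p\,\frac{h}{1 - h}
```

With h = 0.9 that ratio is 1 + 9p, so a 10% eviction gives roughly a 90% (call it 100%) increase in IOPS, which is where the factor of ~10 comes from.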
Now, if cache hit rates recovered linearly over time, a 25% target IOPS increase during compaction would translate into a maximum eviction of 2.5% of the cache over the 30-minute window. But here is where we become dependent on the distribution of reads, and unfortunately where my math skills fail me.
But at least in the worst possible (unrealistic) case, that 2.5% per 30 minutes translates, with a 16 GB page cache, into 400 MB per 30 minutes. Compacting 500 GB would thus take roughly 26 days. Of course this is utterly unrealistic, but it should be an upper bound. Anyone with more math skills want to chime in on the expected behavior given a long-tail distribution (for example) where 30 minutes translates into the 90% hit rate?
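For reference, the arithmetic behind the 26 days (assuming 1 GB = 1024 MB and, as the worst case above, that every compacted byte evicts a still-useful cached byte):

```latex
0.025 \times 16\,\mathrm{GB} = 400\,\mathrm{MB} \text{ evicted per 30-minute window}
\\
\frac{500\,\mathrm{GB}}{400\,\mathrm{MB}/\text{window}}
  = 1280 \text{ windows} \times 30\,\mathrm{min} = 640\,\mathrm{h} \approx 26.7 \text{ days}
```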