Neither LCS nor STCS are good for data that becomes immutable over time. The most obvious case is for time-series data where the application can guarantee that out-of-order delivery (to Cassandra) of events can't take place more than N minutes/seconds/hours/days have elapsed after the real (wall time).
There are various approaches that could involve paying attention to the row keys (if they include a time component) and/or the column name (if they are TimeUUID or Integer based and are inherently time-ordered), but it might be sufficient to just look at the timestamp of the columns themselves.
A possible approach:
1) Define an optional max-out-of-order window on a per-CF basis.
2) Use normal (LCS or STCS) compaction strategy for any SSTables that include any columns younger than max-out-of-order-delivery).
3) Use alternate compaction strategy (will call it TWCS time window compaction strategy for now) for any SSTables that only contain columns older than max-out-of-order-delivery.
4) TWCS will only compact sstables containing data older than max-out-of-order-delivery.
5) TWCS will only perform compaction to reduce row fragmentation (if there is any by the time it gets to TWCS or to reduce the number of small sstables.
6) To minimize re-compaction in TWCS, it should aggresively try to compact as many small sstables as possible into one large sstable that would never have to get recompacted.
In the case of large datasets (e.g. 5TB per node) with LCS, there would be on the order of seven levels, and hence seven separate writes of the same data over time. With this approach, it should be possible to get about 3 compactions per column (2 in original compaction and one more once hitting TWCS) in most cases, cutting the write workload by a factor of two or more for high volume time-series applications.
Note that the only workaround I can currently suggest to minimize compaction for these workloads is to programatically shard your data across time-window ranges (e.g. new CF per week), but that pushes unnecessary writing and querying logic out to the user and is not as convenient nor flexible.
Also note that I am not convinced that the approach I've suggested above is the best/most general way to solve the problem, but it does appear to be a relatively easy one to implement.
- relates to
CASSANDRA-5685 time based compaction strategy