I have seen several use cases lately that lead me to agree that we should consider other compaction strategies. Some of the factors a compaction strategy might optimize are:
1. Number of blocks read concurrently for a single query
2. Number of times a key/value pair is written to disk
3. Total number of files stored in HDFS
4. Efficiency of deleting data
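To make factor 1 concrete, here is a rough sketch that counts how many files a single range query has to open. It assumes each file can be summarized by the first and last key it contains; `KeyRange`, `files_touched`, and the sample ranges are hypothetical names and data for illustration, not part of any existing API.

```python
from typing import List, Tuple

# Hypothetical model: a file is summarized by the (first_key, last_key) it covers.
KeyRange = Tuple[str, str]


def files_touched(files: List[KeyRange], query: KeyRange) -> int:
    """Count files whose key range overlaps the inclusive query range."""
    q_start, q_end = query
    return sum(1 for first, last in files if first <= q_end and q_start <= last)


# Files with mostly disjoint ranges: a narrow query opens few of them.
disjoint_files = [("2013-01", "2013-02"), ("2013-03", "2013-04"), ("2013-05", "2013-06")]

# Files that each span most of the key space: every query opens all of them.
overlapping_files = [("a", "z"), ("b", "y"), ("a", "x")]

narrow_hits = files_touched(disjoint_files, ("2013-03", "2013-03"))
wide_hits = files_touched(overlapping_files, ("m", "n"))
```

In this toy model, merging files only lowers the count when their ranges overlap the same queries, which is why the layout of the data matters as much as the number of files.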
Some of the additional use cases I've seen that would lead to different optimal compaction algorithms are:
1. Time-series data and log data that is stored in roughly temporal order. In these cases, once a record is written its "neighborhood" (things that sort close by) is not updated. Compacting frequently does little for factor 1, since the key ranges of the files generated by minor compactions are already mostly distinct.
2. Use of one locality group at a time. This could be done to add features to existing rows as the result of an ML process or something like it. With our current strategy, we are compacting files together that have completely distinct locality groups. This doesn't help with factors 1 and 4, and hurts factor 2.
3. Inverted indexing and graph storage with an expiration date or age-off. I think this is part of the use case Eric refers to. In this case, data is written in essentially random order, but is deleted in temporal order. We could get tricky and optimize factor 4 at some cost to factors 1, 2, and 3.
4. Document-partitioned indexing with really big tablets. In this case, we end up relying more on the log-structured merge tree to sort data than on the bucket sorting that comes with organic tablet splits. Non-uniform updates across the tablet space could be optimized by having the big major compactions output multiple files whose ranges are non-overlapping. Basically, when we do a major compaction to fold in lots of small files that cover a narrower range than the whole tablet, we don't want to have to rewrite the data from the entire tablet. The benefit of this potential optimization grows with frequent updates, deletions, and aggregation in a sub-range of a tablet.
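As a rough illustration of the last use case, the sketch below picks only the files that overlap a hot sub-range and compares the bytes rewritten against a whole-tablet major compaction. It assumes an earlier major compaction already left the big files with non-overlapping ranges; `FileInfo`, `select_for_subrange`, and the file names and sizes are made up for the example and do not correspond to an existing API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical per-file summary: key range covered and size on disk."""
    name: str
    first_key: str
    last_key: str
    size_bytes: int


def select_for_subrange(files: List[FileInfo], start: str, end: str) -> List[FileInfo]:
    """Return the files whose key range overlaps the inclusive range [start, end]."""
    return [f for f in files if f.first_key <= end and start <= f.last_key]


def bytes_rewritten(files: List[FileInfo]) -> int:
    """Total bytes a compaction over these files would have to read and rewrite."""
    return sum(f.size_bytes for f in files)


tablet_files = [
    # Large, non-overlapping outputs of an earlier major compaction (assumed).
    FileInfo("big1", "a", "h", 1000),
    FileInfo("big2", "i", "p", 1000),
    FileInfo("big3", "q", "z", 1000),
    # Small files from recent updates concentrated in one sub-range.
    FileInfo("small1", "j", "k", 10),
    FileInfo("small2", "k", "m", 10),
    FileInfo("small3", "j", "n", 10),
]

selected = select_for_subrange(tablet_files, "j", "n")
partial_cost = bytes_rewritten(selected)    # only big2 plus the small files
full_cost = bytes_rewritten(tablet_files)   # whole-tablet major compaction
```

With these made-up sizes, the sub-range compaction rewrites about a third of what a full major compaction would, and the gap widens as the tablet gets bigger relative to the hot range.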