Details
Description
We can consider an SSTable as a set of partition keys, and 'compaction' as deduplication of those partition keys.
We want to find compaction candidates from SSTables that have as many same keys as possible. If we can group similar SSTables based on some measurement, we can achieve more efficient compaction.
One such measurement is Jaccard Distance,
which we can estimate using technique called MinHash.
In Cassandra, we can calculate and store MinHash signature when writing SSTable. New compaction strategy uses the signature to find the group of similar SSTable for compaction candidates. We can always fall back to STCS when such candidates are not exists.
This is just an idea floating around my head, but before I forget, I dump it here. For introduction to this technique, Chapter 3 of 'Mining of Massive Datasets' is a good start.
Issue Links
 relates to

CASSANDRA6216 Level Compaction should persist last compacted key per level
 Open
This JIRA has two parts. We can split that work in two dependent JIRAs also.
1) We need store x min hashes of all the rows in an stable. I am not sure whether Murmur3 is a good candidate.
2) Find an stable in Level L such that all its overlapping stables in level L+1 have maximum Jaccard Distance or have similar rows.
How does this sound?