This is probably not that interesting as a "proper" solution, but I am adding it for reference.
I needed to make room for more data, so I recently threw together a quick compression hack for supercolumns.
I considered compressing the index blocks as Jonathan suggested, but I could not make up my mind about how safe that would be with other code accessing the data, and I was a bit short on time, so I looked for something more isolated.
The final decision was to simply compress the serialized columns in a supercolumn (plus a bit of caching to avoid recompressing every time the serialized size is requested).
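For illustration, the compress-once-and-cache idea could look roughly like the sketch below. This is not the actual patch: the class name `CompressedSubcolumns`, the 4-byte length prefix, and the choice of `Deflater.BEST_SPEED` are all assumptions made for the example, and it only uses java.util.zip (not LZF).

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical holder for the already-serialized subcolumns of one supercolumn.
final class CompressedSubcolumns {
    private final byte[] serialized; // uncompressed serialized subcolumns
    private byte[] compressed;       // cached result, null until first use

    CompressedSubcolumns(byte[] serialized) {
        this.serialized = serialized.clone();
    }

    // Compress once and cache, so repeated size queries do not recompress.
    synchronized byte[] compressed() {
        if (compressed == null) {
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            try {
                deflater.setInput(serialized);
                deflater.finish();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                while (!deflater.finished()) {
                    int n = deflater.deflate(buf);
                    out.write(buf, 0, n);
                }
                compressed = out.toByteArray();
            } finally {
                deflater.end(); // release native zlib resources
            }
        }
        return compressed;
    }

    // Assumed on-disk layout: 4-byte original length, then the compressed payload.
    int serializedSize() {
        return 4 + compressed().length;
    }

    // Inverse operation; the caller supplies the original length read from the prefix.
    static byte[] decompress(byte[] data, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] result = new byte[originalLength];
            int off = 0;
            while (off < originalLength && !inflater.finished()) {
                int n = inflater.inflate(result, off, originalLength - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or invalid input");
                }
                off += n;
            }
            if (off != originalLength) {
                throw new IllegalStateException("unexpected decompressed length");
            }
            return result;
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
    }
}
```

The cache matters because the serialized size tends to be asked for separately from the actual write, so without it the same bytes would be compressed more than once.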
My data has supercolumns with typically 50-60 subcolumns, mostly small strings or numbers.
In total, the subcolumns make up 600-1200 bytes of data when serialized.
There are usually a fair number of supercolumns per row.
My test data set had 447 keys. I tested with the ning lzf jars and the default java.util.zip.
This is not necessarily a good test, but I think json2sstable is somewhat useful for measuring the relative impact of different implementations, even though it is not useful for determining real performance in any way.
In addition, I made a simple dictionary of column names (only applied to supercolumns), as the column names were not compressed very well when looking at just a single supercolumn at a time.
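As a rough sketch of what such a dictionary might look like (again, not the actual patch): each distinct column name gets a small integer id, and the id is what would be written in place of the full name. The class `ColumnNameDictionary` and its methods are hypothetical, and how the dictionary itself is persisted alongside the data is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dictionary mapping column names to small integer ids.
final class ColumnNameDictionary {
    private final Map<String, Integer> idsByName = new HashMap<>();
    private final List<String> namesById = new ArrayList<>();

    // Returns the existing id for a name, or assigns the next free one.
    synchronized int idFor(String name) {
        Integer id = idsByName.get(name);
        if (id == null) {
            id = namesById.size();
            idsByName.put(name, id);
            namesById.add(name);
        }
        return id;
    }

    // Reverse lookup used when reading; throws IndexOutOfBoundsException for unknown ids.
    synchronized String nameFor(int id) {
        return namesById.get(id);
    }

    synchronized int size() {
        return namesById.size();
    }
}
```

With 50-60 subcolumns whose names repeat across every supercolumn, replacing each name by a short id removes redundancy that a per-supercolumn compressor cannot see.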
The result of both the digest and compression:
Standard cassandra. json2sstable:
As a reference, the whole sstable files compresses as follows:
ning.com (command line)
gzip (command line)
I doubt this implementation has much merit for inclusion in a release. I just added the numbers for reference.
Of course, if requested, I could see if I could make the patch available somewhere.