Sorry for the late update.
Yes, we have a cluster with some 20-30 billion columns (maybe even closer to 40 billion by now) which implements a column name map and has been in production for about 2 years.
I was actually looking at committing this 2 years ago together with a fairly large number of other changes which were implemented in the column/supercolumn serializer code, but I never got around to implementing a good way to push the sstable version numbers into the serializer to make things backwards compatible before focus moved resources elsewhere.
As mentioned above by others, while not benchmarked and proven, I had a very good feeling the total change helped quite a bit on GC issues, memtables and a bit on performance in general, but in terms of disk space, the benefit was somewhat limited after sstable compression was implemented as the repeating column names are compressed pretty well.
This is already 2 years ago (the cluster still runs by the way), but if memory serves me right:
30-40% reduction in disk space without compression
10% reduction on top of compression (I did a test after it was implemented).
In my case, the implementation is actually hardcoded due to time constraints.
A static map which is global for the entire cassandra installation.
If committing this into cassandra, I believe my plan was split in 3.
Possibly as 3 different implementation stages:
1. A simple config option (as a config file or as a columnfamily) where users themselves can assign repeating column names. Sure, it is not as fancy as many other options, but maybe we could open up to cover some strange corner case usages here with things like substrings as well.
Think options to cover complex versions of patterns like date/times such as 20130701202020 where a large chunk of the column name repeats, but not all of it.
In the current implementation, if there is a mapping entry, it converts the string to a variable length integer which becomes the new column name. If there is no mapping entry, it stores the raw data.
In our case, we have <40 repeating column names so I never need more than a 1 byte varint.
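To illustrate the name-to-id idea, here is a minimal sketch in Python (not the actual Java serializer code). It assumes a plain 7-bits-per-byte unsigned varint, which may differ from the encoding Cassandra itself uses, and the map contents and the names `COLUMN_NAME_IDS` and `encode_name` are made up for the example. With fewer than 128 entries, every id fits in one byte.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 payload bits per byte, high bit set when more bytes follow."""
    if n < 0:
        raise ValueError("ids must be non-negative")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, position just past the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# Hypothetical static map, standing in for the hardcoded global one.
COLUMN_NAME_IDS = {b"created_at": 0, b"status": 1, b"user_id": 2}


def encode_name(name: bytes) -> tuple[bool, bytes]:
    """Return (True, varint id) for a mapped name, else (False, raw name)."""
    col_id = COLUMN_NAME_IDS.get(name)
    if col_id is None:
        return False, name
    return True, encode_varint(col_id)
```

How the mapped/raw distinction is recorded on disk is left out here; the sketch only shows the lookup and the size of the resulting id.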
I also modified the column format to add a "column feature bitmap" at the start of each column. This allowed me to turn on/off name/id mapping as well as things like TTL's and a handful of other meta data.
There are a bunch of 64 bit numbers in the column format which only hold their default value in 99.999% of all cases, and very often your column value is just an 8 byte int, a boolean or a short text entry. That is, in most cases the column meta data is many times larger than the value stored.
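A rough sketch of what such a "column feature bitmap" could look like, in Python rather than the real Java serializer. The flag names, the field widths and the one-byte id are all assumptions made for the example, not the format I actually used: one leading byte says whether the name is a map id or a raw string and whether a TTL follows, so fields that only hold their default are simply not written.

```python
import struct

# Hypothetical flag bits for the leading bitmap byte.
NAME_IS_ID = 0x01   # name is a one-byte map id instead of a raw string
HAS_TTL = 0x02      # a 4-byte TTL follows the name

# Hypothetical static name map (assumes fewer than 256 entries).
NAME_TO_ID = {b"created_at": 0, b"status": 1}
ID_TO_NAME = {v: k for k, v in NAME_TO_ID.items()}


def serialize_column(name: bytes, value: bytes, ttl: int = 0) -> bytes:
    flags = 0
    body = bytearray()
    col_id = NAME_TO_ID.get(name)
    if col_id is not None:
        flags |= NAME_IS_ID
        body.append(col_id)
    else:
        body += struct.pack(">H", len(name)) + name
    if ttl:                       # default TTL (0) costs no bytes
        flags |= HAS_TTL
        body += struct.pack(">i", ttl)
    body += struct.pack(">I", len(value)) + value
    return bytes([flags]) + bytes(body)


def deserialize_column(buf: bytes) -> tuple[bytes, bytes, int]:
    flags = buf[0]
    pos = 1
    if flags & NAME_IS_ID:
        name = ID_TO_NAME[buf[pos]]
        pos += 1
    else:
        (n,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        name = buf[pos:pos + n]
        pos += n
    ttl = 0
    if flags & HAS_TTL:
        (ttl,) = struct.unpack_from(">i", buf, pos)
        pos += 4
    (n,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    return name, buf[pos:pos + n], ttl
```

Because the bitmap travels with every column, a column written with a raw name and one written with an id both deserialize to the same name, which is why entries can be added to the map at any time.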
This would have been my first implementation. Mostly because I have a working implementation of it already and the mapping table would be very easy to move to a config file with just a list of column names read at cassandra startup, or stored in a similar way to column family and other internal config (just as another keyspace for config). Unfortunately, it is also a little bit of work to push such config data down to the serializer, at least as the code was organized 2 years ago.
Notice again, you do not need any sort of atomic handling of the updates to the map in any way in this implementation. You can add map entries at any time. The result after deserializing is always the same as column names can have a mix of raw and map id values thanks to the "column feature bitmap" that was introduced.
Entries that were stored as raw strings will eventually be replaced by ID's to the map as compaction cleans things up.
2. Auto learning feature with mapping table per sstable.
This would be stage 2 of the implementation.
When starting to create a new SSTable, build a sampling of the most frequently occurring column names and gradually start mapping them to ID's.
Add the mapping table to the end of the SSTable or in a separate .map file (similar to index files) at the completion of sstable generation.
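As a simplified sketch of the per-sstable learning step (Python, not Cassandra code): this version counts names over one full pass instead of sampling gradually while the sstable is written, and the function name and the `max_entries` / `min_count` parameters are hypothetical.

```python
from collections import Counter
from typing import Iterable


def build_sstable_name_map(
    column_names: Iterable[bytes],
    max_entries: int = 1000,
    min_count: int = 2,
) -> dict[bytes, int]:
    """Assign small ids to the most frequent column names seen for one sstable.

    The most frequent names get the lowest ids, so they also get the
    shortest varints. Names seen fewer than min_count times stay raw.
    """
    counts = Counter(column_names)
    frequent = [
        name
        for name, count in counts.most_common(max_entries)
        if count >= min_count
    ]
    return {name: idx for idx, name in enumerate(frequent)}
```

The resulting dict is what would be written out at the end of the sstable or into the separate .map file.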
The initial id mapping could be further improved by maintaining a global map of column names. This "global map" would not be used for serialization/deserialization. It would be used to pre-populate the value for a sstable and would only be statistics to optimize things further by reducing the number of mapping variances between sstables and reducing the number of raw values getting stored a bit more.
The id map would still be local to each sstable in terms of storage, but having such statistics would allow you to dramatically reduce the size of a potentially shared id cache across sstables where a lot of mapping entries would be identical.
Some may feel that we would run out of memory quickly or use a lot of extra disk with maps per sstable, but I guess that we only really need to deal with the top few thousand entries in each sstable and this would not be a problem to keep in an idmap cache in terms of size.
This is really just the top X recurring column names or column name sub patterns.
If you have more unique column entries than this in an sstable, this will probably not be the feature that will save the day anyway, as the benefit per column entry will be quite small vs. the overhead, and the entire feature should potentially disable itself automagically if there are no frequently repeating patterns.
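One possible way to decide whether the feature should switch itself off for an sstable, sketched in Python. The cost model is an assumption for illustration only: a mapped name costs a 1 or 2 byte id per use plus one copy of the name in the map, and the `top_n` and `min_saving_ratio` thresholds are made-up values.

```python
from collections import Counter
from typing import Iterable


def mapping_worthwhile(
    column_names: Iterable[bytes],
    top_n: int = 1000,
    min_saving_ratio: float = 0.05,
) -> bool:
    """Estimate whether id mapping saves enough name bytes to be worth enabling."""
    counts = Counter(column_names)
    raw_bytes = sum(len(name) * count for name, count in counts.items())
    if raw_bytes == 0:
        return False
    saved = 0
    for idx, (name, count) in enumerate(counts.most_common(top_n)):
        id_size = 1 if idx < 128 else 2       # assumes a 7-bit varint id
        table_cost = len(name) + id_size      # the entry is stored once in the map
        net = count * (len(name) - id_size) - table_cost
        if net > 0:                           # entries that cost more than they save stay raw
            saved += net
    return saved / raw_bytes >= min_saving_ratio
```

With a single name repeated a thousand times this returns True; with a thousand names that each occur once, no entry pays for its own map slot and it returns False.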
3. I had some ideas for moving the mapping up from the serializer to allow things like streaming entries including id maps between nodes, but things do indeed quickly get ugly and I do not remember clearly how I had planned to do this.
The reason I isolated the mapping function to the serializer is that it looked incredibly messy to move this further "up" in the stack. Column sorts, range scans, lookups...
Not fun at all and if the memtable is serialized anyway the memory consumption there and in disk cache is dramatically reduced.
Also... with a global static map here at startup time, I actually share the mapped strings across most columns in memory anyway, as I believe they all become pointers to my static compiled-in map (again, this gets a lot more trivial to make work very well if this is a startup config, but yes, a bit less user friendly).
I haven't looked at the cassandra code for way too long now.
Has it become easier to get to know sstable version numbers in the serializer class now?
I could maybe check if someone in the team here would like to take a stab at moving this to latest cassandra and commit it if the above implementation seems interesting.
Part of it should be really easy to port as long as we can get a bit more info into the serializer/deserializer.