Indexes are the real problem we're going to have to deal with here.
We can't write the indexes first unless we can merge the columns we're indexing in memory. (The alternative is two passes: one to scan all the column names while writing the indexes, and another to do the full merge. Two passes is too high a cost to pay.)
But we can't merge the columns in a streaming fashion while keeping the index data in memory to spit out at the end, either. We just fixed a bug caused by exactly this approach in
CASSANDRA-208: it would limit the number of columns we support to a relatively small number, probably the low millions, depending on your column name size and how much memory you can throw at the JVM.
I think a hybrid approach is called for. If there are fewer than some threshold of columns (1000? 100000?) we merge in memory and write the index first, as we do now. Otherwise, we do a streaming merge and write the index to a separate file, similar to how we write the key index now. (In fact we could probably encapsulate this code as SSTableIndexWriter and use it in both places.)
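A minimal sketch of that threshold decision, to make the two paths concrete. The class name, the enum, and the particular cutoff value are all invented for illustration; the real threshold would need tuning against column name size and heap budget:

```java
// Hypothetical sketch of the hybrid index-placement decision.
// HybridIndexSketch, IndexPlacement, and INLINE_THRESHOLD are
// illustrative assumptions, not actual Cassandra code.
public class HybridIndexSketch {
    // Rows with at most this many columns get an in-memory merge and
    // an inline index; larger rows stream and index to a separate file.
    static final long INLINE_THRESHOLD = 100_000;

    enum IndexPlacement { INLINE, SEPARATE_FILE }

    static IndexPlacement placementFor(long estimatedColumnCount) {
        return estimatedColumnCount <= INLINE_THRESHOLD
                ? IndexPlacement.INLINE
                : IndexPlacement.SEPARATE_FILE;
    }
}
```

Either way the decision is made per row, so small rows keep the cheap inline layout and only pathological rows pay for the extra file.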
We don't want to always index in a separate file because (a) filesystems have limits too (we don't want one index file per row per columnfamily) and (b) we want to do streaming writes wherever possible, which means staying in the same file.
This approach will result in a little more seeking (between column index and sstable) than the two-pass inline approach, but merging in a single pass is worth the trade. (Remember that for large rows, reading the multiple input sstables will not be seek-free either once buffers max out. So we want to keep to a single pass for performance as well as simplicity.)
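The single-pass streaming merge described above is essentially a k-way merge over the sorted column iterators of the input sstables. A self-contained sketch, with invented names; it only deduplicates column names, whereas real compaction would reconcile the competing column values:

```java
import java.util.*;

// Illustrative k-way merge over sorted column-name iterators,
// done in a single pass with a min-heap. Not actual Cassandra code.
public class StreamingMergeSketch {
    static List<String> mergeSorted(List<Iterator<String>> inputs) {
        // Heap entries pair the current column name with its source iterator.
        PriorityQueue<Map.Entry<String, Iterator<String>>> heap =
                new PriorityQueue<>(Map.Entry.comparingByKey());
        for (Iterator<String> it : inputs)
            if (it.hasNext())
                heap.add(new AbstractMap.SimpleEntry<>(it.next(), it));

        List<String> merged = new ArrayList<>();
        String last = null;
        while (!heap.isEmpty()) {
            Map.Entry<String, Iterator<String>> head = heap.poll();
            // Collapse duplicate names; real compaction reconciles values here.
            if (!head.getKey().equals(last))
                merged.add(head.getKey());
            last = head.getKey();
            Iterator<String> src = head.getValue();
            if (src.hasNext())
                heap.add(new AbstractMap.SimpleEntry<>(src.next(), src));
        }
        return merged;
    }
}
```

Because each input is consumed strictly in order, memory usage is bounded by the number of input sstables, not the number of columns in the row, which is the whole point of the streaming path.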