Branch is here with a concept I think works, but I want some more opinions.
Sylvain Lebresne / Aleksey Yeschenko in particular, do you two feel this is safe?
Basically, while we build the ColumnIndex, we hold the List<IndexHelper.IndexInfo> columnsIndex in memory as a sorted list. At the time we call close(), we check whether either there are too many of those objects, or they use too much memory (large keys). If so, we downsample the list, in place, by a constant factor.
In ASCII art, with hypothetical names and offsets, each entry being [firstName, lastName, offset, width]:
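[aaa, ccc,   0, 100]
[ddd, fff, 100, 100]
[ggg, iii, 200, 100]
[jjj, lll, 300, 100]
[mmm, ooo, 400, 100]
[ppp, rrr, 500, 100]
[sss, uuu, 600, 100]
[vvv, xxx, 700, 100]
[yyy, zzz, 800, 100]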
If we set a target size of 3, we'd downsample this by a factor of 3 into:
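[aaa, iii,   0, 300]
[jjj, rrr, 300, 300]
[sss, zzz, 600, 300]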
On merge we keep the latest/last lastName + deletion marker, we keep the earliest/first firstName + offset, and we recalculate the width.
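To make the mechanics concrete, here's a minimal Java sketch under assumed names: IndexEntry is a hypothetical, simplified stand-in for IndexHelper.IndexInfo, and the entry-count / memory thresholds in maybeDownsample are made up, not the real column index targets.

import java.util.List;

// Hypothetical, simplified stand-in for IndexHelper.IndexInfo; the real class
// also carries a deletion marker, which the merge keeps from the LAST entry of
// each group, just like lastName.
final class IndexEntry
{
    final String firstName; // first name covered by the block
    final String lastName;  // last name covered by the block
    final long offset;      // start position of the block in the data file
    final long width;       // bytes the block spans

    IndexEntry(String firstName, String lastName, long offset, long width)
    {
        this.firstName = firstName;
        this.lastName = lastName;
        this.offset = offset;
        this.width = width;
    }
}

final class ColumnIndexDownsampling
{
    // Hypothetical thresholds; the idea is to call this from close(), before
    // the index is serialized.
    static void maybeDownsample(List<IndexEntry> index, int maxEntries, long memoryUsed, long maxMemory)
    {
        if (index.size() > maxEntries || memoryUsed > maxMemory)
            downsampleInPlace(index, 3); // constant factor
    }

    // Collapse each run of `factor` consecutive entries into one: keep the
    // first entry's firstName + offset and the last entry's lastName, and
    // recalculate the width to span the whole run. Merged entries are written
    // back over the front of the same list, then the leftover tail is trimmed.
    static void downsampleInPlace(List<IndexEntry> index, int factor)
    {
        int merged = 0;
        for (int start = 0; start < index.size(); start += factor)
        {
            int end = Math.min(start + factor, index.size()) - 1;
            IndexEntry first = index.get(start);
            IndexEntry last = index.get(end);
            long width = last.offset + last.width - first.offset;
            index.set(merged++, new IndexEntry(first.firstName, last.lastName, first.offset, width));
        }
        index.subList(merged, index.size()).clear();
    }
}

Fed the nine hypothetical entries above with a factor of 3, this produces exactly the three merged entries shown.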
Not to be ignored: this has the added bonus of making very large partitions much less painful if you crank down those column index targets. You pay a little more I/O, but get much less GC. Running this in a test cluster, I was able to do reads and writes against 80GB partitions with a 5GB heap, which is a pretty cool side effect.