Rebased most of Stu's latest. Changed getLiveSize to only add in waste from the allocator instead of double-counting the rest. Enabled MemoryMeter.omitSharedBufferOverhead, which is super untested.
CFS.getColumnFamily was getting passed an allocator but this doesn't actually do anything. (I removed the parameter.) Was this supposed to be used during counter reconcile somehow?
Passing allocator throughout the CF+SC+[Super|Counter|Deleted|Expiring]Column hierarchy is ugly and error-prone. (I found and fixed one error while rebasing, where a method taking an allocator parameter called the default addColumn instead of the addColumn-with-allocator.) Perhaps moving allocator to AbstractColumnContainer could fix this?
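To make the AbstractColumnContainer suggestion concrete, here is a minimal sketch of the idea, not the actual Cassandra classes. Allocator, CountingAllocator and ColumnContainer are hypothetical names invented for illustration. The container owns the allocator, so there is only one addColumn and no allocator-less overload to call by mistake.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the allocator abstraction under discussion.
interface Allocator {
    ByteBuffer allocate(int size);
}

// Counts allocations so it is visible which allocator was actually used.
final class CountingAllocator implements Allocator {
    int allocations = 0;

    public ByteBuffer allocate(int size) {
        allocations++;
        return ByteBuffer.allocate(size);
    }
}

// Hypothetical container that holds its allocator as a field instead of
// taking it as a parameter on every add method.
class ColumnContainer {
    private final Allocator allocator;
    private final List<ByteBuffer> columns = new ArrayList<>();

    ColumnContainer(Allocator allocator) {
        this.allocator = allocator;
    }

    // The only way to add a column: the value is always copied into
    // memory obtained from the container's own allocator.
    void addColumn(ByteBuffer value) {
        ByteBuffer copy = allocator.allocate(value.remaining());
        copy.put(value.duplicate()); // duplicate() leaves the caller's position untouched
        copy.flip();
        columns.add(copy);
    }

    int size() {
        return columns.size();
    }

    ByteBuffer get(int i) {
        return columns.get(i).duplicate();
    }
}
```

With this shape, the bug described above (a method with an allocator parameter falling through to the default addColumn) has nowhere to hide, at the cost of fixing the allocator per container rather than per call.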
Not thrilled with the current alternatives for moving slabs off-heap. Our options are to
- use allocateDirect with all the problems that relying on finalization brings (see: CASSANDRA-2521), as well as requiring users to manually tune the JVM direct buffer ceiling (or face a flood of System.gc calls courtesy of allocateDirect when the ceiling is reached).
- use JNA + manual free, which will require doing reference counting for memtables the way we do for sstables post-CASSANDRA-2521. Otherwise, if a thread that had the memtable in its list of historical memtables to merge from tries to read, you segfault. (This is NOT the same as the JNA 179 segfaults, which are fixed in 3.3.0.)
- stick with on-heap slabs
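For the JNA + manual free option, the reference counting could look roughly like the sketch below. RefCountedRegion is a hypothetical name, and the native free is replaced by a flag so the example runs without JNA. The point is only the protocol: a reader must acquire before touching the region, and the memory is freed when the last holder releases.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference-counted wrapper around a manually freed region.
final class RefCountedRegion {
    // Starts at 1: the reference held by the owning memtable.
    private final AtomicInteger refs = new AtomicInteger(1);
    private volatile boolean freed = false;

    // Readers call this before touching the region. Returns false if the
    // region has already been freed, in which case it must not be read.
    boolean acquire() {
        while (true) {
            int n = refs.get();
            if (n <= 0) {
                return false;
            }
            if (refs.compareAndSet(n, n + 1)) {
                return true;
            }
        }
    }

    // Called once per successful acquire, and once by the owner after flush.
    void release() {
        int n = refs.decrementAndGet();
        if (n == 0) {
            free();
        } else if (n < 0) {
            throw new IllegalStateException("released more often than acquired");
        }
    }

    // Placeholder for the real native free; here it only records the event.
    private void free() {
        freed = true;
    }

    boolean isFreed() {
        return freed;
    }
}
```

A reader that still holds the memtable in its historical list keeps the region alive past the flush; a reader that arrives too late gets false from acquire instead of a segfault.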
I'd say off-heap slabs don't matter that much, but they would make the promotion failure problems you saw go away completely.
I'm also not a big fan of slabbing everything in sight. Slabbing keys associated with memtables makes sense (and is done in my rebase). Row keys and column names during sstable build, I'm skeptical of: if your rows are small enough that they finish before new -> old promotion, then it doesn't matter. And if they are so large that they do not, then your rate of key allocation is glacial and again it shouldn't matter. But if we WERE to slab these, the right way to do it would be per-sstable, not per IndexSummary.
There is no logical unit of slabbing for key cache, so we shouldn't be doing that at all.
I have an alternative idea to reduce non-memtable fragmentation: adding region recycling post-flush. Once a slab has been promoted to old gen, it stays there and gets reused, instead of being GC'd and replaced with a slab in new gen again.
(This would also mitigate the main downside of allocateDirect.)
We'd still probably want some kind of delayed release of slabs so write load spikes don't permanently chew up your entire heap.
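A rough sketch of recycling plus delayed release, under assumed names (SlabPool, maxIdleMillis) and with the clock passed in explicitly so the behaviour is easy to follow. Flushed memtables hand their slabs back via recycle, new memtables take them via acquire, and a periodic task calls releaseIdle to drop slabs that have sat unused for too long.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical pool of recycled slabs with time-based release.
final class SlabPool {
    // A recycled slab plus the time it was returned to the pool.
    private static final class Idle {
        final ByteBuffer slab;
        final long since;

        Idle(ByteBuffer slab, long since) {
            this.slab = slab;
            this.since = since;
        }
    }

    private final int slabSize;
    private final long maxIdleMillis;
    // Oldest entries at the head, most recently recycled at the tail
    // (assumes recycle is called with non-decreasing timestamps).
    private final Deque<Idle> idle = new ArrayDeque<>();

    SlabPool(int slabSize, long maxIdleMillis) {
        this.slabSize = slabSize;
        this.maxIdleMillis = maxIdleMillis;
    }

    // Reuse the most recently recycled slab if there is one, so that
    // older slabs get a chance to age out; otherwise allocate a new one.
    synchronized ByteBuffer acquire() {
        Idle e = idle.pollLast();
        if (e != null) {
            e.slab.clear();
            return e.slab;
        }
        return ByteBuffer.allocate(slabSize);
    }

    // Called post-flush, once nothing references the slab's contents.
    synchronized void recycle(ByteBuffer slab, long nowMillis) {
        idle.addLast(new Idle(slab, nowMillis));
    }

    // Drop slabs idle for at least maxIdleMillis so a write spike does
    // not pin the heap forever. Returns how many slabs were released.
    synchronized int releaseIdle(long nowMillis) {
        int released = 0;
        while (!idle.isEmpty() && nowMillis - idle.peekFirst().since >= maxIdleMillis) {
            idle.pollFirst();
            released++;
        }
        return released;
    }

    synchronized int idleCount() {
        return idle.size();
    }
}
```

The same pool shape would work with allocateDirect buffers, since reuse means far fewer direct allocations for finalization to clean up. Recycling is only safe once no reader can still reach the old contents, which this sketch leaves to the caller.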