Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

      Description

      The memtable design actively fights Java's GC design. Todd Lipcon gave a good explanation over on HBASE-3455.

      Attachments

      1. 2252-v4.txt
        105 kB
        Jonathan Ellis
      2. 2252-v3.txt
        92 kB
        Jonathan Ellis
      3. merged-2252.tgz
        26 kB
        Stu Hood
      4. ASF.LICENSE.NOT.GRANTED--0002-add-off-heap-MemtableAllocator-support.txt
        8 kB
        Jonathan Ellis
      5. ASF.LICENSE.NOT.GRANTED--0001-add-MemtableAllocator.txt
        37 kB
        Jonathan Ellis

        Activity

        Jonathan Ellis added a comment -

        Patches to add MemtableAllocator (based on Todd's MemStoreLAB) and off-heap allocation via FreeableMemory.

        I gave up on trying to do the allocation for this and interning during deserialize; the code got very tangled and felt fragile. Instead, we copy during Memtable.put, which feels right since there is no way to screw ourselves by forgetting to copy in a new code path. I think I'm willing to bite the extra copy for that.
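The copy-during-put approach described here can be sketched roughly as follows. This is a hypothetical simplification, not the attached patch: `SlabSketch` and its `allocate` method are illustrative names, and the real MemtableAllocator handles concurrency and region retirement differently.

```java
import java.nio.ByteBuffer;

// Minimal sketch of copy-on-put: every value entering the memtable is
// copied into a large slab region, so short-lived request buffers die
// young while the slab tenures as one contiguous object.
class SlabSketch {
    private static final int SLAB_SIZE = 1 << 20; // 1 MB regions (illustrative)
    private ByteBuffer current = ByteBuffer.allocate(SLAB_SIZE);

    synchronized ByteBuffer allocate(ByteBuffer value) {
        int len = value.remaining();
        if (len > SLAB_SIZE) {            // too large to slab: copy directly
            ByteBuffer big = ByteBuffer.allocate(len);
            big.put(value.duplicate());
            big.flip();
            return big;
        }
        if (len > current.remaining())    // region full: retire it, start fresh
            current = ByteBuffer.allocate(SLAB_SIZE);
        ByteBuffer slice = current.duplicate();
        slice.limit(slice.position() + len);
        current.put(value.duplicate());   // the one copy: request buffer -> slab
        return slice.slice();             // zero-based view of the copied bytes
    }
}
```

Because the copy happens at the single Memtable.put entry point, no new code path can accidentally retain a reference into a transient request buffer.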

        Todd Lipcon added a comment -

        Cool

        Jonathan Ellis added a comment -

        Improved handling of CounterUpdateColumns and added a special case for empty buffers.

        Jonathan Ellis added a comment -

        New patchset handles workloads with no overwrites as well.

        Stu points out that counter's reconcile() operation does allocation too; this does not handle that yet.

        Stu Hood added a comment -

        We've been developing a very similar patch in parallel: attached as 2252-alternate-v1.tgz.

        The key differences seem to be:

        • Original removes interning from deserialization, alternate encapsulates it
        • Alternate has support for allocating reconciled counter columns (the result of merging counters) in slabs
        • Original uses JNA for direct memory access, alternate uses ByteBuffer.allocateDirect
        • Alternate uses the allocation count of the Memtable's allocator to measure throughput for thresholds (less accurate, but necessary to track the true memory usage of counters)

        We've tested the alternate patch in a production setting with counter mutations to multiple column families: it reduced our fragmentation enough to double the time between promotion failures, but it does not fully eliminate fragmentation.

        Jonathan Ellis added a comment -

        I'll look deeper later, but one quick comment:

        Original uses JNA for direct memory access, alternate uses ByteBuffer.allocateDirect

        allocateDirect pointers are not freed until CMS runs finalizers, so this could result in much larger space used than intended.

        The JNA-based code isn't perfect either, though; I've seen heisenbug segfaults. I'm sure it's fixable.

        Jonathan Ellis added a comment -

        The JNA-based code isn't perfect though either, I've seen heisenbug segfaults

        Described the problem at http://java.net/jira/browse/JNA-179 – we'll need a patch to JNA to make it possible to do what we want here. So for now we should stick to allocating on-heap, which will at least give us the reduced fragmentation benefits.

        Stu Hood added a comment -

        Found a very obvious (and embarrassing) problem with the alternate patch... will post the fix tomorrow.

        Stu Hood added a comment -

        Uploading a v2 of the alternate patch... at some point late in development I switched all clone() calls to trim(), which was actually avoiding the copy the majority of the time.

        Stu Hood added a comment - edited

        I'd like to go ahead and merge these patches together onto jbellis's version. Things that I will cherry-pick from the alternate patch:

        • A generic Allocator interface, rather than referring to Memtables; there are various other places where we need to apply slabbing
        • The additional counter tests we added
        • Using the slab allocator's allocation count to determine throughput for memtables
        • Slabbing of keys

        The other realization we've had about slab allocation is that unless all sources of fragmentation are eliminated, slabbing actually causes promotion failures to happen earlier, since it is harder to promote a slab into a fragmented oldgen. The other sources of fragmentation we suspect are:

        • IndexSummaries (easily slabbed)
        • the key and row caches (row cache tackled in CASSANDRA-1969)

        These can probably be defragged in separate tickets, as long as we commit to fixing them.

        Jonathan Ellis added a comment -

        sounds reasonable

        Stu Hood added a comment -

        It looks like MemtableAllocator was accidentally merged into trunk?

        Stu Hood added a comment -

        Attaching a merged version of the two patches, as described earlier.

        I also went ahead and slabbed the IndexSummaries, and added naive slabbing to the keycache: we'll need a longer-term solution for slabbing caches, because of the overhead of internal fragmentation.

        The alternate patch has had a lot of burn-in time on our clusters: combined with slabbing the IndexSummaries and disabling the keycache, we saw our first oldgen compaction after 76 hours, with an average of over a week before compaction.

        Stu Hood added a comment -

        Rebased the merged version for the counter UUID changes.

        Stu Hood added a comment -

        If it is possible to merge this, but to leave it disabled for the 0.8 release, we would be in a much happier place: it is a very painful patch to maintain.

        Stu Hood added a comment -

        This will need a rebase post-CASSANDRA-2006. In particular, we'll need to include the bytes allocated by the allocator in the live size, and make sure the shared buffers/slabs are excluded by JAMM.

        Stu Hood added a comment -

        Rebased for trunk. I took the easy way out and just included the Allocator's allocation count in the memtable live size: this will overcount, but it accounts for fragmentation due to updates. I've started a patch for JAMM to add a mode that will allow us to ignore shared buffers: see jamm/buffer-behavior.

        Jonathan Ellis added a comment -

        JNA 3.3.0 has been released including the http://java.net/jira/browse/JNA-179 fixes.

        Jonathan Ellis added a comment -

        Rebased most of Stu's latest. Changed getLiveSize to only add in waste from the allocator instead of double-counting the rest. Enabled MemoryMeter.omitSharedBufferOverhead, which is super untested.

        CFS.getColumnFamily was getting passed an allocator but this doesn't actually do anything. (I removed the parameter.) Was this supposed to be used during counter reconcile somehow?

        Passing allocator throughout the CF+SC+[Super|Counter|Deleted|Expiring]Column hierarchy is ugly and error-prone. (I found and fixed one error while rebasing, where a method taking an allocator parameter called the default addColumn instead of the addColumn-with-allocator.) Perhaps moving allocator to AbstractColumnContainer could fix this?

        Not thrilled with the current alternatives for moving slabs off-heap. Our options are to

        • use allocateDirect with all the problems that relying on finalization brings (see: CASSANDRA-2521), as well as requiring users to manually tune the JVM direct buffer ceiling (or face a flood of System.GC calls courtesy of allocateDirect when the ceiling is reached).
        • use JNA + manual free, which will require doing reference counting for memtables the way we do for sstables post-CASSANDRA-2521. Otherwise if a thread that had the memtable in its list of historical memtables to merge from tries to read, you segfault. (This is NOT the same as the JNA 179 segfaults, which are fixed in 3.3.0.)
        • stick with on-heap slabs

        I'd say off-heap slabs don't matter that much, but they would make the promotion failure problems you saw go away completely.

        I'm also not a big fan of slabbing everything in sight. Keys associated with memtables make sense (and is done in my rebase). Row key and column names during sstable build, I'm skeptical of – if your rows are small enough that they finish before new -> old promotion, then it doesn't matter. And if they are so large they do not, then your rate of key allocation is glacial and again it shouldn't matter. But if we WERE to slab these, the right way to do it would be per-sstable, not per-IndexSummary.

        There is no logical unit of slabbing for the key cache; we shouldn't be doing that at all.

        I have an alternative idea to reduce non-memtable fragmentation: adding region recycling post-flush. Once you've promoted a slab into the old gen, it stays there, instead of being GC'd and replaced with a slab in the new gen again.

        (This would also mitigate the main downside of allocateDirect.)

        We'd still probably want some kind of delayed release of slabs so write load spikes don't permanently chew up your entire heap.
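The recycling idea above could look roughly like this. It is a hypothetical sketch under the assumptions stated in the comment (regions go back on a bounded free list after flush, once no reader can still see them); `RegionPool` and its cap are illustrative names, not part of the patch.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of post-flush region recycling: a bounded pool of
// slab regions. Reusing a region that already lives in the old gen avoids
// promoting a fresh new-gen slab into a fragmented old gen; the bound is
// the "delayed release" that keeps a write spike from pinning the heap.
class RegionPool {
    private static final int REGION_SIZE = 1 << 20;
    private static final int MAX_POOLED = 4; // release cap (illustrative)
    private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();

    ByteBuffer acquire() {
        ByteBuffer region = free.poll();
        if (region == null)
            return ByteBuffer.allocate(REGION_SIZE); // pool empty: fresh region
        region.clear(); // reset position/limit for reuse
        return region;
    }

    // Called after flush, once no reader can still reference the region.
    void release(ByteBuffer region) {
        if (free.size() < MAX_POOLED)
            free.offer(region);
        // else drop it and let the GC reclaim the spike
    }

    int pooled() { return free.size(); }
}
```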

        Stu Hood added a comment -

        And if they are so large they do not, then your rate of key allocation is glacial and again it shouldn't matter.

        Compaction builds up an IndexSummary slowly enough that I theorized it might be causing fragmentation... didn't get a chance to prove it though.

        There is no logical unit of slabbing for key cache, we shouldn't be doing that at all.

        Agreed. We actually ended up disabling the key cache and saw a nice boost in time-to-promotion-failure, but I would love to find an actual solution.

        Once you promoted a slab in old gen, it stays there, instead of being GC'd and replaced with a slab in new gen again.

        The bookkeeping might be worth it, yes.

        Jonathan Ellis added a comment -

        Adding region recycling post-flush

        Of course this has the same reference counting problem as JNA free does: we can't re-use the buffer until we're sure nobody's reading from it anymore.

        I'm inclined to just do in-heap buffers for the memtables to start with. Unlike cached rows, we're virtually guaranteed contention on the memtable objects if we try to do AtomicInteger-based reference counting, and sharded threadlocal counting is more of a pita than I want to tackle at the moment.

        I suspect that, like HBase, we'll realize substantial wins from arena allocation even on-heap.
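For reference, the AtomicInteger-based reference counting being declined here would look something like the following hypothetical sketch; `RefCountedMemtable`, `tryRef`, and `unref` are illustrative names. The contention concern is that every read and write would CAS this one shared counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of per-memtable reference counting: readers take a
// reference before touching the slabs and drop it afterwards; slabs may
// only be reused or freed once the count reaches zero after flush.
class RefCountedMemtable {
    private final AtomicInteger refs = new AtomicInteger(1); // 1 = live memtable

    // Returns false if the memtable was already retired.
    boolean tryRef() {
        for (;;) {
            int n = refs.get();
            if (n == 0)
                return false;
            if (refs.compareAndSet(n, n + 1))
                return true;
        }
    }

    // Returns true when this call dropped the last reference,
    // i.e. the slabs are now safe to recycle or free.
    boolean unref() {
        return refs.decrementAndGet() == 0;
    }
}
```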

        Jonathan Ellis added a comment -

        Rebased. Tests pass, but could use a glance at the counters to make sure it's not missing an opportunity to use the slab.

        Otherwise, I think this is ready to go based on the above discussion.

        Jonathan Ellis added a comment -

        Also, I would like to know who the clown is that estimated this as 0.4 hours.

        Brandon Williams added a comment -

        Also, I would like to know who the clown is that estimated this as 0.4 hours.

        Probably some suit.

        Stu Hood added a comment -

        +1

        • The SlabAllocator.waste method creates a local 'region' variable, but doesn't use it

        Other than that, this looks good. I won't have time to do any load testing this week, but I'll see if Chris is up for it.

        Jonathan Ellis added a comment -

        The SlabAllocator.waste method creates a local 'region' variable, but doesn't use it

        changed waste() to size() and updated Jamm with an option to ignore the size of the backing array of a ByteBuffer. Then getLiveSize is the jamm-estimated size of the non-buffer components, plus the total slab sizes. (This is more accurate since it includes memory from a slab region that has been used, but has since become waste because of Column conflict resolution.)

        committed w/ above change.

        Hudson added a comment -

        Integrated in Cassandra #1033 (See https://builds.apache.org/job/Cassandra/1033/)
        Arena allocation for memtables
        patch by jbellis and stuhood for CASSANDRA-2252

        jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1158927
        Files :

        • /cassandra/trunk/test/unit/org/apache/cassandra/db/ReadMessageTest.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/TimeSortTest.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/SuperColumn.java
        • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/context/IContext.java
        • /cassandra/trunk/src/java/org/apache/cassandra/utils/SlabAllocator.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/IColumnContainer.java
        • /cassandra/trunk/lib/jamm-0.2.2.jar
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/CounterMutationTest.java
        • /cassandra/trunk/lib/jamm-0.2.4.jar
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ThreadSafeSortedColumns.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/index/keys/KeysSearcher.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/ColumnFamilyTest.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamily.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/RowMutation.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ISortedColumns.java
        • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTableIdentityIterator.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/Column.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/IColumn.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/context/CounterContextTest.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/context/CounterContext.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ExpiringColumn.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/marshal/CounterColumnType.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/CounterUpdateColumn.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilySerializer.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnSerializer.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/CounterColumn.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/DeletedColumn.java
        • /cassandra/trunk/build.xml
        • /cassandra/trunk/src/java/org/apache/cassandra/db/ArrayBackedSortedColumns.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/ArrayBackedSortedColumnsTest.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/AbstractColumnContainer.java
        • /cassandra/trunk/src/java/org/apache/cassandra/streaming/IncomingStreamReader.java
        • /cassandra/trunk/src/java/org/apache/cassandra/utils/Allocator.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/CounterMutation.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/Memtable.java
        • /cassandra/trunk/conf/cassandra-env.sh
        • /cassandra/trunk/src/java/org/apache/cassandra/io/util/ColumnSortedMap.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageProxy.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/compaction/PrecompactedRow.java
        • /cassandra/trunk/src/java/org/apache/cassandra/utils/HeapAllocator.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/Row.java
        • /cassandra/trunk/src/java/org/apache/cassandra/io/IColumnSerializer.java
        • /cassandra/trunk/src/java/org/apache/cassandra/utils/ByteBufferUtil.java
        • /cassandra/trunk/test/unit/org/apache/cassandra/db/CounterColumnTest.java
        Yang Yang added a comment - edited

        SlabAllocator:

        private void tryRetireRegion(Region region)
        {
            if (currentRegion.compareAndSet(region, null))
            {
                filledRegions.add(region);
            }
        }

        Could you please explain why we need to add them to "filledRegions"? When all the buffers that share the same region die/become unreachable, shouldn't we just let the region go and free the memory? Then we would not be tying the region in memory through the references starting from filledRegions, no?

        just to confirm my thoughts, I looked at the HBase implementation:
        ./src/main/java/org/apache/hadoop/hbase/regionserver/MemStoreLAB.java

        /**
         * Try to retire the current chunk if it is still
         * <code>c</code>. Postcondition is that curChunk.get()
         * != c
         */
        private void tryRetireChunk(Chunk c) {
            @SuppressWarnings("unused")
            boolean weRetiredIt = curChunk.compareAndSet(c, null);
            // If the CAS succeeds, that means that we won the race
            // to retire the chunk. We could use this opportunity to
            // update metrics on external fragmentation.
            //
            // If the CAS fails, that means that someone else already
            // retired the chunk for us.
        }

        it does not tie it to a region list .

        the current result of tying regions together through the filledRegions is that all regions (even if those dead ones) always occupy memory. — well if the purpose is to count the size() held in allocator, should we just keep a int var of total size , or use weak references?

        Jonathan Ellis added a comment -

        The purpose is that we need to keep them alive until flush, so weak references would not work.

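To make the lifecycle concrete, here is a condensed, illustrative sketch of the region-retirement pattern under discussion. The names (Region, currentRegion, filledRegions) mirror the comments above, but this is a simplified stand-in under assumed semantics, not the actual Cassandra SlabAllocator:

```java
import java.nio.ByteBuffer;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only: a region-based allocator that retains
// filled regions until flush. Not the actual Cassandra source.
public class RegionAllocator
{
    static final int REGION_SIZE = 2 * 1024 * 1024;

    static class Region
    {
        final ByteBuffer data = ByteBuffer.allocate(REGION_SIZE);
        final AtomicInteger nextOffset = new AtomicInteger();

        // Carve a slice out of this region, or return null if full.
        ByteBuffer allocate(int size)
        {
            while (true)
            {
                int offset = nextOffset.get();
                if (offset + size > REGION_SIZE)
                    return null;
                if (nextOffset.compareAndSet(offset, offset + size))
                {
                    ByteBuffer slice = data.duplicate();
                    slice.position(offset);
                    slice.limit(offset + size);
                    return slice.slice();
                }
            }
        }
    }

    final AtomicReference<Region> currentRegion = new AtomicReference<>();
    // Filled regions are retained deliberately: the memtable's data
    // lives inside them, so they must stay reachable until flush.
    final Queue<Region> filledRegions = new ConcurrentLinkedQueue<>();

    public ByteBuffer allocate(int size)
    {
        while (true)
        {
            Region region = currentRegion.get();
            if (region == null)
            {
                currentRegion.compareAndSet(null, new Region());
                continue;
            }
            ByteBuffer slice = region.allocate(size);
            if (slice != null)
                return slice;
            // Region is full: exactly one thread wins the CAS and moves
            // it to filledRegions; either way, retry with a fresh region.
            if (currentRegion.compareAndSet(region, null))
                filledRegions.add(region);
        }
    }
}
```

Note how tryRetireRegion's CAS ensures only one thread records a full region, while filledRegions keeps every filled region strongly reachable until whoever flushes the memtable drops the allocator.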
        Yang Yang added a comment -

        Thanks Jonathan,
        but why do we need them alive? For example:

        I create a 2MB region, which is carved up into 100 ByteBuffers. Each of these ByteBuffers points into the "data" of the Region, so as long as any one of them is live, the bytes pointed to by Region.data stay on the heap. But if all 100 ByteBuffers die, isn't it our goal to free the 2MB region, since no one is using it?

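The reachability point in this question can be demonstrated directly with java.nio: a slice shares its parent's backing array, so a single live slice pins the entire 2MB allocation. This is an illustrative snippet, not project code:

```java
import java.nio.ByteBuffer;

// Demonstrates that a slice carved from a region shares the region's
// backing array: one live slice keeps the whole 2MB array reachable.
public class SliceReachability
{
    public static boolean sharesBackingArray()
    {
        ByteBuffer region = ByteBuffer.allocate(2 * 1024 * 1024);
        region.limit(100);                 // carve out a 100-byte buffer
        ByteBuffer slice = region.slice();
        // The slice's array is the region's whole 2MB array, not a copy.
        return slice.array() == region.array() && slice.capacity() == 100;
    }
}
```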
        Jonathan Ellis added a comment -

        I see what you mean: you want to allow a Region to "die" if all the buffers it was responsible for have been overwritten and thus the Region is no longer needed for flush. In that case, weak references sound reasonable. However, I'm not sure it would make any difference in practice since we allocate row keys into the regions, as well.

        Yang Yang added a comment - edited

        My line of thought comes from update-heavy workloads (counters etc.). Conceivably (putting the row key issue aside for a while), one region would contain ByteBuffer values of similar age. As more updates come in, all the columns in older regions are likely to have died out, allowing us to free the entire region before flushing happens.

        Coming back to the row key issue: in the original slab allocator paper (Jeff Bonwick), a slab contains strictly objects of the same kind, which implies that they die at roughly the same time. If they don't, then yes, in our case the slab has the disadvantage that an entire slab (2MB worth of memory) is held simply because one row key in it is not dead yet. To overcome this disadvantage, we probably need to further distinguish between the object types allocated in a slab: this JIRA (same as the HBase code) distinguishes allocations between different memtables; to work better with update-heavy traffic, we need to distinguish between row keys and column values (they have different lifetimes).

        Yang Yang added a comment -

        Actually, even if we don't do the further optimization suggested in the last comment (separating row keys and column values into different slab allocators), it would still very likely work better and kill off some dead regions.

        Let's say a row/column is continually updated 1000 times, and 100 column values fit into 2MB. Then to do these 1000 updates we allocate 10 regions, and only the first region contains the row key. In the end the 8 regions in the middle all die: the first remains due to the row key, and the last remains due to the latest (live) column value.

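The arithmetic in this example can be checked with a short, purely illustrative sketch; the method names here are made up for the sketch and are not part of any real API:

```java
// Illustrative check of the example above: 1000 sequential updates to
// one column, 100 column values per 2MB region.
public class RegionLiveness
{
    // Regions allocated to hold every version of the value (rounded up).
    static int regionsAllocated(int updates, int valuesPerRegion)
    {
        return (updates + valuesPerRegion - 1) / valuesPerRegion;
    }

    // The first region is pinned by the row key and the last by the
    // live value; everything in between could be collected before flush.
    static int collectableRegions(int updates, int valuesPerRegion)
    {
        return Math.max(0, regionsAllocated(updates, valuesPerRegion) - 2);
    }
}
```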
        Yang Yang added a comment - edited

        Hi Jonathan:

        I checked the counters code; they currently use HeapAllocator. What is the reason we don't yet use SlabAllocator for counters?

        Thanks

        Also, I put the idea of using 2 SlabAllocators (one for long-lived buffers, one for short-lived ones) in https://github.com/yangyangyyy/cassandra/commit/bc017835c64240e58c0c51b2d5f8793f3c7f3a76

        https://github.com/yangyangyyy/cassandra/commit/8431ca1b9586086073e6b81d346a06e8172a97e7
        Maybe it is useful.

        Jonathan Ellis added a comment -

        We only have a reference to the SlabAllocator if it's a memtable-related operation; otherwise we have to use the HeapAllocator. Is that what you're referring to?

        Yang Yang added a comment - edited

        "if it's a memtable-related operation" ---- but CounterContext.allocate produces a ByteBuffer, some of which goes into a CounterColumn, and hence into the Memtable, it seems.

        For example:

        CounterMutation.computeShardMerger() ==> CounterColumn.computeOldShardMerger()
        ===> ByteBuffer contextManager.computeOldShardMerger()

            {
                ......
                ContextState merger = ContextState.allocate(2, nbDelta, HeapAllocator.instance);
                ....
                return merger.context;
            }

        The "merger.context" is a ByteBuffer that is inserted into a CounterColumn by CounterColumn.computeOldShardMerger().

        Thanks
        Yang

        Jonathan Ellis added a comment -

        Happy to look at a patch to fix that. Please open a new ticket.

        Yang Yang added a comment -

        Cool. Actually, after some thought, I think we need to put more care into utilizing SlabAllocator for counters.

        I realized this when you said "only if it's a memtable-related operation": that would be very true for some temporary-variable ByteBuffers, which are thrown away immediately, and hence get reclaimed in the new-gen GC and never go into the old gen.

        For counters, the column values (which contain the CounterContext) change a lot. If we assume that the value of each counter is updated 1000 times during the lifetime of a memtable before it is flushed, then if you look at a typical 2MB slab that was allocated, 99.9% of the buffers it contains are going to be non-reachable and GC'ed before flushing. So when the remaining 0.1% of buffers are promoted, each one occupies 2MB of space instead of its actual size, which would be more waste than the fragmentation problem it avoids.

        So in this case (or, more generally, in all cases where updates dominate), using HeapAllocator may be better.

        Thanks
        Yang

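The promotion-waste argument above can be put in numbers with a small, purely illustrative sketch; the slab size and per-value size here are assumptions for the example, not measured figures:

```java
// Illustrative arithmetic: if only one buffer per otherwise-dead 2MB
// slab is still live at promotion time, that survivor pins the whole
// slab instead of just its own bytes.
public class PromotionWaste
{
    static final long SLAB_SIZE = 2L * 1024 * 1024;

    // Old-gen bytes retained per live value when a whole slab is
    // promoted for its sake.
    static long retainedPerLiveValueSlab()
    {
        return SLAB_SIZE;
    }

    // Old-gen bytes retained per live value with plain heap allocation.
    static long retainedPerLiveValueHeap(long valueSize)
    {
        return valueSize;
    }

    // How many times more memory the slab approach retains per survivor.
    static long wasteFactor(long valueSize)
    {
        return retainedPerLiveValueSlab() / retainedPerLiveValueHeap(valueSize);
    }
}
```

With an assumed 64-byte counter context, a lone survivor retains 2MB under the slab scheme versus 64 bytes on the heap, which is the trade-off the comment describes.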

          People

          • Assignee: Jonathan Ellis
          • Reporter: Jonathan Ellis
          • Reviewer: Stu Hood
          • Votes: 6
          • Watchers: 24
