HBase
  1. HBase
  2. HBASE-5347

GC free memory management in Level-1 Block Cache

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      On eviction of a block from the block-cache, instead of waiting for the garbage collecter to reuse its memory, reuse the block right away.

      This will require us to keep reference counts on the HFile blocks. Once we have the reference counts in place we can do our own simple blocks-out-of-slab allocation for the block-cache.

      This will help us with

      • reducing gc pressure, especially in the old generation
      • making it possible to have non-java-heap memory backing the HFile blocks

        Activity

        Hide
        stack added a comment -

        Very nice idea (if the reference counting can be figured).

        Show
        stack added a comment - Very nice idea (if the reference counting can be figured).
        Hide
        stack added a comment -

        Blocks are not all of the same size. Will this be an issue? Blocks of an awkward size – say a block that happen to have a massive KeyValue in them and they exceed massively the default block size – would need to be treated differently?

        Show
        stack added a comment - Blocks are not all of the same size. Will this be an issue? Blocks of an awkward size – say a block that happen to have a massive KeyValue in them and they exceed massively the default block size – would need to be treated differently?
        Hide
        Prakash Khemani added a comment -

        initial diff for feedback https://reviews.facebook.net/D1635

        Show
        Prakash Khemani added a comment - initial diff for feedback https://reviews.facebook.net/D1635
        Hide
        Prakash Khemani added a comment -

        another advantage of this approach will be that we will be able to get rid of low/high water marks in LRUBlockCache and make block eviction synchronous with demand. The default value of the watermarks is set to 75% and 85% (in 89). That means we waste somewhere around 20% of the block-cache today because of asynchronous garbage collection.

        Show
        Prakash Khemani added a comment - another advantage of this approach will be that we will be able to get rid of low/high water marks in LRUBlockCache and make block eviction synchronous with demand. The default value of the watermarks is set to 75% and 85% (in 89). That means we waste somewhere around 20% of the block-cache today because of asynchronous garbage collection.
        Hide
        Ted Yu added a comment -

        Agreed.
        This initiative is on the right track.

        Show
        Ted Yu added a comment - Agreed. This initiative is on the right track.
        Hide
        Lars Hofhansl added a comment -

        Nice patch.
        How does this interact with the off-heap cache effort?

        Show
        Lars Hofhansl added a comment - Nice patch. How does this interact with the off-heap cache effort?
        Hide
        Lars Hofhansl added a comment -

        Seems to me that we should target this for 0.96.

        Show
        Lars Hofhansl added a comment - Seems to me that we should target this for 0.96.
        Hide
        Prakash Khemani added a comment -

        If this approach works then we will not need to do off-heap caching anymore.

        Show
        Prakash Khemani added a comment - If this approach works then we will not need to do off-heap caching anymore.
        Hide
        Phabricator added a comment -

        khemani has commented on the revision "HBASE-5347 [jira] GC free memory management in Level-1 Block Cache".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2099 Stack, there is a lot of cleanup needed. I will get back to your feedback soon.

        Regarding the ReferenceQueue mechanism and the fact that it gets cleaned up at best at GC speeds ... yes you are right that if all dead KVs ended up on this reference-queue then this approach won't be any better. But if a kv.deref() is called explicitly before the kv is put up for garbage collection then that kv will not land on this reference-queue.

        The reference-queue is only there to catch those instances where we might have forgotten to do an explicit kv.deref(). Hopefully there won't be too many such cases.

        REVISION DETAIL
        https://reviews.facebook.net/D1635

        Show
        Phabricator added a comment - khemani has commented on the revision " HBASE-5347 [jira] GC free memory management in Level-1 Block Cache". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/KeyValue.java:2099 Stack, there is a lot of cleanup needed. I will get back to your feedback soon. Regarding the ReferenceQueue mechanism and the fact that it gets cleaned up at best at GC speeds ... yes you are right that if all dead KVs ended up on this reference-queue then this approach won't be any better. But if a kv.deref() is called explicitly before the kv is put up for garbage collection then that kv will not land on this reference-queue. The reference-queue is only there to catch those instances where we might have forgotten to do an explicit kv.deref(). Hopefully there won't be too many such cases. REVISION DETAIL https://reviews.facebook.net/D1635
        Hide
        Phabricator added a comment -

        khemani updated the revision "HBASE-5347 [jira] GC free memory management in Level-1 Block Cache".
        Reviewers: Kannan, mbautin, dhruba, nspiegelberg, stack, JIRA

        In this iteration I have been testing these changes by starting in the region server a 1000 threads that continuously call incrementColumnValue() on large random keys. In this patch there are some temporary changes to HRegionInterface to do this testing ...

        In the test that I run on my dev cluster, there is enough data in the test table, the java heap is set to 24GB, block-cache is set to 60% of the heap. Without this patch I pretty soon run into GC pauses. With this patch, and even without any blocks-out-of-large-slab allocation and without changing to synchronous eviction in lru block cache, there are no GC pauses.

        So, there are two kinds of objects being reference counted in this patch - HFileBlocks and KeyValues. HFileBlocks are easy to reference counts because they only exist in HFileScanners and below. KeyValues (those ones who directly refer to data in HFileBlock) are more difficult to reference count.

        My earlier approach was that the StoreFileScanner will hand out pinned key-values and the higher layers will not have to bother about deref()ing the key-values. The key-values will be deref()'d only when they are written out. But this approach has many issues and quickly becomes non-intuitive. For example, with that older approach, StoreFileScanner will not be able to deref() its cur kv when it is closed and it will have to rely on the GC for cleanup.

        In the current approach you don't have to ref() the kv that you get from a scanner if you are going to keep it beyond the next scanner's next(), seek(), reseek() or close() call. (If you don't ref it then you will soon get a nullptrexception). You should also deref it when you are done, but it is not absolutely required.

        I have lot more cleanup to do. I will revert some of the method name changes. I will refactor the code in new classes. There is lot many tests to be written. I will re-write this in a way such that it can be run with or without the hfile-block-pool. The block pool enhancements and lru-block cache enhancements in separate diffs.

        This changes the programming model quite a bit ...

        Some number from the test that I am running

        totKVRef 49046639
        totKVDeref 49022691
        totKVDead 17
        totKVDeadDeadDead 1839873

        49 million KVs have been reference counted. Out of that somehow the code forgets to deref 17 of them. About 2 million of them end up on the deadKVs ReferenceQueue even though they have been properly deref'd ... don't know why this happens.

        curHFBPoolBlocks 165746
        curHFBPoolCapacity 11541225472
        curHFBPoolFreeBlocks 22050
        curHFBPoolFreeCapacity 1535385600

        The block cache has 10 GB of active blocks. 1.5GB is maintained in the free pool by the eviction process. We will be able to use all the block cache memory (that is around 14.5GB) once LRUBlockCache eviction is made synchronous.

        totHFBPoolSystemAlloc 235873
        totHFBPoolSystemFree 60075
        totHFBPoolBlocksAllocated 165746
        totHFBPoolLargeBlocksAllocated 70127
        totHFBPoolLargeBlocksFreed 60075
        totHFBPoolReuse 5459025

        235K allocations were done from the system. Some of these were blocks larger than 69K so they were not managed by the pool. The blocks got resused 5.5m times.

        216732

        REVISION DETAIL
        https://reviews.facebook.net/D1635

        AFFECTED FILES
        pom.xml
        src/main/java/org/apache/hadoop/hbase/KeyValue.java
        src/main/java/org/apache/hadoop/hbase/client/HTable.java
        src/main/java/org/apache/hadoop/hbase/client/Result.java
        src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileScanner.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV1.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
        src/main/java/org/apache/hadoop/hbase/io/hfile/SimpleBlockCache.java
        src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java
        src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
        src/main/java/org/apache/hadoop/hbase/regionserver/GetClosestRowBeforeTracker.java
        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
        src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java
        src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
        src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
        src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
        src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
        src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java
        src/main/java/org/apache/hadoop/hbase/util/CompoundBloomFilter.java
        src/main/java/org/apache/hadoop/hbase/util/Counters.java
        src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListElement.java
        src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListHead.java
        src/test/java/org/apache/hadoop/hbase/io/TestHalfStoreFileReader.java
        src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java
        src/test/java/org/apache/hadoop/hbase/io/hfile/TestCachedBlockQueue.java
        src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFile.java
        src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlock.java
        src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
        src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileWriterV2.java
        src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java
        src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java
        src/test/java/org/apache/hadoop/hbase/mapreduce/TestHFileOutputFormat.java
        src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
        src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java

        Show
        Phabricator added a comment - khemani updated the revision " HBASE-5347 [jira] GC free memory management in Level-1 Block Cache". Reviewers: Kannan, mbautin, dhruba, nspiegelberg, stack, JIRA In this iteration I have been testing these changes by starting in the region server a 1000 threads that continuously call incrementColumnValue() on large random keys. In this patch there are some temporary changes to HRegionInterface to do this testing ... In the test that I run on my dev cluster, there is enough data in the test table, the java heap is set to 24GB, block-cache is set to 60% of the heap. Without this patch I pretty soon run into GC pauses. With this patch, and even without any blocks-out-of-large-slab allocation and without changing to synchronous eviction in lru block cache, there are no GC pauses. So, there are two kinds of objects being reference counted in this patch - HFileBlocks and KeyValues. HFileBlocks are easy to reference counts because they only exist in HFileScanners and below. KeyValues (those ones who directly refer to data in HFileBlock) are more difficult to reference count. My earlier approach was that the StoreFileScanner will hand out pinned key-values and the higher layers will not have to bother about deref()ing the key-values. The key-values will be deref()'d only when they are written out. But this approach has many issues and quickly becomes non-intuitive. For example, with that older approach, StoreFileScanner will not be able to deref() its cur kv when it is closed and it will have to rely on the GC for cleanup. In the current approach you don't have to ref() the kv that you get from a scanner if you are going to keep it beyond the next scanner's next(), seek(), reseek() or close() call. (If you don't ref it then you will soon get a nullptrexception). You should also deref it when you are done, but it is not absolutely required. I have lot more cleanup to do. I will revert some of the method name changes. I will refactor the code in new classes. There is lot many tests to be written. I will re-write this in a way such that it can be run with or without the hfile-block-pool. The block pool enhancements and lru-block cache enhancements in separate diffs. This changes the programming model quite a bit ... Some number from the test that I am running totKVRef 49046639 totKVDeref 49022691 totKVDead 17 totKVDeadDeadDead 1839873 49 million KVs have been reference counted. Out of that somehow the code forgets to deref 17 of them. About 2 million of them end up on the deadKVs ReferenceQueue even though they have been properly deref'd ... don't know why this happens. curHFBPoolBlocks 165746 curHFBPoolCapacity 11541225472 curHFBPoolFreeBlocks 22050 curHFBPoolFreeCapacity 1535385600 The block cache has 10 GB of active blocks. 1.5GB is maintained in the free pool by the eviction process. We will be able to use all the block cache memory (that is around 14.5GB) once LRUBlockCache eviction is made synchronous. totHFBPoolSystemAlloc 235873 totHFBPoolSystemFree 60075 totHFBPoolBlocksAllocated 165746 totHFBPoolLargeBlocksAllocated 70127 totHFBPoolLargeBlocksFreed 60075 totHFBPoolReuse 5459025 235K allocations were done from the system. Some of these were blocks larger than 69K so they were not managed by the pool. The blocks got resused 5.5m times. 216732 REVISION DETAIL https://reviews.facebook.net/D1635 AFFECTED FILES pom.xml src/main/java/org/apache/hadoop/hbase/KeyValue.java src/main/java/org/apache/hadoop/hbase/client/HTable.java src/main/java/org/apache/hadoop/hbase/client/Result.java src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileScanner.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV1.java src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java src/main/java/org/apache/hadoop/hbase/io/hfile/SimpleBlockCache.java src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java src/main/java/org/apache/hadoop/hbase/regionserver/GetClosestRowBeforeTracker.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java src/main/java/org/apache/hadoop/hbase/regionserver/Store.java src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java src/main/java/org/apache/hadoop/hbase/util/CompoundBloomFilter.java src/main/java/org/apache/hadoop/hbase/util/Counters.java src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListElement.java src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListHead.java src/test/java/org/apache/hadoop/hbase/io/TestHalfStoreFileReader.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestCachedBlockQueue.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFile.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlock.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileWriterV2.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java src/test/java/org/apache/hadoop/hbase/mapreduce/TestHFileOutputFormat.java src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
        Hide
        Phabricator added a comment -

        tedyu has commented on the revision "HBASE-5347 [jira] GC free memory management in Level-1 Block Cache".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2193 Should we use some timeout for the remove() call ?
        remove() is a blocking call, see:
        http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/ref/ReferenceQueue.html#remove%28%29

        Would the current code make the cleanup thread in region server less useful ?
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2207 Please rename this counter totKVOnRefQueue or something similar
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2260 When would the return value be used ?
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2245 this.bytes is cleared upon exit of deref().
        Please document this in javadoc.

        REVISION DETAIL
        https://reviews.facebook.net/D1635

        Show
        Phabricator added a comment - tedyu has commented on the revision " HBASE-5347 [jira] GC free memory management in Level-1 Block Cache". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/KeyValue.java:2193 Should we use some timeout for the remove() call ? remove() is a blocking call, see: http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/ref/ReferenceQueue.html#remove%28%29 Would the current code make the cleanup thread in region server less useful ? src/main/java/org/apache/hadoop/hbase/KeyValue.java:2207 Please rename this counter totKVOnRefQueue or something similar src/main/java/org/apache/hadoop/hbase/KeyValue.java:2260 When would the return value be used ? src/main/java/org/apache/hadoop/hbase/KeyValue.java:2245 this.bytes is cleared upon exit of deref(). Please document this in javadoc. REVISION DETAIL https://reviews.facebook.net/D1635
        Hide
        Phabricator added a comment -

        mbautin has commented on the revision "HBASE-5347 [jira] GC free memory management in Level-1 Block Cache".

        Looks pretty good (provided that it works as intended). Some comments inline.

        Could you please add some description of the pinning/unpinning and ref/deref concept (and the difference between the two) somewhere in javadoc? Otherwise a lot of this code is very difficult to read. The added complexity is justified by the performance benefit, but we will have to maintain this code for a long time, so it should be as self-documenting as possible (think about new developers joining the project).

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2180 Remove this line
        src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListElement.java:1 License
        src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListHead.java:1 License
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:1222 It would be useful to mention in the javadoc that the KV is being dereferenced in this method because it does not refer to the backing array of an HFile block anymore.
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2091 Could you put the code specific to reference counting code into some other class? KeyValue is already huge.
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2141-2186 Make these counters final.
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2192 Remove this line and other commented-out code in this function.
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2222 Is this increment thread-safe?
        src/main/java/org/apache/hadoop/hbase/KeyValue.java:2237 Naming conventions: listNum.

        A more serious comment: is it possible that one thread will call ref() much more often? If so, wouldn't it be better to load-balance between in-use lists randomly? Note that with the current solution we still have to lock at an individual list level, because multiple threads can still map to the same in-use list. But I agree that hashing on thread id decreases the amount of contention.
        src/main/java/org/apache/hadoop/hbase/client/HTable.java:623-658 • It might not be a good idea to add test-only methods to HTable, because that class is used directly by a lot of clients. In other words, clients pass around HTable instances and don't cast them to HTableInterface.

        I think we should move these methods to a test-only class and/or make them package-private.

        • Replace "anyrandomrow" with a constant.
        src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java:112 This method name change is not terribly useful without an explanation of what "pinned" is. Would it be better to explain this in javadoc and keep the old name? This also applies to all other changes of method names including the word "Pinned" or "AndPin".
        src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java:39 Are these name changes necessary? Maybe it is better to just add javadoc explaining the "pinning" part.
        src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java:322 Nice refactoring
        src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java:39-43 We definitely need javadoc describing what pin/unpin is and how to use it.
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java:373 What exactly is pinned here, and how would the caller unpin it, if that's the intent of the method name change? The return value is an output stream.
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:119 Make this final?
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:168 initialize2 is not a very descriptive counter name
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:1092 Is it possible to separate "pool allocator" interface from implementation, rather than just calling a static method? This way we could swap in alternative allocators in tests.
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:1499 Please move the pool code to a separate class. Also, it would be nice to separate pool allocator interface from implementation, as I mentioned.
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:196 Should these be IOExceptions, since the method already throws them?
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java:412 "Inx" is a very rare abbreviation for "index". Better to spell out "Index".
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java:694-695 Make final
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:1628-1669 Make these final.
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:408 Would the additional copy cause a performance regression based on this method's use cases?
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:417 Would the additional copy cause a performance regression based on this method's use cases?
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileScanner.java:128 add a "}" at the end of the "{@link KeyValue"
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileScanner.java:127 in any HFileBlock
        src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java:390-391 Make final
        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:2874-2876 Make final
        src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:3071 Fix formatting ("i =0").
        src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java:2812-2878 This has to be in a separate class.
        src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java:410-411 Make final
        src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java:1404 Do we unpin on close? As far as I remember, this used to be a wrapper over a ByteArrayInputStream.
        src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java:399-400 Make final
        src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java:547-548 Make final
        src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java:37 Why would we need to log into this class's logger from a subclass? Please define another logger in the subclass directly instead.
        src/main/java/org/apache/hadoop/hbase/util/Counters.java:1 License
        src/main/java/org/apache/hadoop/hbase/util/Counters.java:25 Naming convention: rename to COUNTER_CLASSES
        src/main/java/org/apache/hadoop/hbase/util/Counters.java:23 This class probably needs a better name, since it is very specific kind of counters accessed through reflection.
        src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListElement.java:1 License
        src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListHead.java:1 License
        src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListHead.java:4 Make private?
        src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java:83 Most changes in this class and a lot of other changes in this diff are simply name changes. Not sure if they are necessary. One advantage is definitely in reminding the caller to unpin blocks.

        REVISION DETAIL
        https://reviews.facebook.net/D1635

        Show
        Phabricator added a comment - mbautin has commented on the revision " HBASE-5347 [jira] GC free memory management in Level-1 Block Cache". Looks pretty good (provided that it works as intended). Some comments inline. Could you please add some description of the pinning/unpinning and ref/deref concept (and the difference between the two) somewhere in javadoc? Otherwise a lot of this code is very difficult to read. The added complexity is justified by the performance benefit, but we will have to maintain this code for a long time, so it should be as self-documenting as possible (think about new developers joining the project). INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/KeyValue.java:2180 Remove this line src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListElement.java:1 License src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListHead.java:1 License src/main/java/org/apache/hadoop/hbase/KeyValue.java:1222 It would be useful to mention in the javadoc that the KV is being dereferenced in this method because it does not refer to the backing array of an HFile block anymore. src/main/java/org/apache/hadoop/hbase/KeyValue.java:2091 Could you put the code specific to reference counting code into some other class? KeyValue is already huge. src/main/java/org/apache/hadoop/hbase/KeyValue.java:2141-2186 Make these counters final. src/main/java/org/apache/hadoop/hbase/KeyValue.java:2192 Remove this line and other commented-out code in this function. src/main/java/org/apache/hadoop/hbase/KeyValue.java:2222 Is this increment thread-safe? src/main/java/org/apache/hadoop/hbase/KeyValue.java:2237 Naming conventions: listNum. A more serious comment: is it possible that one thread will call ref() much more often? If so, wouldn't it be better to load-balance between in-use lists randomly? Note that with the current solution we still have to lock at an individual list level, because multiple threads can still map to the same in-use list. But I agree that hashing on thread id decreases the amount of contention. src/main/java/org/apache/hadoop/hbase/client/HTable.java:623-658 • It might not be a good idea to add test-only methods to HTable, because that class is used directly by a lot of clients. In other words, clients pass around HTable instances and don't cast them to HTableInterface. I think we should move these methods to a test-only class and/or make them package-private. • Replace "anyrandomrow" with a constant. src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java:112 This method name change is not terribly useful without an explanation of what "pinned" is. Would it be better to explain this in javadoc and keep the old name? This also applies to all other changes of method names including the word "Pinned" or "AndPin". src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java:39 Are these name changes necessary? Maybe it is better to just add javadoc explaining the "pinning" part. src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java:322 Nice refactoring src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java:39-43 We definitely need javadoc describing what pin/unpin is and how to use it. src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java:373 What exactly is pinned here, and how would the caller unpin it, if that's the intent of the method name change? The return value is an output stream. src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:119 Make this final? src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:168 initialize2 is not a very descriptive counter name src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:1092 Is it possible to separate "pool allocator" interface from implementation, rather than just calling a static method? This way we could swap in alternative allocators in tests. src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:1499 Please move the pool code to a separate class. Also, it would be nice to separate pool allocator interface from implementation, as I mentioned. src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:196 Should these be IOExceptions, since the method already throws them? src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java:412 "Inx" is a very rare abbreviation for "index". Better to spell out "Index". src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java:694-695 Make final src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:1628-1669 Make these final. src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:408 Would the additional copy cause a performance regression based on this method's use cases? src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:417 Would the additional copy cause a performance regression based on this method's use cases? src/main/java/org/apache/hadoop/hbase/io/hfile/HFileScanner.java:128 add a "}" at the end of the "{@link KeyValue" src/main/java/org/apache/hadoop/hbase/io/hfile/HFileScanner.java:127 in any HFileBlock src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java:390-391 Make final src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:2874-2876 Make final src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:3071 Fix formatting ("i =0"). src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java:2812-2878 This has to be in a separate class. src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java:410-411 Make final src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java:1404 Do we unpin on close? As far as I remember, this used to be a wrapper over a ByteArrayInputStream. src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java:399-400 Make final src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java:547-548 Make final src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java:37 Why would we need to log into this class's logger from a subclass? Please define another logger in the subclass directly instead. src/main/java/org/apache/hadoop/hbase/util/Counters.java:1 License src/main/java/org/apache/hadoop/hbase/util/Counters.java:25 Naming convention: rename to COUNTER_CLASSES src/main/java/org/apache/hadoop/hbase/util/Counters.java:23 This class probably needs a better name, since it is very specific kind of counters accessed through reflection. src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListElement.java:1 License src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListHead.java:1 License src/main/java/org/apache/hadoop/hbase/util/DoublyLinkedListHead.java:4 Make private? src/test/java/org/apache/hadoop/hbase/io/hfile/TestLruBlockCache.java:83 Most changes in this class and a lot of other changes in this diff are simply name changes. Not sure if they are necessary. One advantage is definitely in reminding the caller to unpin blocks. REVISION DETAIL https://reviews.facebook.net/D1635
        Hide
        Todd Lipcon added a comment -

        Hey folks. I haven't been through the patch yet, but just wanted to throw out one idea that I think can make reference-counted systems a little simpler: in Cocoa (the OSX development framework) there's a class called NSAutoreleasePool, an instance of which is carried around as part of the local thread context. You can then call "autorelease" on any object, which will not immediately decrement the ref count, but adds it to the pool. When you release the pool, all referenced objects are decremented at that point.

        This idea might make it easier to manage references. For example, when something is read by a scanner, it could be read with ref count incremented but put on the request's autorelease pool. Then, when any IPC handler thread is returned to the thread pool, the auto release pool could be decremented. This ensures that any stuff we reference is kept around for the whole request lifecycle but still automatically dereffed at the end.

        Do you think such a construct would be useful here?

        https://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/MemoryMgmt/Articles/mmAutoreleasePools.html has some more info.

        Show
        Todd Lipcon added a comment - Hey folks. I haven't been through the patch yet, but just wanted to throw out one idea that I think can make reference-counted systems a little simpler: in Cocoa (the OSX development framework) there's a class called NSAutoreleasePool, an instance of which is carried around as part of the local thread context. You can then call "autorelease" on any object, which will not immediately decrement the ref count, but adds it to the pool. When you release the pool, all referenced objects are decremented at that point. This idea might make it easier to manage references. For example, when something is read by a scanner, it could be read with ref count incremented but put on the request's autorelease pool. Then, when any IPC handler thread is returned to the thread pool, the auto release pool could be decremented. This ensures that any stuff we reference is kept around for the whole request lifecycle but still automatically dereffed at the end. Do you think such a construct would be useful here? https://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/MemoryMgmt/Articles/mmAutoreleasePools.html has some more info.
        Hide
        Ted Yu added a comment -

        In filterRow(), we should deref the KeyValue which is removed from the list of KeyValues parameter.
        In 0.89-fb branch, I found DependentColumnFilter where the above change should be added.

        Show
        Ted Yu added a comment - In filterRow(), we should deref the KeyValue which is removed from the list of KeyValues parameter. In 0.89-fb branch, I found DependentColumnFilter where the above change should be added.
        Hide
        Prakash Khemani added a comment -

        Thanks for the reviews. I will make another pass pretty soon.

        Todd, What you have suggested will not help when a piece of code wants to
        bump up the reference count because it is keeping the reference to the
        object in some state variable. (for example the store scanner does it
        after a flush). But your suggestion will work in those cases where a piece
        of code forgets to decrement the reference count. The current patch also
        has similar characteristics. It cannot automatically bump up the reference
        count. But if it forgets to do dereference and decrement the ref count
        then it relies on WeakReferences and garbage collection to do the same.

        On 2/19/12 12:56 AM, "Zhihong Yu (Commented) (JIRA)" <jira@apache.org>

        Show
        Prakash Khemani added a comment - Thanks for the reviews. I will make another pass pretty soon. Todd, What you have suggested will not help when a piece of code wants to bump up the reference count because it is keeping the reference to the object in some state variable. (for example the store scanner does it after a flush). But your suggestion will work in those cases where a piece of code forgets to decrement the reference count. The current patch also has similar characteristics. It cannot automatically bump up the reference count. But if it forgets to do dereference and decrement the ref count then it relies on WeakReferences and garbage collection to do the same. On 2/19/12 12:56 AM, "Zhihong Yu (Commented) (JIRA)" <jira@apache.org>
        Hide
        Lars Hofhansl added a comment -

        @Ted: That might be right thing to do in filterRow(), but I am slowly getting this feeling that this will put a lot of onus to account for this on various HBase parts (such as filters).

        Show
        Lars Hofhansl added a comment - @Ted: That might be right thing to do in filterRow(), but I am slowly getting this feeling that this will put a lot of onus to account for this on various HBase parts (such as filters).
        Hide
        Phabricator added a comment -

        khemani has commented on the revision "HBASE-5347 [jira] GC free memory management in Level-1 Block Cache".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:417 I had changed getKey() and getValue() to copy the data because these aren't in any performance critical path
        src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java:1404 yes, we unpin on close. That is why I had to change the return of getGeneralBloomFilter() from DataInput to DataInputStream

        REVISION DETAIL
        https://reviews.facebook.net/D1635

        Show
        Phabricator added a comment - khemani has commented on the revision " HBASE-5347 [jira] GC free memory management in Level-1 Block Cache". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:417 I had changed getKey() and getValue() to copy the data because these aren't in any performance critical path src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java:1404 yes, we unpin on close. That is why I had to change the return of getGeneralBloomFilter() from DataInput to DataInputStream REVISION DETAIL https://reviews.facebook.net/D1635
        Hide
        Ted Yu added a comment -

        @Lars:
        Just saw your comment @ 20/Feb/12 00:08

        I agree with your observation - it is hard to know whether user filters would do the right action.
        I was thinking that the deref can be done outside filterRow() - in nextInternal() e.g.
        How about:
        1. call deref() for every element in results
        2. call filterRow()
        3. call ref() on the remaining elements in results

        Show
        Ted Yu added a comment - @Lars: Just saw your comment @ 20/Feb/12 00:08 I agree with your observation - it is hard to know whether user filters would do the right action. I was thinking that the deref can be done outside filterRow() - in nextInternal() e.g. How about: 1. call deref() for every element in results 2. call filterRow() 3. call ref() on the remaining elements in results
        Hide
        Lars Hofhansl added a comment -

        That should work, and it would not put any burden on the writer on the author of a new Filter.
        I need to look at the patch again to get a sense of a extend of extra code we need to sprinkle all over the HBase code base to account for the KVs. On first path it didn't seem horrid.

        I am generally curious, though, (almost) all the cached blocks have the exact same size, right? The GC (in theory) should have a relatively easy time to reuse the freed space (and it's interesting - to me anyway - that it doesn't do this efficiently).

        Show
        Lars Hofhansl added a comment - That should work, and it would not put any burden on the writer on the author of a new Filter. I need to look at the patch again to get a sense of a extend of extra code we need to sprinkle all over the HBase code base to account for the KVs. On first path it didn't seem horrid. I am generally curious, though, (almost) all the cached blocks have the exact same size, right? The GC (in theory) should have a relatively easy time to reuse the freed space (and it's interesting - to me anyway - that it doesn't do this efficiently).
        Hide
        Ted Yu added a comment -

        I think we can introduce a new config parameter called 'hbase.filter.auto.refcount' e.g.
        If the value of this config is true (default), we use procedure described @ 22/Feb/12 23:02
        If the value is false, we assume (with good release notes of course) the filter writer is fully aware of what he/she is doing.

        Show
        Ted Yu added a comment - I think we can introduce a new config parameter called 'hbase.filter.auto.refcount' e.g. If the value of this config is true (default), we use procedure described @ 22/Feb/12 23:02 If the value is false, we assume (with good release notes of course) the filter writer is fully aware of what he/she is doing.
        Hide
        Todd Lipcon added a comment -

        Rather than introduce a global config, each filter could use an annotation or marker (ie empty) interface, eg:

        class MyFilter extends FilterBase implements RefCountAware
        

        or

        @RefCountAware
        class MyFilter extends FilterBase
        
        Show
        Todd Lipcon added a comment - Rather than introduce a global config, each filter could use an annotation or marker (ie empty) interface, eg: class MyFilter extends FilterBase implements RefCountAware or @RefCountAware class MyFilter extends FilterBase
        Hide
        Prakash Khemani added a comment -

        Lars, you are right. I have been trying to induce a Full GC but without
        any success. (I can induce a full GC if I artificially hold some
        key-values in queue and force them to be tenured.)

        On 89-fb, my test-case is doing random increments on a space of slightly
        more than 40GB worth of Key-value data. The heap is set to 36GB. The LRU
        cache has a high and low watermark of .98 and .85 percents. The region
        server spawns 1000 threads that continuously do the increments. The
        eviction thread manages to keep the block-cache at about 85% always.
        Cache-on-write is turned on to induce more cache churn. All the 12 disks
        are close to 100% read pegged. GC takes 60% of the CPU (sum of user times
        in 1000 lines of gc log / (elapsed time * #cpus)). Compactions that get
        started never complete while the load is on.

        I guess I have to change the dynamics of the test case to induce GC pauses.

        On 2/22/12 11:35 PM, "Todd Lipcon (Commented) (JIRA)" <jira@apache.org>
        wrote:

        Show
        Prakash Khemani added a comment - Lars, you are right. I have been trying to induce a Full GC but without any success. (I can induce a full GC if I artificially hold some key-values in queue and force them to be tenured.) On 89-fb, my test-case is doing random increments on a space of slightly more than 40GB worth of Key-value data. The heap is set to 36GB. The LRU cache has a high and low watermark of .98 and .85 percents. The region server spawns 1000 threads that continuously do the increments. The eviction thread manages to keep the block-cache at about 85% always. Cache-on-write is turned on to induce more cache churn. All the 12 disks are close to 100% read pegged. GC takes 60% of the CPU (sum of user times in 1000 lines of gc log / (elapsed time * #cpus)). Compactions that get started never complete while the load is on. I guess I have to change the dynamics of the test case to induce GC pauses. On 2/22/12 11:35 PM, "Todd Lipcon (Commented) (JIRA)" <jira@apache.org> wrote:
        Hide
        Phabricator added a comment -

        mbautin has resigned from the revision "HBASE-5347 [jira] GC free memory management in Level-1 Block Cache".

        We are not actively pursuing this approach anymore due to non-reproducibility of GC issues.

        REVISION DETAIL
        https://reviews.facebook.net/D1635

        Show
        Phabricator added a comment - mbautin has resigned from the revision " HBASE-5347 [jira] GC free memory management in Level-1 Block Cache". We are not actively pursuing this approach anymore due to non-reproducibility of GC issues. REVISION DETAIL https://reviews.facebook.net/D1635
        Hide
        stack added a comment -

        We are not actively pursuing this approach anymore due to non-reproducibility of GC issues.

        That sounds interesting. Was it that there was no discernible difference seen in GC managing the allocations ourselves? Would love to hear more if there are lessons to be had Mikhail (and Prakash). Good on you lads.

        Show
        stack added a comment - We are not actively pursuing this approach anymore due to non-reproducibility of GC issues. That sounds interesting. Was it that there was no discernible difference seen in GC managing the allocations ourselves? Would love to hear more if there are lessons to be had Mikhail (and Prakash). Good on you lads.
        Hide
        Phabricator added a comment -

        dhruba has resigned from the revision "HBASE-5347 [jira] GC free memory management in Level-1 Block Cache".

        REVISION DETAIL
        https://reviews.facebook.net/D1635

        To: Kannan, nspiegelberg, stack, JIRA, khemani
        Cc: HBase Diffs Facebook Group, tedyu, khemani, stack, nspiegelberg, mbautin

        Show
        Phabricator added a comment - dhruba has resigned from the revision " HBASE-5347 [jira] GC free memory management in Level-1 Block Cache". REVISION DETAIL https://reviews.facebook.net/D1635 To: Kannan, nspiegelberg, stack, JIRA, khemani Cc: HBase Diffs Facebook Group, tedyu, khemani, stack, nspiegelberg, mbautin
        Hide
        Ian Varley added a comment -

        Prakash, should this issue be closed if it's not being pursued any further? Sounds like there was broad general interest, but didn't end up being a useful direction? I'm changing this to "Later" for now, but feel free to outright close if you think it's no longer a useful direction.

        Show
        Ian Varley added a comment - Prakash, should this issue be closed if it's not being pursued any further? Sounds like there was broad general interest, but didn't end up being a useful direction? I'm changing this to "Later" for now, but feel free to outright close if you think it's no longer a useful direction.

          People

          • Assignee:
            Prakash Khemani
            Reporter:
            Prakash Khemani
          • Votes:
            0 Vote for this issue
            Watchers:
            15 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development