diff --git a/src/main/asciidoc/_chapters/architecture.adoc b/src/main/asciidoc/_chapters/architecture.adoc
index 1f4b77c73c..6b8249bec5 100644
--- a/src/main/asciidoc/_chapters/architecture.adoc
+++ b/src/main/asciidoc/_chapters/architecture.adoc
@@ -655,7 +655,7 @@ Since HBase 0.98.4, the Block Cache detail has been significantly extended showi
 ==== Cache Choices

-`LruBlockCache` is the original implementation, and is entirely within the Java heap. `BucketCache` is mainly intended for keeping block cache data off-heap, although `BucketCache` can also keep data on-heap and serve from a file-backed cache.
+`LruBlockCache` is the original implementation, and is entirely within the Java heap. `BucketCache` is mainly intended for keeping block cache data off-heap, although `BucketCache` can also be a file-backed cache.

 .BucketCache is production ready as of HBase 0.98.6
 [NOTE]
 ====
@@ -663,7 +663,7 @@ Since HBase 0.98.4, the Block Cache detail has been significantly extended showi
 To run with BucketCache, you need HBASE-11678. This was included in 0.98.6.
 ====

-
+Pre-2.0.0 HBase versions::
 Fetching will always be slower when fetching from BucketCache, as compared to the native on-heap LruBlockCache. However, latencies tend to be less erratic across time, because there is less garbage collection when you use BucketCache since it is managing BlockCache allocations, not the GC. If the BucketCache is deployed in off-heap mode, this memory is not managed by the GC at all.

 Also see link:https://people.apache.org/~stack/bc/[Comparing BlockCache Deploys]
@@ -673,7 +673,22 @@ Also see link:https://people.apache.org/~stack/bc/[Comparing BlockCache Deploys]

 When you enable BucketCache, you are enabling a two tier caching system, an L1 cache which is implemented by an instance of LruBlockCache and an off-heap L2 cache which is implemented by BucketCache.
 Management of these two tiers and the policy that dictates how blocks move between them is done by `CombinedBlockCache`.
-It keeps all DATA blocks in the L2 BucketCache and meta blocks -- INDEX and BLOOM blocks -- on-heap in the L1 `LruBlockCache`.
+By default, it keeps all DATA blocks in the L2 BucketCache and meta blocks -- INDEX and BLOOM blocks -- on-heap in the L1 `LruBlockCache`.
+But one can also configure BucketCache as a victim L2 cache of the LruBlockCache: all DATA and INDEX blocks are cached in L1 first, and when eviction happens from L1, the blocks are moved to L2.
+
+Post-2.0.0 HBase versions::
+HBASE-11425 changed the HBase read path so that it can work end-to-end with off-heap memory, with no need to copy cached data onto the heap.
+This reduces GC pauses to a great extent and makes off-heap BucketCache performance similar to, or better than, that of the on-heap LruBlockCache.
+This feature is available from HBase 2.0.0. We recommend switching to the off-heap BucketCache, as it lets the RegionServer run with a much smaller heap.
+If the BucketCache is in file mode, fetching will always be slower compared to the native on-heap LruBlockCache.
+Refer to the blogs below for more details and test results on the off-heap read path:
+https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in
+https://blogs.apache.org/hbase/entry/offheap-read-path-in-production
+
+When you enable BucketCache, you are enabling a two tier caching system: an on-heap LruBlockCache and an off-heap cache implemented by BucketCache.
+Management of these two tiers, and the policy that dictates which blocks are cached by which tier, is done by `CombinedBlockCache`.
+It keeps all DATA blocks in the BucketCache and meta blocks -- INDEX and BLOOM blocks -- on-heap in the `LruBlockCache`.
+Please note that we have removed the notion of L1 and L2 here. There is no movement of blocks from LruBlockCache to BucketCache or vice versa.
 See <> for more detail on going off-heap.

 [[cache.configurations]]
@@ -729,13 +744,13 @@ The way to calculate how much memory is available in HBase for caching is:
 number of region servers * heap size * hfile.block.cache.size * 0.99
 ----

-The default value for the block cache is 0.25 which represents 25% of the available heap.
+The default value for the block cache is 0.4, which represents 40% of the available heap.
 The last value (99%) is the default acceptable loading factor in the LRU cache after which eviction is started.
 The reason it is included in this equation is that it would be unrealistic to say that it is possible to use 100% of the available memory since this would make the process blocking from the point where it loads new blocks.
 Here are some examples:

-* One region server with the heap size set to 1 GB and the default block cache size will have 253 MB of block cache available.
-* 20 region servers with the heap size set to 8 GB and a default block cache size will have 39.6 of block cache.
+* One region server with the heap size set to 1 GB and the default block cache size will have 405 MB of block cache available.
+* 20 region servers with the heap size set to 8 GB and a default block cache size will have 63.3 GB of block cache.
 * 100 region servers with the heap size set to 24 GB and a block cache size of 0.5 will have about 1.16 TB of block cache.

 Your data is not the only resident of the block cache.
@@ -789,27 +804,27 @@ Since link:https://issues.apache.org/jira/browse/HBASE-4683[HBASE-4683 Always ca
 [[enable.bucketcache]]
 ===== How to Enable BucketCache

-The usual deploy of BucketCache is via a managing class that sets up two caching tiers: an L1 on-heap cache implemented by LruBlockCache and a second L2 cache implemented with BucketCache.
+The usual deploy of BucketCache is via a managing class that sets up two caching tiers: an on-heap cache implemented by LruBlockCache and a second cache implemented with BucketCache.
 The managing class is link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.html[CombinedBlockCache] by default.
 The previous link describes the caching 'policy' implemented by CombinedBlockCache.
-In short, it works by keeping meta blocks -- INDEX and BLOOM in the L1, on-heap LruBlockCache tier -- and DATA blocks are kept in the L2, BucketCache tier.
-It is possible to amend this behavior in HBase since version 1.0 and ask that a column family have both its meta and DATA blocks hosted on-heap in the L1 tier by setting `cacheDataInL1` via `(HColumnDescriptor.setCacheDataInL1(true)` or in the shell, creating or amending column families setting `CACHE_DATA_IN_L1` to true: e.g.
+In short, it works by keeping meta blocks -- INDEX and BLOOM in the on-heap LruBlockCache tier -- and DATA blocks are kept in the BucketCache tier.
+It is possible to amend this behavior in HBase version 1.x and ask that a column family have both its meta and DATA blocks hosted on-heap in the LruBlockCache by setting `cacheDataInL1` via `HColumnDescriptor.setCacheDataInL1(true)` or, in the shell, by creating or amending column families and setting `CACHE_DATA_IN_L1` to true, e.g.:
 [source]
 ----
 hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}}
 ----
-
-The BucketCache Block Cache can be deployed on-heap, off-heap, or file based.
+From HBase 2.0.0 onwards, the concept of L1 and L2 has been removed. When BucketCache is turned on, the DATA blocks always go to BucketCache, and the INDEX and BLOOM blocks go to the on-heap LruBlockCache. `cacheDataInL1` support has also been removed.
+The BucketCache Block Cache can be deployed in off-heap, file, or mmapped file mode.
 You set which via the `hbase.bucketcache.ioengine` setting.
-Setting it to `heap` will have BucketCache deployed inside the allocated Java heap.
-Setting it to `offheap` will have BucketCache make its allocations off-heap, and an ioengine setting of `file:PATH_TO_FILE` will direct BucketCache to use a file caching (Useful in particular if you have some fast I/O attached to the box such as SSDs).
+Setting it to `offheap` will have BucketCache make its allocations off-heap, and an ioengine setting of `file:PATH_TO_FILE` will direct BucketCache to use file caching (useful in particular if you have some fast I/O attached to the box, such as SSDs). From 2.0.0, it is possible to have more than one file backing the BucketCache. This is very useful when the cache size requirement is high. To use multiple files, configure the ioengine as `files:PATH_TO_FILE1,PATH_TO_FILE2,PATH_TO_FILE3`. BucketCache can also be configured to use an mmapped file; configure the ioengine as `mmap:PATH_TO_FILE` for this.

 It is possible to deploy an L1+L2 setup where we bypass the CombinedBlockCache policy and have BucketCache working as a strict L2 cache to the L1 LruBlockCache.
-For such a setup, set `CacheConfig.BUCKET_CACHE_COMBINED_KEY` to `false`.
+For such a setup, set `hbase.bucketcache.combinedcache.enabled` to `false`.
 In this mode, on eviction from L1, blocks go to L2.
 When a block is cached, it is cached first in L1.
 When we go to look for a cached block, we look first in L1 and if none found, then search L2.
 Let us call this deploy format, _Raw L1+L2_.

+NOTE: This L1+L2 mode is removed from 2.0.0. When BucketCache is used, it is strictly the DATA block cache, and the LruBlockCache acts as the INDEX/meta block cache.

 Other BucketCache configs include: specifying a location to persist cache to across restarts, how many threads to use writing the cache, etc. See the link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html[CacheConfig.html] class for configuration options and descriptions.
@@ -876,9 +891,10 @@ The following example configures buckets of size 4096 and 8192.
 [NOTE]
 ====
 The default maximum direct memory varies by JVM.
-Traditionally it is 64M or some relation to allocated heap size (-Xmx) or no limit at all (JDK7 apparently). HBase servers use direct memory, in particular short-circuit reading, the hosted DFSClient will allocate direct memory buffers.
+Traditionally it is 64M or some relation to allocated heap size (-Xmx) or no limit at all (JDK7 apparently). HBase servers use direct memory, in particular for short-circuit reading: the hosted DFSClient will allocate direct memory buffers. How much the DFSClient uses is not easy to quantify; it is the number of open HFiles * `hbase.dfs.client.read.shortcircuit.buffer.size` where `hbase.dfs.client.read.shortcircuit.buffer.size` is set to 128k in HBase -- see _hbase-default.xml_ default configurations.
 If you do off-heap block caching, you'll be making use of direct memory.
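+
+For illustration only, a minimal _hbase-site.xml_ sketch that puts a 4 GB BucketCache off-heap could look as follows; the sizes here are made-up examples, and `-XX:MaxDirectMemorySize` in _conf/hbase-env.sh_ must then be set comfortably above them, as described next:
+
+[source,xml]
+----
+<!-- Illustrative values only; size the cache for your own hardware and workload. -->
+<property>
+  <name>hbase.bucketcache.ioengine</name>
+  <value>offheap</value>
+</property>
+<property>
+  <name>hbase.bucketcache.size</name>
+  <value>4096</value>
+</property>
+----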
-Starting your JVM, make sure the `-XX:MaxDirectMemorySize` setting in _conf/hbase-env.sh_ is set to some value that is higher than what you have allocated to your off-heap BlockCache (`hbase.bucketcache.size`). It should be larger than your off-heap block cache and then some for DFSClient usage (How much the DFSClient uses is not easy to quantify; it is the number of open HFiles * `hbase.dfs.client.read.shortcircuit.buffer.size` where `hbase.dfs.client.read.shortcircuit.buffer.size` is set to 128k in HBase -- see _hbase-default.xml_ default configurations). Direct memory, which is part of the Java process heap, is separate from the object heap allocated by -Xmx.
+The RPCServer uses a ByteBuffer pool. From 2.0.0, these buffers are off-heap ByteBuffers.
+Starting your JVM, make sure the `-XX:MaxDirectMemorySize` setting in _conf/hbase-env.sh_ accounts for the off-heap BlockCache (`hbase.bucketcache.size`), DFSClient usage, and the RPC-side ByteBufferPool maximum size. It has to be a bit higher than the sum of the off-heap BlockCache size and the maximum ByteBufferPool size. Allocating an extra 1 - 2 GB for the maximum direct memory size has worked in tests. Direct memory, which is part of the Java process heap, is separate from the object heap allocated by -Xmx.
 The value allocated by `MaxDirectMemorySize` must not exceed physical RAM, and is likely to be less than the total available RAM due to other memory requirements and system constraints.

 You can see how much memory -- on-heap and off-heap/direct -- a RegionServer is configured to use and how much it is using at any one time by looking at the _Server Metrics: Memory_ tab in the UI.
@@ -898,9 +914,25 @@ If the deploy was using CombinedBlockCache, then the LruBlockCache L1 size was c
 where size-of-bucket-cache itself is EITHER the value of the configuration `hbase.bucketcache.size` IF it was specified as Megabytes OR `hbase.bucketcache.size` * `-XX:MaxDirectMemorySize` if `hbase.bucketcache.size` is between 0 and 1.0.
 In 1.0, it should be more straight-forward.
-L1 LruBlockCache size is set as a fraction of java heap using `hfile.block.cache.size setting` (not the best name) and L2 is set as above either in absolute Megabytes or as a fraction of allocated maximum direct memory.
+The on-heap LruBlockCache size is set as a fraction of the Java heap using the `hfile.block.cache.size` setting (not the best name) and BucketCache is set as above, in absolute Megabytes.
 ====

+[[RPCServer.ByteBufferPool]]
+==== RPCServer ByteBufferPool
+The buffers from this pool are used to accumulate the cell bytes and create a result cell block to send back to the client side.
+`hbase.ipc.server.reservoir.enabled` can be used to turn this pool ON or OFF. By default this pool is ON and available. HBase will create off-heap ByteBuffers and pool them. Please make sure not to turn this OFF if you want end-to-end off-heaping in the read path.
+If this pool is turned off, the server will create temporary buffers on heap to accumulate the cell bytes and make a result cell block. This can impact the GC on a server with a heavy read load.
+The user can tune this pool with respect to how many buffers are in the pool and what the size of each ByteBuffer should be.
+Use the config `hbase.ipc.server.reservoir.initial.buffer.size` to tune the size of each buffer. The default is 64 KB.
+
+When the read pattern is a random row read and each of the rows is smaller than this 64 KB, try reducing this.
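+
+As a sketch only (the values below are hypothetical and workload dependent), such tuning goes in _hbase-site.xml_; the maximum buffer count setting is described just below:
+
+[source,xml]
+----
+<!-- Hypothetical example: smaller pooled buffers for small random-row results. -->
+<property>
+  <name>hbase.ipc.server.reservoir.initial.buffer.size</name>
+  <value>16384</value>
+</property>
+<!-- See below: maximum number of pooled ByteBuffers. -->
+<property>
+  <name>hbase.ipc.server.reservoir.initial.max</name>
+  <value>4096</value>
+</property>
+----
+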
+When the result size is larger than one ByteBuffer size, the server will try to grab more than one buffer and make a result cell block out of these. When the pool runs out of buffers, the server will end up creating temporary on-heap buffers.
+
+The maximum number of ByteBuffers in the pool can be tuned using the config `hbase.ipc.server.reservoir.initial.max`. Its value defaults to 64 * the number of region server handlers configured (see the config `hbase.regionserver.handler.count`). The math is as follows: by default we consider 2 MB as the result cell block size per read result, and each handler will be handling one read. For 2 MB we need 32 buffers, each of size 64 KB, so 32 ByteBuffers (BBs) per handler. We allocate twice this count as the maximum number of BBs, so that one handler can be creating a response and handing it to the RPC Responder thread while it handles a new request and creates a new response cell block (using pooled buffers). Even if the Responder could not send back the first TCP reply immediately, there should still be enough buffers in the pool without having to make temporary buffers on the heap. Again, for smaller random row reads, tune this max count down. These buffers are created lazily, and the count is the maximum number to be pooled.
+If you still see GC issues even after making the read path off-heap end-to-end, look for issues in the appropriate buffer pool. Check the RegionServer log for the following INFO-level message:
+
+[source]
+----
+Pool already reached its max capacity : XXX and no free buffers now. Consider increasing the value for 'hbase.ipc.server.reservoir.initial.max' ?
+----

 ==== Compressed BlockCache

 link:https://issues.apache.org/jira/browse/HBASE-11331[HBASE-11331] introduced lazy BlockCache decompression, more simply referred to as compressed BlockCache.
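+
+As a minimal sketch, lazy decompression is switched on with the boolean `hbase.block.data.cachecompressed` property in _hbase-site.xml_; whether it helps depends on your workload, as discussed in the rest of this section:
+
+[source,xml]
+----
+<!-- Sketch: cache DATA blocks in their on-disk (compressed/encrypted) form. -->
+<property>
+  <name>hbase.block.data.cachecompressed</name>
+  <value>true</value>
+</property>
+----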