We upgraded our production cluster from 0.94.15 to 0.96.2 a few days ago and have since observed increased GC frequency and occasional full GCs (we never had a full GC before with G1 GC), which lead to the famous Juliet pause...
After digging into several HBase metrics, we found that the block cache uses much more memory in 0.96. This turns out to be due to the patch for HBASE-6312, which not only makes a few block cache parameters configurable but also changes their default values! Obviously, we needed to set these parameters back to their old values before considering reducing the block cache size or tuning our GC. However, we were surprised to find that setting them changed nothing on the regionserver side, and we still observed high block cache usage.
At the end of the day, it seems that CacheConfig.java initializes LruBlockCache with the default constructor, LruBlockCache(long maxSize, long blockSize), which under the hood always uses the default values, so the configured values are never read. We think this is a bug, and that CacheConfig.java should always use the other constructor: LruBlockCache(long maxSize, long blockSize, boolean evictionThread, Configuration conf).
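To illustrate the behaviour we believe we are seeing, here is a minimal, self-contained sketch. It does not use the real HBase classes: SketchLruCache, the Map standing in for Configuration, and the property key and numeric values are all hypothetical stand-ins chosen for illustration, so please check them against LruBlockCache in your own version.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for LruBlockCache; NOT the real HBase class.
class SketchLruCache {
    // Assumed property key and default value, for illustration only.
    static final String MIN_FACTOR_KEY = "hbase.lru.blockcache.min.factor";
    static final float DEFAULT_MIN_FACTOR = 0.95f;

    final long maxSize;
    final long blockSize;
    final boolean evictionThread;
    final float minFactor;

    // Two-argument constructor: it never sees the configuration,
    // so the built-in default is always used.
    SketchLruCache(long maxSize, long blockSize) {
        this.maxSize = maxSize;
        this.blockSize = blockSize;
        this.evictionThread = true;
        this.minFactor = DEFAULT_MIN_FACTOR;
    }

    // Configuration-aware constructor: reads the override if one is present,
    // otherwise falls back to the default.
    SketchLruCache(long maxSize, long blockSize, boolean evictionThread,
                   Map<String, String> conf) {
        this.maxSize = maxSize;
        this.blockSize = blockSize;
        this.evictionThread = evictionThread;
        String configured = conf.get(MIN_FACTOR_KEY);
        this.minFactor = configured != null
                ? Float.parseFloat(configured)
                : DEFAULT_MIN_FACTOR;
    }
}

class BlockCacheConstructorDemo {
    public static void main(String[] args) {
        // Site configuration that tries to restore an older, lower value.
        Map<String, String> conf = new HashMap<>();
        conf.put(SketchLruCache.MIN_FACTOR_KEY, "0.75");

        // What we think CacheConfig effectively does today: conf is ignored.
        SketchLruCache ignoresConf = new SketchLruCache(1L << 30, 64 * 1024);

        // What we propose: pass the configuration through.
        SketchLruCache honorsConf =
                new SketchLruCache(1L << 30, 64 * 1024, true, conf);

        System.out.println("two-arg constructor minFactor  = " + ignoresConf.minFactor);
        System.out.println("four-arg constructor minFactor = " + honorsConf.minFactor);
    }
}
```

In this sketch the override only takes effect through the four-argument constructor, which matches what we saw on the regionserver: changing the settings made no difference until the constructor call itself was changed.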
We made this change and tested it on one of our servers; it works, and the GC problem has disappeared. Of course, we still have to review our HBase and GC settings and find the best configuration for our application under 0.96. But first, we feel the constructor misuse in CacheConfig.java should be fixed.