HBASE-16616: Rpc handlers stuck on ThreadLocalMap.expungeStaleEntry


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.2
    • Fix Version/s: 1.4.0, 2.0.0
    • Component/s: Performance
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      In our HBase 1.2.2 cluster, some region servers showed a very bad "QueueCallTime_99th_percentile",
      exceeding 10 seconds. Most RPC handler threads were stuck on the ThreadLocalMap.expungeStaleEntry
      call at that time.

      "PriorityRpcServer.handler=18,queue=0,port=16020" #322 daemon prio=5 os_prio=0 tid=0x00007fd422062800 nid=0x19b89 runnable [0x00007fcb8a821000]
         java.lang.Thread.State: RUNNABLE
              at java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:617)
              at java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:499)
              at java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:298)
              at java.lang.ThreadLocal.remove(ThreadLocal.java:222)
              at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(ReentrantReadWriteLock.java:426)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1341)
              at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(ReentrantReadWriteLock.java:881)
              at com.yammer.metrics.stats.ExponentiallyDecayingSample.unlockForRegularUsage(ExponentiallyDecayingSample.java:196)
              at com.yammer.metrics.stats.ExponentiallyDecayingSample.update(ExponentiallyDecayingSample.java:113)
              at com.yammer.metrics.stats.ExponentiallyDecayingSample.update(ExponentiallyDecayingSample.java:81)
              at org.apache.hadoop.metrics2.lib.MutableHistogram.add(MutableHistogram.java:81)
              at org.apache.hadoop.metrics2.lib.MutableRangeHistogram.add(MutableRangeHistogram.java:59)
              at org.apache.hadoop.hbase.ipc.MetricsHBaseServerSourceImpl.dequeuedCall(MetricsHBaseServerSourceImpl.java:194)
              at org.apache.hadoop.hbase.ipc.MetricsHBaseServer.dequeuedCall(MetricsHBaseServer.java:76)
              at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2192)
              at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
              at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
              at java.lang.Thread.run(Thread.java:745)
      

      We were using JDK 1.8.0_92, and here is the relevant snippet from ThreadLocal.java:

      616:    while (tab[h] != null)
      617:        h = nextIndex(h, len);
      

      This loop probes linearly, advancing until it reaches a null slot, so a long run of consecutive
      non-null entries in the tab array makes every expunge expensive. I therefore hypothesized that
      there were too many consecutive entries in the tab array, and I did indeed find them in the heap dump.

      Most of these entries pointed at instances of org.apache.hadoop.hbase.util.Counter$1,
      which corresponds to the indexHolderThreadLocal instance variable of the Counter class.

      Because the RpcServer$Connection class creates a Counter instance, rpcCount, for every connection,
      a RegionServer process can accumulate a large number of Counter#indexHolderThreadLocal instances
      when clients repeatedly connect and close. As a result, a ThreadLocalMap can end up with long runs
      of consecutive entries.
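
      Here is a minimal, self-contained sketch of that pattern (the Counter below is a stand-in with
      assumed names, not the actual HBase class): one ThreadLocal per connection-scoped Counter, so every
      connect-and-close cycle leaves another entry in the handler thread's ThreadLocalMap.

      import java.util.ArrayList;
      import java.util.List;

      public class ThreadLocalBuildupSketch {
          /** Stand-in for org.apache.hadoop.hbase.util.Counter. */
          static class Counter {
              // Stand-in for the indexHolderThreadLocal instance variable.
              private final ThreadLocal<long[]> indexHolder =
                      ThreadLocal.withInitial(() -> new long[1]);

              void increment() {
                  // The first call from a thread inserts an entry into that
                  // thread's ThreadLocalMap.
                  indexHolder.get()[0]++;
              }
          }

          public static void main(String[] args) {
              List<Counter> longLived = new ArrayList<>();
              // Simulate many connect-and-close cycles handled by one thread.
              for (int i = 0; i < 100_000; i++) {
                  Counter rpcCount = new Counter();  // per connection, like RpcServer$Connection
                  rpcCount.increment();
                  if (i % 100 == 0) {
                      // Some "connections" live long enough to survive young GC,
                      // keeping their ThreadLocals strongly reachable for a while.
                      longLived.add(rpcCount);
                  }
                  // The rest become garbage, but their ThreadLocalMap slots stay
                  // occupied until GC clears the weak keys and a later map
                  // operation expunges the stale run.
              }
              System.out.println("long-lived counters: " + longLived.size());
          }
      }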

      Usually, since each entry key is a WeakReference, these entries are cleared and removed
      by the garbage collector soon after the connection is closed.
      But if a connection lives long enough to survive young GC, its entry won't be collected until
      the old-generation collector runs. Furthermore, under a G1GC deployment, it may not be collected
      even by the old-gen collector (mixed GC) if the entry sits in a region that doesn't hold much garbage.
      We were in fact using G1GC when we encountered this problem.
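
      The following small demonstration (GC here is best-effort, so timing is not guaranteed) shows why
      the entries linger: the map key is only a weak reference, so a slot can become stale no earlier
      than the GC cycle that actually collects the ThreadLocal object.

      import java.lang.ref.WeakReference;

      public class WeakKeySketch {
          public static void main(String[] args) throws InterruptedException {
              ThreadLocal<String> tl = new ThreadLocal<>();
              tl.set("per-connection state");  // adds an entry to this thread's ThreadLocalMap
              WeakReference<ThreadLocal<String>> key = new WeakReference<>(tl);

              tl = null;          // drop the last strong reference, as if the connection closed
              System.gc();        // request a collection; a young GC may miss the object,
              Thread.sleep(100);  // and a G1 mixed GC may skip its region entirely

              // Once the referent is collected the slot is stale but still occupied;
              // it is reclaimed only when a later set/get/remove walks over it
              // in expungeStaleEntry.
              System.out.println("ThreadLocal collected: " + (key.get() == null));
          }
      }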

      We should remove the entry from the ThreadLocalMap by calling ThreadLocal#remove explicitly.
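
      A hedged sketch of that fix (the attached patches contain the real change; the destroy() name
      here is hypothetical): give the Counter an explicit cleanup hook that is called at connection
      close, so the calling thread's ThreadLocalMap entry is removed eagerly instead of lingering as
      a stale slot until GC plus expunge.

      public class CounterCleanupSketch {
          static class Counter {
              private final ThreadLocal<long[]> indexHolder =
                      ThreadLocal.withInitial(() -> new long[1]);

              void increment() {
                  indexHolder.get()[0]++;
              }

              // Hypothetical cleanup method; ThreadLocal#remove drops the entry
              // for the *calling* thread only, so each thread that used the
              // counter must invoke it.
              void destroy() {
                  indexHolder.remove();
              }
          }

          public static void main(String[] args) {
              Counter rpcCount = new Counter();  // created when a connection opens
              rpcCount.increment();              // used by a handler thread
              rpcCount.destroy();                // called when the connection closes
          }
      }

      Note that ThreadLocal#remove only clears the entry for the thread that calls it, so the cleanup
      has to run on each handler thread that touched the counter; any remaining entries are still left
      to the garbage collector.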

      Attachments

        1. ScreenShot 2016-09-09 14.17.53.png
          892 kB
          Tomu Tsuruhara
        2. HBASE-16616.master.001.patch
          2 kB
          Tomu Tsuruhara
        3. HBASE-16616.master.002.patch
          2 kB
          Tomu Tsuruhara
        4. 16616.branch-1.v2.txt
          2 kB
          Ted Yu


          People

            Assignee: Tomu Tsuruhara
            Reporter: Tomu Tsuruhara
            Votes: 1
            Watchers: 16

