HBase / HBASE-4633

Potential memory leak in client RPC timeout mechanism

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.90.3
    • Fix Version/s: None
    • Component/s: Client
    • Labels:
      None
    • Environment:

      HBase version: 0.90.3 + Patches, Hadoop version: CDH3u0

    Description

      Relevant Jiras: https://issues.apache.org/jira/browse/HBASE-2937,
      https://issues.apache.org/jira/browse/HBASE-4003

      We have been using the 'hbase.client.operation.timeout' knob
      introduced in HBASE-2937 for quite some time now. It helps us enforce
      our SLAs. We have two HBase clusters and two HBase client clusters;
      one of them is much busier than the other.

      We have seen a deterministic behavior from clients running in the busy
      cluster: their memory footprint increases consistently after they have
      been up for roughly 24 hours, to almost double its usual value (usual
      case == RPC timeout disabled). After much investigation nothing
      concrete came out, and we had to put in a hack that keeps the heap
      size in control even when the RPC timeout is enabled. Also note that
      the same behavior is not observed in the 'not so busy' cluster.

      The patch is here: https://gist.github.com/1288023
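
      For reference, a minimal sketch (not taken from the patch above) of how the
      timeout knob from HBASE-2937 is typically wired into a client-side
      Configuration; the 60000 ms value and the standalone class are purely
      illustrative:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;

        public class OperationTimeoutExample {
            public static void main(String[] args) {
                // Illustrative only: the property name comes from HBASE-2937;
                // the 60000 ms value is an arbitrary example.
                Configuration conf = HBaseConfiguration.create();
                conf.setInt("hbase.client.operation.timeout", 60000);
                // Any table/connection built from this conf then enforces the
                // client operation timeout discussed in this issue (API details
                // vary by HBase version).
            }
        }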


          Activity

          Asaf Mesika added a comment -

          When running our application, which uses the HBase client, with -Xmx6GB, we are witnessing the process resident memory increasing to 10GB. When setting -XX:MaxDirectMemorySize=2g, we're seeing a WARN with an OutOfMemoryError, but no failure since the client has 10 retries. Our next step is to run it with a profiler to diagnose why so much direct memory is used. We want to make sure it's the HBaseClient code and not some other third-party lib we're using (although we're pretty sure it's the only one calling ByteBuffer.allocateDirect transitively).
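
          As a cheaper first check than a full profiler, a small sketch like the following, using the standard java.lang.management BufferPoolMXBean API, reports the JVM's own accounting of direct buffer usage; the class name is invented for illustration:

            import java.lang.management.BufferPoolMXBean;
            import java.lang.management.ManagementFactory;

            public class DirectMemoryProbe {
                public static void main(String[] args) {
                    // Print the JVM's own accounting of the "direct" and "mapped"
                    // buffer pools; running this periodically inside the client
                    // process shows whether RSS growth is off-heap NIO buffer memory.
                    for (BufferPoolMXBean pool :
                            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                        System.out.printf("%s: count=%d used=%d bytes capacity=%d bytes%n",
                                pool.getName(), pool.getCount(),
                                pool.getMemoryUsed(), pool.getTotalCapacity());
                    }
                }
            }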

          stack added a comment -

          Asaf Mesika What is going on? You having mem issues? The 8k buffer size trick ain't cutting it for you?

          Asaf Mesika added a comment -

          Just adding the comments made in #4956 to this issue for whoever reads this issue first:
          In Jonathan's stack trace you can see that a BufferedOutputStream is used. This means the writes are made 8k at a time, so the ThreadLocal buffers suggested here should be no more than 8k and thus can't possibly be the source of the memory leak.

          Asaf Mesika added a comment -

          So in the end, this issue wasn't solved? We're being hit hard by this in production with 0.94.7.

          stack added a comment -

          For completeness, here is Jonathan Payne's note on this phenomenon: http://grokbase.com/t/hbase.apache.org/dev/2011/10/interesting-note-on-hbase-client-from-asynchbase-list-fwd-standard-hbase-client-asynchbase-client-netty-and-direct-memory-buffers/22e7lummyzqwfdgh35n6hxe7ubxm

          stack added a comment -

          The default -XX:MaxDirectMemorySize value is equal to the -Xmx value....

          According to this, http://stackoverflow.com/questions/3773775/default-for-xxmaxdirectmemorysize, the default max is 64M (which is still too big). Should we set this to 1M as our default?
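
          For illustration only, capping direct memory explicitly on a client JVM looks like the following; the heap size, the cap, the classpath, and the main class name are made-up examples, and the right cap depends on the workload:

            java -Xmx2g -XX:MaxDirectMemorySize=64m -cp hbase-client.jar:... MyHBaseClientApp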

          gaojinchao added a comment -

          I have tested this: memory does not increase when MaxDirectMemorySize is specified with a moderate value.

          In my cluster a full GC is triggered roughly every hour. Look at these logs:
          10022.210: [Full GC (System) 10022.210: [Tenured: 577566K->257349K(1048576K), 1.7515610 secs] 9651924K->257349K(14260672K), [Perm : 19161K->19161K(65536K)], 1.7518350 secs] [Times: user=1.75 sys=0.00, real=1.75 secs] .........

          .........
          13532.930: [GC 13532.931: [ParNew: 12801558K->981626K(13212096K), 0.1414370 secs] 13111752K->1291828K(14260672K), 0.1416880 secs] [Times: user=1.90 sys=0.01, real=0.14 secs]
          13624.630: [Full GC (System) 13624.630: [Tenured: 310202K->175378K(1048576K), 1.9529280 secs] 11581276K->175378K(14260672K), [Perm : 19225K->19225K(65536K)], 1.9531660 secs]
          [Times: user=1.94 sys=0.00, real=1.96 secs]

          I monitored the memory. It is stable.

          7543 root 20 0 16.9g 15g 9892 S 1 33.0 1258:59 java
          7543 root 20 0 16.9g 15g 9892 S 0 33.0 1258:59 java
          7543 root 20 0 16.9g 15g 9892 S 1 33.0 1258:59 java
          7543 root 20 0 16.9g 15g 9892 S 0 33.0 1258:59 java
          7543 root 20 0 16.9g 15g 9892 S 1 33.0 1258:59 java
          7543 root 20 0 16.9g 15g 9892 S 1 33.0 1259:00 java

          gaojinchao added a comment -

          This issue appears to be a configuration problem:
          1. The HBase client uses the NIO (socket) API, which uses direct memory.
          2. The default -XX:MaxDirectMemorySize value is equal to the -Xmx value, so if no "full gc" happens, the direct memory is never reclaimed. Unfortunately, the GC configuration parameters of our client don't produce any "full gc" (see the sketch below).
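
          A toy sketch of the mechanism described in point 2, with nothing HBase-specific assumed: the native memory behind a direct ByteBuffer is only released once the ByteBuffer object itself is collected, which for a long-lived client typically means a full GC.

            import java.nio.ByteBuffer;

            public class DirectBufferReclaimDemo {
                public static void main(String[] args) throws InterruptedException {
                    // Allocate direct buffers and immediately drop the references.
                    for (int i = 0; i < 64; i++) {
                        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024); // 1 MB off-heap
                        buf.put(new byte[1024 * 1024]);
                        // 'buf' becomes unreachable here, but the native memory is
                        // only freed once the ByteBuffer object is garbage collected
                        // (or the allocator forces a GC after hitting
                        // -XX:MaxDirectMemorySize).
                    }
                    System.gc();          // a full GC lets the Cleaner release the native memory
                    Thread.sleep(1000L);  // give reference processing a moment to run
                }
            }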

          gaojinchao added a comment -

          HBase version is 0.90.4 + patch.
          The cluster has 10 nodes.
          One HBase client process includes 50 threads, so the maximum number of threads connecting to the RegionServers is (50 * number of RegionServers).

          I have noticed some memory leak problems in my HBase client.
          RES has increased to 27g
          PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
          12676 root 20 0 30.8g 27g 5092 S 2 57.5 587:57.76 /opt/java/jre/bin/java -Djava.library.path=lib/.

          But I am not sure whether the leak comes from the HBase client jar itself or just our client code.

          These are some of the JVM parameters:
          -Xms15g -Xmn12g -Xmx15g -XX:PermSize=64m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=65 -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=1 -XX:+CMSParallelRemarkEnabled

          gaojinchao added a comment -

          I attached the stack picture.

          Shrijeet Paliwal added a comment -

          Recent updates:

          • In my case the leak/memory hold does not appear to be in the HBase client; I could not find enough evidence to conclude that it is. What I did find is that our application holds one heavy object in memory. This object is shared between threads. Every N minutes the application creates a new instance of this class. Unless a thread is still holding on to an old instance, all old instances are GCed in time. Hence, in theory, there should be only one active instance of the heavy object at any time.
          • Under heavy load with the client operation RPC timeout enabled, some threads get stuck. This leads to multiple instances of the heavy object, and in turn the heap grows.

          After reading the client code multiple times I cannot work out why an application thread would get stuck for several minutes. We have safeguards to clean up calls 'forcefully' if they have been alive for longer than the RPC timeout interval (a generic sketch of that idea follows below).

          I had planned to update the title of this Jira to reflect the above finding, but Gaojinchao observed something interesting at his end, so I am keeping the title the same for now. Gaojinchao's thread is here: http://search-hadoop.com/m/teczL8KvcH
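
          To make the 'forceful cleanup' safeguard concrete, here is a generic sketch of that idea; it is NOT the HBase client's actual implementation, and the Call class, field names, and reaper are invented for illustration: a scheduled task fails and drops any pending call older than the RPC timeout so no application thread can block on it indefinitely.

            import java.util.Iterator;
            import java.util.Map;
            import java.util.concurrent.CompletableFuture;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.Executors;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.TimeoutException;
            import java.util.concurrent.atomic.AtomicInteger;

            // Generic illustration of a "forceful cleanup" safeguard; not HBase code.
            public class CallReaper {
                static final class Call {
                    final long startMillis = System.currentTimeMillis();
                    final CompletableFuture<byte[]> result = new CompletableFuture<>();
                }

                private final ConcurrentHashMap<Integer, Call> pending = new ConcurrentHashMap<>();
                private final AtomicInteger ids = new AtomicInteger();
                private final long rpcTimeoutMillis;

                CallReaper(long rpcTimeoutMillis) {
                    this.rpcTimeoutMillis = rpcTimeoutMillis;
                    // Periodically fail anything pending longer than the timeout,
                    // releasing the application thread waiting on it.
                    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
                            this::reap, rpcTimeoutMillis, rpcTimeoutMillis, TimeUnit.MILLISECONDS);
                }

                Call register() {
                    Call call = new Call();
                    pending.put(ids.incrementAndGet(), call);
                    return call;
                }

                private void reap() {
                    long now = System.currentTimeMillis();
                    for (Iterator<Map.Entry<Integer, Call>> it = pending.entrySet().iterator(); it.hasNext();) {
                        Call call = it.next().getValue();
                        if (now - call.startMillis > rpcTimeoutMillis) {
                            call.result.completeExceptionally(new TimeoutException("RPC timed out"));
                            it.remove(); // drop the reference so the call (and what it holds) can be GCed
                        }
                    }
                }
            }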

          stack added a comment -

          Dang.

          Shrijeet Paliwal added a comment -

          @Stack
          No, we did not run with that flag. Also, we never got to a point where the application had to die because of OOM. The reasons (I guess) are:

          1. We have GC flags to do garbage collection as fast as possible.
          2. The monitoring in place starts sending out alerts, and we usually shoot the server in the head before it OOMs.
          3. The load balancer kicks in and stops sending work to the application server once it realizes it is in a bad state.

          As mentioned earlier, I have found it hard to reproduce in a dev environment, failing to simulate production-like load. But I must try again.

          stack added a comment -

          @Shrijeet I wonder, if you ran the client with:

          -XX:+HeapDumpOnOutOfMemoryError    (Dump heap to file when java.lang.OutOfMemoryError is thrown. Manageable. Introduced in 1.4.2 update 12, 5.0 update 7.)

          ... if the dump would reveal anything interesting. I don't mind digging in the heap if you get one (put it somewhere I can pull it).
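
          Purely as an illustration of running with that option enabled; the heap size, dump path, classpath, and main class below are made-up examples:

            java -Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hbase-client.hprof -cp ... MyHBaseClientApp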

          Shrijeet Paliwal added a comment -

          @Liyin,
          Are you using RPC timeouts for client operations?

          "But I am not sure whether the leak comes from the HBase client jar itself or just our client code."

          In the absence of concrete evidence that the leak is indeed in the HBase client jar, I have a similar feeling. It could be in our application layer.

          "Our symptom is that the memory footprint increases over time. But the actual heap size of the client is not increasing."

          We observe the used memory using a collectd plugin: http://collectd.org/wiki/index.php/Plugin:Memory

          "So I am very interested to know: when you kept the heap size in control, was the memory leak solved?"

          We run with the max and min memory set as -Xmx2{X}G -Xms{X}G. And when the 'leak' happens the plugin shows the used memory touching the 2X value, so it does seem the heap size is increasing. Correct me here if I am mistaken.

          Let me know if you need more inputs.

          Liyin Tang added a comment -

          I have also noticed some memory leak problems in the HBase client.
          Our symptom is that the memory footprint increases over time. But the actual heap size of the client is not increasing.
          The leak should come from non-heap memory.
          But I am not sure whether the leak comes from the HBase client jar itself or just our client code.
          So I am very interested to know: when you kept the heap size in control, was the memory leak solved?


            People

            • Assignee:
              Unassigned
              Reporter:
              Shrijeet Paliwal
            • Votes:
              0
              Watchers:
              9
