  HBase / HBASE-706

On OOME, regionserver sticks around and doesn't go down with cluster


    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels:


      On John Gray's cluster, an errant, massive store file caused an OOME. Shutting down the cluster left this regionserver running. A thread dump also failed with an OOME. Here is the last thing in the log:

      2008-06-25 03:21:55,111 INFO org.apache.hadoop.hbase.HRegionServer: worker thread exiting
      2008-06-25 03:24:26,923 FATAL org.apache.hadoop.hbase.HRegionServer: Set stop flag in regionserver/0:0:0:0:0:0:0:0:60020.cacheFlusher
      java.lang.OutOfMemoryError: Java heap space
              at java.util.HashMap.<init>(HashMap.java:226)
              at java.util.HashSet.<init>(HashSet.java:103)
              at org.apache.hadoop.hbase.HRegionServer.getRegionsToCheck(HRegionServer.java:1789)
              at org.apache.hadoop.hbase.HRegionServer$Flusher.enqueueOptionalFlushRegions(HRegionServer.java:479)
              at org.apache.hadoop.hbase.HRegionServer$Flusher.run(HRegionServer.java:385)
      2008-06-25 03:24:26,923 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 60020, call batchUpdate(items,,1214272763124, 9223372036854775807, org.apache.hadoop.hbase.io.BatchUpdate@67d6b1e2) from error: java.io.IOException: Server not running
      java.io.IOException: Server not running
              at org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java:1758)
              at org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:1547)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:616)
              at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
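      The log shows a two-part pattern. A background thread (the cache flusher) hits the OOME and sets a stop flag. RPC handlers then reject new calls with "Server not running". Below is a minimal, hypothetical model of that pattern. `StopFlagServer` and its methods are illustrative names chosen to echo `checkOpen()` and `batchUpdate()` in the trace. They are not the real HRegionServer code.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of what the log above shows: a background
// worker that hits a fatal error sets a shared stop flag, and each RPC entry
// point checks that flag before doing any work.
class StopFlagServer {
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);

    /** Called by a background thread (e.g. a cache flusher) on a fatal error. */
    void requestStop(String who, Throwable cause) {
        // compareAndSet makes sure the message is logged only once.
        if (stopRequested.compareAndSet(false, true)) {
            System.err.println("Set stop flag in " + who + ": " + cause);
        }
    }

    boolean isStopRequested() {
        return stopRequested.get();
    }

    /** Guard at the top of every RPC method, analogous to checkOpen(). */
    void checkOpen() throws IOException {
        if (stopRequested.get()) {
            throw new IOException("Server not running");
        }
    }

    /** Stand-in for an RPC such as batchUpdate(). */
    String batchUpdate(String row) throws IOException {
        checkOpen();
        return "applied " + row;
    }
}
```

      Setting the flag only causes new requests to be refused. The process still lingers until some other thread notices the flag and runs the shutdown sequence. That shutdown work itself needs heap, which is exactly what is missing after an OOME.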

      If I get an OOME just trying to take a thread dump, that would seem to indicate we need to start keeping a small memory reservoir around for emergencies such as this, just so we can shut down cleanly.
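      The memory reservoir idea in the comment above can be sketched roughly as follows. The approach is to allocate a few blocks at startup and keep them reachable. When an `OutOfMemoryError` is caught, drop those blocks first so the shutdown path has some heap to allocate from. `MemoryReservoir`, `OomeGuard`, and the block count and size are hypothetical names and values chosen for illustration. They are not taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an "emergency memory reservoir". Block count and
// size are illustrative assumptions, not HBase configuration values.
final class MemoryReservoir {
    private final List<byte[]> blocks = new ArrayList<>();

    MemoryReservoir(int blockCount, int blockSizeBytes) {
        for (int i = 0; i < blockCount; i++) {
            blocks.add(new byte[blockSizeBytes]); // held only so it can be freed later
        }
    }

    /** Bytes currently held in reserve. */
    synchronized long reservedBytes() {
        long total = 0;
        for (byte[] b : blocks) {
            total += b.length;
        }
        return total;
    }

    /** Drop all references so the GC can reclaim the reserve. Safe to call twice. */
    synchronized void release() {
        blocks.clear();
    }
}

// Hypothetical wrapper for a worker's loop body: on OOME, free the reserve
// first, then run the shutdown action, which can now allocate (log, close
// files, report to the master).
final class OomeGuard {
    private final MemoryReservoir reservoir;
    private final Runnable shutdown;

    OomeGuard(MemoryReservoir reservoir, Runnable shutdown) {
        this.reservoir = reservoir;
        this.shutdown = shutdown;
    }

    void run(Runnable work) {
        try {
            work.run();
        } catch (OutOfMemoryError e) {
            reservoir.release(); // give heap back before doing anything else
            shutdown.run();      // e.g. set the stop flag and begin a clean exit
        }
    }
}
```

      This is best effort only. The JVM does not promise which thread will receive the `OutOfMemoryError`, so every long-running thread (flusher, compactor, IPC handlers, the main loop) would need the same guard around its work.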

      Moving this into 0.2. It seems important to fix if robustness is the name of the game.


        1. hbase-706-v1.patch (4 kB, Jean-Daniel Cryans)
        2. loader.jsp (3 kB, Jean-Daniel Cryans)




              • Assignee: Jean-Daniel Cryans (jdcryans)
              • Reporter: stack
              • Votes: 0
              • Watchers: 0


                • Created: