HBASE-706: On OOME, regionserver sticks around and doesn't go down with cluster

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels: None

    Description

      On John Gray's cluster, an errant, massive store file caused an OOME on a regionserver. Shutting down the cluster left this regionserver running. An attempted thread dump also failed with an OOME. Here is the last thing in the log:

      2008-06-25 03:21:55,111 INFO org.apache.hadoop.hbase.HRegionServer: worker thread exiting
      2008-06-25 03:24:26,923 FATAL org.apache.hadoop.hbase.HRegionServer: Set stop flag in regionserver/0:0:0:0:0:0:0:0:60020.cacheFlusher
      java.lang.OutOfMemoryError: Java heap space
              at java.util.HashMap.<init>(HashMap.java:226)
              at java.util.HashSet.<init>(HashSet.java:103)
              at org.apache.hadoop.hbase.HRegionServer.getRegionsToCheck(HRegionServer.java:1789)
              at org.apache.hadoop.hbase.HRegionServer$Flusher.enqueueOptionalFlushRegions(HRegionServer.java:479)
              at org.apache.hadoop.hbase.HRegionServer$Flusher.run(HRegionServer.java:385)
      2008-06-25 03:24:26,923 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 60020, call batchUpdate(items,,1214272763124, 9223372036854775807, org.apache.hadoop.hbase.io.BatchUpdate@67d6b1e2) from 192.168.249.230:38278: error: java.io.IOException: Server not running
      java.io.IOException: Server not running
              at org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java:1758)
              at org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:1547)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:616)
              at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
      

      Getting an OOME just from trying to take a thread dump suggests we need to start keeping a small memory reservoir around for emergencies such as this, so that we can shut down cleanly.
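
The reservoir idea could look roughly like the following. This is a minimal sketch under stated assumptions, not the contents of the attached patch: `MemoryReservoir`, `runGuarded`, `release`, and `isStopRequested` are hypothetical names and not HBase APIs. It pre-allocates a block of heap at startup and drops the reference when a worker hits an OOME, so that shutdown code has some headroom to run.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper for illustration only; not part of HBase. */
final class MemoryReservoir {
    // Heap held back at startup; dropped on OOME so shutdown has room to work.
    private volatile byte[] reserve;
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);

    MemoryReservoir(int reserveBytes) {
        this.reserve = new byte[reserveBytes];
    }

    /** Drops the reserve so the GC can reclaim it. Returns true if it was still held. */
    synchronized boolean release() {
        boolean held = reserve != null;
        reserve = null;
        return held;
    }

    boolean isStopRequested() {
        return stopRequested.get();
    }

    /** Runs a task; on OOME, frees the reserve and requests a stop instead of dying silently. */
    void runGuarded(Runnable task) {
        try {
            task.run();
        } catch (OutOfMemoryError e) {
            release();               // free the emergency memory first
            stopRequested.set(true); // then signal the owner to shut down
        }
    }
}
```

A flusher-style thread might wrap each iteration of its loop in `runGuarded` and exit once `isStopRequested()` returns true. Freeing the reserve does not guarantee recovery, since other threads can consume the released memory before shutdown completes.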

      Moving this into 0.2. It seems important to fix if robustness is the name of the game.

      Attachments

        1. hbase-706-v1.patch
          4 kB
          Jean-Daniel Cryans
        2. loader.jsp
          3 kB
          Jean-Daniel Cryans

            People

              Assignee: Jean-Daniel Cryans (jdcryans)
              Reporter: Michael Stack (stack)