HBASE-706

On OOME, regionserver sticks around and doesn't go down with cluster

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels:
      None

      Description

      On John Gray's cluster, an errant, massive store file caused an OOME. Shutting down the cluster left this regionserver in place. A thread dump also failed with OOME. Here is the last thing in the log:

      2008-06-25 03:21:55,111 INFO org.apache.hadoop.hbase.HRegionServer: worker thread exiting
      2008-06-25 03:24:26,923 FATAL org.apache.hadoop.hbase.HRegionServer: Set stop flag in regionserver/0:0:0:0:0:0:0:0:60020.cacheFlusher
      java.lang.OutOfMemoryError: Java heap space
              at java.util.HashMap.<init>(HashMap.java:226)
              at java.util.HashSet.<init>(HashSet.java:103)
              at org.apache.hadoop.hbase.HRegionServer.getRegionsToCheck(HRegionServer.java:1789)
              at org.apache.hadoop.hbase.HRegionServer$Flusher.enqueueOptionalFlushRegions(HRegionServer.java:479)
              at org.apache.hadoop.hbase.HRegionServer$Flusher.run(HRegionServer.java:385)
      2008-06-25 03:24:26,923 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 60020, call batchUpdate(items,,1214272763124, 9223372036854775807, org.apache.hadoop.hbase.io.BatchUpdate@67d6b1e2) from 192.168.249.230:38278: error: java.io.IOException: Server not running
      java.io.IOException: Server not running
              at org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java:1758)
              at org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:1547)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:616)
              at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
      

      If I get an OOME just trying to thread dump, that would seem to indicate we need to start keeping a small memory reservoir around for emergencies such as this, just so we can shut down cleanly.
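
A minimal sketch of the reservoir idea, written for illustration only. It is not the content of the attached patch and not HRegionServer code: the class names `MemoryReservoir` and `GuardedWorker`, the reserve size, and the stop flag are all assumptions. The idea is to allocate a block of heap up front and drop the reference when an OutOfMemoryError is caught, so the shutdown path has some room to allocate.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical helper: holds a block of heap in reserve so that an
 * OutOfMemoryError handler has room to run shutdown code.
 */
class MemoryReservoir {
    // Reserved heap; dropping this reference makes the space collectable.
    private volatile byte[] reserve;

    MemoryReservoir(int reserveBytes) {
        this.reserve = new byte[reserveBytes];
    }

    /** Releases the reserve. Returns true only for the call that released it. */
    synchronized boolean release() {
        if (reserve == null) {
            return false;
        }
        reserve = null;
        return true;
    }

    boolean isHeld() {
        return reserve != null;
    }
}

/** Hypothetical worker loop that stops itself when it runs out of heap. */
class GuardedWorker implements Runnable {
    private final MemoryReservoir reservoir;
    private final Runnable task;
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);

    GuardedWorker(MemoryReservoir reservoir, Runnable task) {
        this.reservoir = reservoir;
        this.task = task;
    }

    @Override
    public void run() {
        while (!stopRequested.get()) {
            try {
                task.run();
            } catch (OutOfMemoryError e) {
                // Free the reserve first so the shutdown steps can allocate.
                reservoir.release();
                // Ask the loop (and, in a real server, other threads) to stop.
                stopRequested.set(true);
            }
        }
    }

    boolean isStopRequested() {
        return stopRequested.get();
    }
}
```

Releasing the reserve does not guarantee a clean shutdown, since other threads may consume the freed heap before the handler finishes. The reserve size would need tuning against what the shutdown path actually allocates.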

      Moving this into 0.2. Seems important to fix if robustness is the name of the game.

      1. hbase-706-v1.patch
        4 kB
        Jean-Daniel Cryans
      2. loader.jsp
        3 kB
        Jean-Daniel Cryans

        Issue Links

          Activity

          Jim Kellerman made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          stack made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Jean-Daniel Cryans made changes -
          Attachment hbase-706-v1.patch [ 12385493 ]
          Jean-Daniel Cryans made changes -
          Attachment hbase-706-v1.patch [ 12385491 ]
          Jean-Daniel Cryans made changes -
          Attachment hbase-706-v1.patch [ 12385491 ]
          Jean-Daniel Cryans made changes -
          Attachment loader.jsp [ 12385489 ]
          Jean-Daniel Cryans made changes -
          Assignee Jean-Daniel Cryans [ jdcryans ]
          Jonathan Gray made changes -
          Field Original Value New Value
          Link This issue is related to HBASE-707 [ HBASE-707 ]
          stack created issue -

            People

            • Assignee:
              Jean-Daniel Cryans
              Reporter:
              stack
            • Votes:
              0
              Watchers:
              0
