HBase / HBASE-4769

Abort RegionServer Immediately on OOME

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.94.0
    • Fix Version/s: 0.92.0, 0.94.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently, when the HRegionServer runs out of memory, it calls the master. That call causes more heap allocations and throws a second out-of-memory error. The easiest and safest way to avoid this OOME storm is to abort the RegionServer immediately when it hits the memory boundary. Part of the 89-fb to trunk port.
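
      The idea in the description can be sketched roughly as follows. This is a minimal illustration, not the code from HBASE-4769.patch: the class `OomeAbortSketch`, the methods `isOome` and `abortOnOome`, and the injectable `halter` parameter are hypothetical names chosen for this example. The only real API assumed for the abort itself is `Runtime.getRuntime().halt(int)`, which stops the JVM without running shutdown hooks.

```java
import java.util.function.IntConsumer;

public class OomeAbortSketch {

    /** Returns true if the throwable or any of its causes is an OutOfMemoryError. */
    public static boolean isOome(Throwable t) {
        // Bound the walk so an unusual, cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /**
     * Hypothetical helper: if the throwable indicates an OOME, write one short
     * message and hand control to the halter right away, instead of attempting
     * further work (such as an RPC) that would allocate more heap.
     * Returns true if the halter was invoked.
     */
    public static boolean abortOnOome(Throwable t, IntConsumer halter) {
        if (!isOome(t)) {
            return false;
        }
        // A constant string keeps allocation to a minimum on this path.
        System.err.println("OutOfMemoryError detected, halting JVM");
        halter.accept(1);
        return true;
    }

    public static void main(String[] args) {
        // In a real server the halter would be the line below. It is left
        // unused here so that running the sketch does not kill the JVM.
        IntConsumer realHalt = status -> Runtime.getRuntime().halt(status);

        // Stand-in halter for demonstration: only reports what would happen.
        IntConsumer fakeHalt = status ->
            System.out.println("would halt with status " + status);

        Throwable simulated = new RuntimeException(new OutOfMemoryError("simulated"));
        abortOnOome(simulated, fakeHalt);
    }
}
```

      Passing the halting action in as a parameter is a design choice made only so the sketch can be exercised without terminating the process.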

      1. HBASE-4769.patch
        2 kB
        Nicolas Spiegelberg
      2. HBASE-4769.patch
        2 kB
        Nicolas Spiegelberg

        Activity

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12503296/HBASE-4769.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/228//console

        This message is automatically generated.

        stack stack added a comment -

        +1 on patch (Fix 'itseft' on commit).

        Outside of this patch, do we need to fix our 'abort'? It seems odd that we have 'abort' and then this more radical 'abort', here called 'forceAbort', where we call halt. (Maybe we should have a 'halt' method? This patch would use it, and I believe there is another halt call over in the distributed code?)

        nspiegelberg Nicolas Spiegelberg added a comment -

        It looks like the other locations just call Runtime.getRuntime().halt(1) directly. Maybe we should do the same? BTW: the patch was originally done by Liyin and reviewed by Kannan, so I'm not 100% sure what their reasoning was.

        stack stack added a comment -

        k

        Let's get it in for now. We can talk up a 'halt' method elsewhere.

        nspiegelberg Nicolas Spiegelberg added a comment -

        maybe we should put this in 0.92 as well?

        hudson Hudson added a comment -

        Integrated in HBase-TRUNK #2427 (See https://builds.apache.org/job/HBase-TRUNK/2427/)
        HBASE-4769 Abort RegionServer Immediately on OOME

        nspiegelberg :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        stack stack added a comment -

        Committed to 0.92 too.

        zeph Guido Serra aka Zeph added a comment -

        Guys... this is so stupid... I lost the whole morning because HBase's RegionServer was dying with no logs, no nothing... How am I supposed to debug the issue if you do not even generate a core dump or a log message? ... argh

        ram_krish ramkrishna.s.vasudevan added a comment -

        I think adding this option to the JVM args will give you a heap dump:
        -XX:+HeapDumpOnOutOfMemoryError
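
        One way to confirm that a running JVM was actually started with that flag is to inspect its input arguments. The sketch below is an illustration only: the class `HeapDumpFlagCheck` and the method `heapDumpOnOomeEnabled` are hypothetical names, and it assumes the flag was passed as a plain JVM argument. It does not see a value changed at runtime through management interfaces.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class HeapDumpFlagCheck {

    /**
     * Scans JVM arguments for the heap-dump-on-OOME switch.
     * If the flag appears more than once, the last occurrence decides the
     * result, mirroring how repeated -XX boolean flags are normally resolved.
     */
    public static boolean heapDumpOnOomeEnabled(List<String> jvmArgs) {
        boolean enabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+HeapDumpOnOutOfMemoryError")) {
                enabled = true;
            } else if (arg.equals("-XX:-HeapDumpOnOutOfMemoryError")) {
                enabled = false;
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        // Arguments the current JVM was launched with (excluding main's own args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Heap dump on OOME enabled: " + heapDumpOnOomeEnabled(jvmArgs));
    }
}
```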


          People

          • Assignee:
            nspiegelberg Nicolas Spiegelberg
            Reporter:
            nspiegelberg Nicolas Spiegelberg
          • Votes:
            0
            Watchers:
            3