HBASE-9737: Corrupt HFile cause resource leak leading to Region Server OOM

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.94.12
    • Fix Version/s: 0.98.0, 0.94.13, 0.96.1
    • Component/s: HFile
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      One of our customers was recently hit with OOM errors on almost all of their region servers.

      A postmortem of the issue revealed that a corrupt HFile had made its way into one of the regions, which resulted in the region being brought offline immediately, as per design.

      What happened next reveals two issues:

      • As soon as the region was offlined, the Master noticed and tried to assign the region to another region server, which of course failed (again due to the corrupt HFile). The Master then tried to assign it to yet another server, and so on. The region kept bouncing from one server to another, unnoticed, for a few hours, and all region server logs were filled with thousands of copies of this message:
        org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of
        region=userdata,50743646010,1378139055806.318c533716869574f10615703269497f.,
        starting to roll back the global memstore size.
        java.io.IOException: java.io.IOException:
        org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile
        Trailer from file
        /hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
                at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:550)
                at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:463)
                at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3835)
                at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3783)
                at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
                at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
                at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                at java.lang.Thread.run(Thread.java:662)
        Caused by: java.io.IOException:
        org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile
        Trailer from file
        /hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
                at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:404)
                at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:257)
                at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:3017)
                at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:525)
                at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:523)
                at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
                at java.util.concurrent.FutureTask.run(FutureTask.java:138)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
                at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
                at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        

        For situations like this, the region should be marked "offlined_with_error" or something similar so that the Master does not try to assign it to another server until the user fixes the issue. I will create a separate JIRA for that.

      • The second problem, and the scope of this JIRA, is that the function org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion() throws an exception without closing the FSDataInputStream objects, even if closeIStream is set to true. This led to orphaned filesystem streams accumulating in the region server, which eventually died of OOM.
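The accumulation described in the second bullet can be illustrated with a minimal Java sketch. It does not use the HBase or Hadoop APIs: `Handle`, `checkVersion`, `openLeaky`, `openSafe` and `leakedAfterRetries` are hypothetical names, and the version check is a stand-in for reading the HFile trailer. The only point is that a failed open which does not close its stream leaves one open handle behind per retry, while closing on failure leaves none.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class StreamLeakSketch {
    /** Number of handles opened and not yet closed. */
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Hypothetical stand-in for an open filesystem stream. */
    static final class Handle implements Closeable {
        private boolean closed;
        Handle() { OPEN.incrementAndGet(); }
        @Override public void close() {
            if (!closed) { closed = true; OPEN.decrementAndGet(); }
        }
    }

    /** Pretend trailer check (assumption): only versions 1 and 2 are accepted. */
    static void checkVersion(int version) throws IOException {
        if (version < 1 || version > 2) {
            throw new IOException("Problem reading trailer, version=" + version);
        }
    }

    /** Leaky shape: the exception escapes and nobody closes the handle. */
    static Handle openLeaky(int version) throws IOException {
        Handle h = new Handle();
        checkVersion(version);
        return h;
    }

    /** Safe shape: close on failure when the callee owns the stream. */
    static Handle openSafe(int version, boolean closeIStream) throws IOException {
        Handle h = new Handle();
        try {
            checkVersion(version);
            return h; // success: the caller now owns the handle
        } catch (IOException e) {
            if (closeIStream) {
                h.close();
            }
            throw e;
        }
    }

    /** Simulates repeated failed region opens; returns handles left open. */
    static int leakedAfterRetries(int attempts, boolean safe) {
        int before = OPEN.get();
        for (int i = 0; i < attempts; i++) {
            try {
                Handle h = safe ? openSafe(99, true) : openLeaky(99);
                h.close(); // not reached: version 99 always fails
            } catch (IOException expected) {
                // the caller never received a handle, so it cannot close one
            }
        }
        return OPEN.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakedAfterRetries(1000, false));
        System.out.println("safe:  " + leakedAfterRetries(1000, true));
    }
}
```

In this sketch every failed attempt through `openLeaky` leaves one handle open, which mirrors how a region bouncing between servers could accumulate orphaned streams.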
      1. 9737.096.txt
        5 kB
        stack
      2. HBASE-9737_0.94.patch
        2 kB
        Aditya Kishore
      3. HBASE-9737_0.94.patch
        3 kB
        Aditya Kishore
      4. HBASE-9737_0.94.patch
        3 kB
        Aditya Kishore
      5. HBASE-9737.patch
        2 kB
        Aditya Kishore
      6. HBASE-9737.patch
        2 kB
        Aditya Kishore

          Activity

          adityakishore Aditya Kishore added a comment -

          Patch for the 0.94 branch. The code in trunk has changed, so the same patch will not apply. I will attach a trunk patch shortly.

          yuzhihong@gmail.com Ted Yu added a comment -

          Patch looks good overall.

          +      if (errorWhileOpening && closeIStream) {
          +        try {
          +          if (fsdis != fsdisNoFsChecksum && fsdisNoFsChecksum != null) {
          +            fsdisNoFsChecksum.close();
          +            fsdisNoFsChecksum = null;
          +          }
          +          if (fsdis != null) {
          

          Do we need two try/catch blocks above, in case fsdisNoFsChecksum.close() throws an exception?
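The concern in this question can be sketched in Java as follows. This is an illustration under stated assumptions, not the committed patch: `closeQuietly` and `closeBoth` are hypothetical helpers, and plain `java.io.Closeable` stands in for `FSDataInputStream`. Each close gets its own try/catch, so an exception from closing the first stream does not prevent the second from being closed. The identity check mirrors the quoted `fsdis != fsdisNoFsChecksum` condition, so a shared stream is closed only once.

```java
import java.io.Closeable;
import java.io.IOException;

class CloseBothSketch {
    /** Closes one stream in its own try/catch; returns true if close succeeded. */
    static boolean closeQuietly(Closeable c) {
        if (c == null) {
            return true; // nothing to close
        }
        try {
            c.close();
            return true;
        } catch (IOException e) {
            return false; // swallowed so that later closes still run
        }
    }

    /** Closes both streams independently; a shared stream is closed once. */
    static void closeBoth(Closeable fsdis, Closeable fsdisNoFsChecksum) {
        if (fsdisNoFsChecksum != null && fsdisNoFsChecksum != fsdis) {
            closeQuietly(fsdisNoFsChecksum);
        }
        closeQuietly(fsdis);
    }

    public static void main(String[] args) {
        boolean[] closed = {false};
        Closeable failing = () -> { throw new IOException("close failed"); };
        Closeable healthy = () -> closed[0] = true;
        closeBoth(healthy, failing);
        System.out.println("second stream closed: " + closed[0]);
    }
}
```

With a single shared try block, the failure in the first `close()` would skip the second one and leave that stream open.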

          adityakishore Aditya Kishore added a comment -

          Thanks for reviewing, Ted. Yes, agreed. Updating the patch.

          adityakishore Aditya Kishore added a comment -

          Submitting patch for trunk to Hadoop QA.

          eclark Elliott Clark added a comment -

          Why not just move the close (wrapped in its own try-catch) into the catch clause?

          adityakishore Aditya Kishore added a comment -

          That makes much more sense, thanks Elliott. Will update the patches.

          lhofhansl Lars Hofhansl added a comment -

          Good catch! I agree this is critical.
          We do not need the finally, though, as it only acts when we caught an exception.
          So instead of

          ... 
          } catch (Throwable t) {
             error = true;
             throw ...
          } finally {
            if (error)  {
              close();
            }
          }
          

          We can write

          } catch (Throwable t) {
             close();
             throw ...
          }
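The two shapes compared above can be written out as a small runnable Java sketch to show that they behave the same. All names here (`Res`, `validate`, `openWithFlag`, `openWithCatch`, `closedAfter`) are hypothetical and only illustrate the pattern; this is not the HBase code. In both forms the resource is closed only when opening fails and is left open on success.

```java
import java.io.IOException;

class CloseOnErrorSketch {
    /** Hypothetical resource that records whether close() was called. */
    static final class Res {
        boolean closed;
        void close() { closed = true; }
    }

    /** Stand-in for the work that may fail while opening. */
    static void validate(boolean ok) throws IOException {
        if (!ok) {
            throw new IOException("Problem reading trailer");
        }
    }

    /** Form 1: set a flag in catch, act on it in finally. */
    static Res openWithFlag(Res r, boolean ok) throws IOException {
        boolean error = false;
        try {
            validate(ok);
            return r;
        } catch (Throwable t) {
            error = true;
            throw t; // precise rethrow: only IOException or unchecked
        } finally {
            if (error) {
                r.close();
            }
        }
    }

    /** Form 2: close directly in catch, then rethrow. */
    static Res openWithCatch(Res r, boolean ok) throws IOException {
        try {
            validate(ok);
            return r;
        } catch (Throwable t) {
            r.close();
            throw t;
        }
    }

    /** Runs one form and reports whether the resource ended up closed. */
    static boolean closedAfter(boolean useFlag, boolean ok) {
        Res r = new Res();
        try {
            if (useFlag) {
                openWithFlag(r, ok);
            } else {
                openWithCatch(r, ok);
            }
        } catch (IOException expected) {
            // failure path: the resource should already be closed
        }
        return r.closed;
    }

    public static void main(String[] args) {
        System.out.println("flag form, failure:  " + closedAfter(true, false));
        System.out.println("catch form, failure: " + closedAfter(false, false));
        System.out.println("flag form, success:  " + closedAfter(true, true));
        System.out.println("catch form, success: " + closedAfter(false, true));
    }
}
```

Since the finally block only does work when the flag was set in the catch clause, the second form has the same behavior with less state.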
          
          lhofhansl Lars Hofhansl added a comment -

          Oops. Didn't see that Elliott suggested the same thing.

          adityakishore Aditya Kishore added a comment -

          Addressing Elliott's comment.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12607688/HBASE-9737_0.94.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/7509//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12607677/HBASE-9737.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          -1 site. The patch appears to cause mvn site goal to fail.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/7507//console

          This message is automatically generated.

          adityakishore Aditya Kishore added a comment -

          Lars Hofhansl, would you like to pull this into 0.94.13?

          lhofhansl Lars Hofhansl added a comment -

          Yes please.

          lhofhansl Lars Hofhansl added a comment -

          The behavior is slightly different now, right?
          Now if there's an invalid HFile version we wrap an IllegalArgumentException inside a CorruptHFileException. If that is of no concern to anybody, +1
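The behavior change mentioned here can be sketched in Java under some assumptions. `CorruptFileException` below is a hypothetical stand-in, not the HBase class, and `checkVersion`, `readVersion` and `describeFailure` are illustrative names. The sketch shows an `IllegalArgumentException` from a version check being wrapped as the cause of an `IOException` subclass, and how a caller that cares about the original type can still reach it through `getCause()`.

```java
import java.io.IOException;

class WrapSketch {
    /** Hypothetical stand-in for a corrupt-file exception; not the HBase class. */
    static final class CorruptFileException extends IOException {
        CorruptFileException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Pretend version check (assumption): only versions 1 and 2 are valid. */
    static int checkVersion(int version) {
        if (version < 1 || version > 2) {
            throw new IllegalArgumentException("Invalid version: " + version);
        }
        return version;
    }

    /** Wraps any failure, including unchecked ones, in CorruptFileException. */
    static int readVersion(String path, int version) throws CorruptFileException {
        try {
            return checkVersion(version);
        } catch (Throwable t) {
            throw new CorruptFileException("Problem reading trailer from file " + path, t);
        }
    }

    /** Describes what a caller observes for a given version. */
    static String describeFailure(String path, int version) {
        try {
            readVersion(path, version);
            return "ok";
        } catch (IOException e) {
            Throwable cause = e.getCause();
            String causeName = cause == null ? "nothing" : cause.getClass().getSimpleName();
            return e.getClass().getSimpleName() + " caused by " + causeName;
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFailure("some-file", 99));
        System.out.println(describeFailure("some-file", 2));
    }
}
```

Callers that only catch `IOException` see no difference; only code that previously caught `IllegalArgumentException` directly would need to inspect the cause instead.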

          adityakishore Aditya Kishore added a comment -

          I looked at the callers of the function a few levels deep, and none of them seem to look at the type of exception thrown.

          stack stack added a comment -

          This is the 0.96 patch I applied.

          stack stack added a comment -

          Committed to 0.94 (hope that is what you wanted, Lars), 0.96 and trunk. Thanks, Aditya.

          hudson Hudson added a comment -

          FAILURE: Integrated in HBase-TRUNK #4638 (See https://builds.apache.org/job/HBase-TRUNK/4638/)
          HBASE-9737 Corrupt HFile cause resource leak leading to Region Server OOM (stack: rev 1534850)

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
          hudson Hudson added a comment -

          FAILURE: Integrated in hbase-0.96-hadoop2 #97 (See https://builds.apache.org/job/hbase-0.96-hadoop2/97/)
          HBASE-9737 Corrupt HFile cause resource leak leading to Region Server OOM (stack: rev 1534854)

          • /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in HBase-0.94-security #321 (See https://builds.apache.org/job/HBase-0.94-security/321/)
          HBASE-9737 Corrupt HFile cause resource leak leading to Region Server OOM (stack: rev 1534855)

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
          hudson Hudson added a comment -

          FAILURE: Integrated in hbase-0.96 #154 (See https://builds.apache.org/job/hbase-0.96/154/)
          HBASE-9737 Corrupt HFile cause resource leak leading to Region Server OOM (stack: rev 1534854)

          • /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
          hudson Hudson added a comment -

          FAILURE: Integrated in HBase-0.94 #1180 (See https://builds.apache.org/job/HBase-0.94/1180/)
          HBASE-9737 Corrupt HFile cause resource leak leading to Region Server OOM (stack: rev 1534855)

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #805 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/805/)
          HBASE-9737 Corrupt HFile cause resource leak leading to Region Server OOM (stack: rev 1534850)

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java

            People

            • Assignee:
              adityakishore Aditya Kishore
              Reporter:
              adityakishore Aditya Kishore