Hadoop HDFS / HDFS-1378

Edit log replay should track and report file offsets in case of errors

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.23.0, 1.1.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Occasionally there are bugs or operational mistakes that result in corrupt edit logs, which I end up having to repair by hand. In these cases it would be very handy to have the error message also print the file offsets of the last several edit log opcodes, so it's easier to find the right place to edit in the OP_INVALID marker. We could also use this facility to provide a rough estimate of how far along the NN is in edit log replay during startup (handy when a 2NN has died and replay takes a while).
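
The progress estimate suggested in the description could be as simple as comparing the current read offset with the file length. The following is a minimal sketch under that assumption; the class `ReplayProgress` and method `percentComplete` are hypothetical names for illustration and are not part of HDFS.

```java
/** Hypothetical helper: estimates edit log replay progress from a byte offset. */
final class ReplayProgress {
  private ReplayProgress() {}

  /**
   * Returns the percentage (0 to 100) of the edit log consumed so far.
   * A non-positive file length is reported as 0 instead of dividing by zero,
   * and an offset past the end of the file is clamped to 100.
   */
  static int percentComplete(long currentOffset, long fileLength) {
    if (fileLength <= 0) {
      return 0;
    }
    long clamped = Math.max(0, Math.min(currentOffset, fileLength));
    return (int) (clamped * 100 / fileLength);
  }
}
```

The estimate is only rough: if the file contains padding after the last real opcode, replay would finish before the figure reaches 100.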

      1. hdfs-1378.0.patch
        36 kB
        Aaron T. Myers
      2. hdfs-1378.1.patch
        37 kB
        Aaron T. Myers
      3. hdfs-1378.2.txt
        38 kB
        Todd Lipcon
      4. HDFS-1378-b1.002.patch
        4 kB
        Colin Patrick McCabe
      5. HDFS-1378-b1.003.patch
        5 kB
        Colin Patrick McCabe
      6. HDFS-1378-b1.004.patch
        9 kB
        Colin Patrick McCabe
      7. hdfs-1378-branch20.txt
        4 kB
        Todd Lipcon

          Activity

          Todd Lipcon added a comment -

          Here's a patch for branch-20, not for commit.

          In trunk the code has been refactored a bit so that the edit log loading code directly gets a DataInputStream, so we can't do it quite the same way. I'd like to change EditLogInputStream to just return an InputStream rather than DataInputStream so that we can wrap it in a position tracker as done in this patch.

          Here's example output from an edit log that got corrupted due to the root disk running out of space:

          10/09/06 11:02:30 ERROR common.Storage: Error replaying edit log at offset 1698779
          10/09/06 11:02:30 ERROR common.Storage: Last 4 opcodes at offsets: 1629141 1629329 1629546 1698775
          10/09/06 11:02:30 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
          java.io.IOException: Incorrect data format. logVersion is -18 but writables.length is 0. 
          

          From here it's very easy to use bvi to figure out where truncation or corruption occurred and fix it up.
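
The "position tracker" idea in this comment can be sketched as a thin wrapper that counts bytes as they pass through, so the loader can ask for the current offset at any time. This is an illustrative sketch, not the code from the attached patch: the class name `OffsetCountingInputStream` is hypothetical, and only `getPos()` mirrors a name that appears in this thread.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Sketch of a stream wrapper that counts the bytes consumed so far. */
class OffsetCountingInputStream extends FilterInputStream {
  private long pos = 0;      // bytes read or skipped since construction
  private long markPos = 0;  // value of pos at the most recent mark()

  OffsetCountingInputStream(InputStream in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    int b = super.read();
    if (b != -1) {
      pos++;
    }
    return b;
  }

  // read(byte[]) in FilterInputStream delegates here, so bytes are counted once.
  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = super.read(buf, off, len);
    if (n > 0) {
      pos += n;
    }
    return n;
  }

  @Override
  public long skip(long n) throws IOException {
    long skipped = super.skip(n);
    if (skipped > 0) {
      pos += skipped;
    }
    return skipped;
  }

  @Override
  public synchronized void mark(int readLimit) {
    super.mark(readLimit);
    markPos = pos;
  }

  @Override
  public synchronized void reset() throws IOException {
    super.reset();  // throws if the underlying stream cannot reset
    pos = markPos;
  }

  /** Offset, in bytes, of the next byte to be read. */
  public long getPos() {
    return pos;
  }
}
```

Wrapping the raw stream first and then layering a `DataInputStream` on top keeps the count accurate, because every typed read ends up in one of the overridden methods.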

          Aaron T. Myers added a comment -

          Patch looks pretty solid, Todd, and very helpful. One comment:

          There are large classes of edit log corruption that result in an exception other than an IOE being thrown, but this debugging info is only printed when an IOE is thrown. I've twice now had to change this code to catch NPE and recompile to get it to print this info. Ideally, I think we'd put this in a "catch (Throwable t)" block, with the actual exception re-thrown after printing.
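
The suggestion above, printing diagnostics for any failure type and then re-throwing the original exception, might look like the following sketch. `ReplayErrorReporting`, `EditReplayer`, and `replayWithDiagnostics` are hypothetical names chosen for illustration; the real loader's structure is not shown in this thread.

```java
import java.io.IOException;

final class ReplayErrorReporting {
  private ReplayErrorReporting() {}

  /** Hypothetical stand-in for the code that applies edit log operations. */
  interface EditReplayer {
    void replay() throws IOException;
  }

  /**
   * Runs the replayer and, on any failure, reports diagnostics before
   * re-throwing the original exception unchanged.
   */
  static void replayWithDiagnostics(EditReplayer replayer, Runnable reportDiagnostics)
      throws IOException {
    try {
      replayer.replay();
    } catch (Throwable t) {
      // Runs for IOException, NullPointerException, or anything else.
      reportDiagnostics.run();
      // Java 7+ precise rethrow: t can only be an IOException or an
      // unchecked throwable here, so no wrapping is needed.
      throw t;
    }
  }
}
```

Because the exception is re-thrown as-is, callers that already catch specific exception types keep working.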

          Aaron T. Myers added a comment -

          The patch is a little difficult to review because I indented a big block of code and diff didn't figure that out very well. The only code I changed between the try and catch was to add these two lines:

          +          recentOpcodeOffsets[numEdits % recentOpcodeOffsets.length] =
          +              tracker.getPos();
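
The two added lines use the edit counter modulo the array length as a slot index, which turns a plain array into a ring buffer of the most recent offsets. A self-contained sketch of that technique follows; the class `RecentOffsets` and its methods are hypothetical and only illustrate the indexing.

```java
/** Sketch: fixed-size ring holding the offsets of the most recent opcodes. */
final class RecentOffsets {
  private final long[] offsets;
  private long numEdits = 0;

  RecentOffsets(int capacity) {
    offsets = new long[capacity];
  }

  /** Stores pos in the next slot, overwriting the oldest entry once full. */
  void record(long pos) {
    offsets[(int) (numEdits % offsets.length)] = pos;
    numEdits++;
  }

  /** Copy of the slots in storage order, which is not chronological after a wrap. */
  long[] raw() {
    return offsets.clone();
  }
}
```

With a capacity of 4, recording offsets 1 through 6 leaves the slots as 5, 6, 3, 4, so a reader of the buffer has to reorder the entries before printing them.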
          
          Aaron T. Myers added a comment -

          I should've mentioned: this patch is for trunk.

          Todd Lipcon added a comment -

          Hi Aaron. Would you mind contributing a unit test for PositionTrackingInputStream? I regret that I was lazy and didn't include one in the initial revision of this patch.

          Aaron T. Myers added a comment -

          Hi Todd,

          Thanks a lot for the very helpful and thorough review. In what way is the included test not sufficient?

          Best,
          Aaron

          Todd Lipcon added a comment -

          Oops. I missed the test that you so thoroughly included (was looking only at changed files and missed the new one). +1 pending Hudson results.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478673/hdfs-1378.0.patch
          against trunk revision 1101343.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.TestEditLog
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/474//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/474//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/474//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          Updated patch which fixes the TestEditLog test failure. The only differences between my original patch and this one are these lines in TestEditLog.java:

          -    } catch (ChecksumException e) {
          +    } catch (IOException e) {
                 // expected
          +      assertEquals("Cause of exception should be ChecksumException",
          +          e.getCause().getClass(), ChecksumException.class);
          

          I believe the other tests are presently failing on trunk as well.
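
The test change above implies that the loader now surfaces failures as a generic IOException with the original exception attached as its cause. The sketch below illustrates that pattern under this assumption. `WrappedCauseExample` and `loadEdits` are hypothetical, and the nested `ChecksumException` is a local stand-in so the example compiles without Hadoop on the classpath.

```java
import java.io.IOException;

final class WrappedCauseExample {
  private WrappedCauseExample() {}

  /** Local stand-in for Hadoop's ChecksumException (assumption for this sketch). */
  static class ChecksumException extends IOException {
    ChecksumException(String msg) {
      super(msg);
    }
  }

  /** Hypothetical loader: wraps any failure in an IOException carrying the offset. */
  static void loadEdits(long offset) throws IOException {
    try {
      // Simulate a checksum failure while reading an opcode.
      throw new ChecksumException("checksum mismatch");
    } catch (Throwable t) {
      throw new IOException("Error replaying edit log at offset " + offset, t);
    }
  }
}
```

A caller that used to catch the specific exception type now catches IOException and inspects `getCause()`, which is what the updated assertion in the test does.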

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478776/hdfs-1378.1.patch
          against trunk revision 1101753.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.TestEditLogFileOutputStream
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/479//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/479//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/479//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          All of the test failures except for TestEditLogFileOutputStream are known to be failing on trunk. The TestEditLogFileOutputStream failure appears to be transient. It passes on my box, and this is the message it failed with in the Jenkins run:

          java.net.BindException: Port in use: 0.0.0.0:50070

          Todd Lipcon added a comment -

          When looking at the patch before commit I noticed one small problem: if the error occurred within the first 4 opcodes, the offsets would not be printed as part of the error message. The attached patch fixes this.
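
One way to handle the case described here, fewer recorded opcodes than buffer slots, is to pre-fill the buffer with a sentinel and skip sentinel entries when building the message. The sketch below shows that approach; it is an assumption about how such a fix could work rather than a copy of the attached patch, and `OffsetReport`, `newBuffer`, and `describe` are hypothetical names.

```java
import java.util.Arrays;

final class OffsetReport {
  /** Sentinel for a slot that has never been written; real offsets are >= 0. */
  private static final long UNSET = -1;

  private OffsetReport() {}

  /** Creates a buffer whose slots are all marked as unused. */
  static long[] newBuffer(int capacity) {
    long[] buf = new long[capacity];
    Arrays.fill(buf, UNSET);
    return buf;
  }

  /** Builds an error message that lists whichever offsets were recorded. */
  static String describe(long errorOffset, long[] recentOffsets) {
    StringBuilder sb = new StringBuilder("Error replaying edit log at offset " + errorOffset);
    long[] sorted = recentOffsets.clone();
    // Offsets only grow during replay, so ascending order is chronological.
    Arrays.sort(sorted);
    boolean any = false;
    for (long offset : sorted) {
      if (offset == UNSET) {
        continue;  // slot never filled: fewer opcodes read than slots
      }
      if (!any) {
        sb.append("\nRecent opcode offsets:");
        any = true;
      }
      sb.append(' ').append(offset);
    }
    return sb.toString();
  }
}
```

Sorting a copy also puts a wrapped ring buffer back into chronological order, so the same method serves both the partially filled and the full case.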

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478890/hdfs-1378.2.txt
          against trunk revision 1102094.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/487//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/487//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/487//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          Updated patch looks good to me. Thanks for catching that, Todd.

          Todd Lipcon added a comment -

          Committed to trunk. Thanks Aaron!

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #658 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/658/)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #673 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/673/)

          Colin Patrick McCabe added a comment -

          I'd like to port this to branch-1 so that we can have better error messages there. It should be a trivial port. Any objections?

          Colin Patrick McCabe added a comment -
          • port to branch-1
          Colin Patrick McCabe added a comment -
          • include bug fix from revised patch
          • backport unit test as well
          Todd Lipcon added a comment -

          Looks like your uploaded patch is still missing the unit test. Also, can you run all the edit-log related tests in branch-1 and ensure that they still pass?

          Colin Patrick McCabe added a comment -
          • add unit test
          Colin Patrick McCabe added a comment -

          I ran TestCheckpoint, TestEditLog, TestEditLogLoading, TestNameNodeMXBean, TestSaveNamespace, TestSecurityTokenEditLog, TestStorageDirectoryFailure, and TestStorageRestore.

          Todd Lipcon added a comment -

          Committed backport to branch-1. Thanks, Colin!

          Matt Foley added a comment -

          Closed upon release of Hadoop-1.1.0.


            People

            • Assignee: Colin Patrick McCabe
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 6
