Hadoop HDFS / HDFS-1378

Edit log replay should track and report file offsets in case of errors

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.23.0, 1.1.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Occasionally there are bugs or operational mistakes that result in corrupt edit logs, which I end up having to repair by hand. In these cases it would be very handy to have the error message also print the file offsets of the last several edit log opcodes, so it's easier to find the right place to edit in the OP_INVALID marker. We could also use this facility to provide a rough estimate of how far along the NN is in edit log replay during startup (handy when a 2NN has died and replay takes a while).

      1. HDFS-1378-b1.004.patch (9 kB, Colin Patrick McCabe)
      2. HDFS-1378-b1.003.patch (5 kB, Colin Patrick McCabe)
      3. HDFS-1378-b1.002.patch (4 kB, Colin Patrick McCabe)
      4. hdfs-1378.2.txt (38 kB, Todd Lipcon)
      5. hdfs-1378.1.patch (37 kB, Aaron T. Myers)
      6. hdfs-1378.0.patch (36 kB, Aaron T. Myers)
      7. hdfs-1378-branch20.txt (4 kB, Todd Lipcon)

        Issue Links

          Activity

          Todd Lipcon created issue -
          Todd Lipcon added a comment -

          Here's a patch for branch-20, not for commit.

          In trunk the code has been refactored a bit so that the edit log loading code directly gets a DataInputStream, so we can't do it quite the same way. I'd like to change EditLogInputStream to just return an InputStream rather than DataInputStream so that we can wrap it in a position tracker as done in this patch.

          Here's example output from an edit log that got corrupted due to the root disk running out of space:

          10/09/06 11:02:30 ERROR common.Storage: Error replaying edit log at offset 1698779
          10/09/06 11:02:30 ERROR common.Storage: Last 4 opcodes at offsets: 1629141 1629329 1629546 1698775
          10/09/06 11:02:30 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
          java.io.IOException: Incorrect data format. logVersion is -18 but writables.length is 0. 
          

          From here it's very easy to use bvi to figure out where truncation or corruption occurred and fix it up.
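For readers following along, here is a minimal sketch of the position-tracker idea described above: a wrapper around a plain InputStream that counts the bytes consumed, so a DataInputStream layered on top can be asked for its current offset. The class name OffsetTrackingStream is hypothetical and this is not the code from the attached patch; only the getPos() accessor name is taken from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch: counts bytes consumed from the wrapped stream. */
class OffsetTrackingStream extends FilterInputStream {
  private long pos = 0;       // bytes consumed so far
  private long markPos = -1;  // position saved by mark(), -1 if unset

  OffsetTrackingStream(InputStream in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    int b = super.read();
    if (b != -1) {
      pos++;
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = super.read(buf, off, len);
    if (n > 0) {
      pos += n;
    }
    return n;
  }

  @Override
  public long skip(long n) throws IOException {
    long skipped = super.skip(n);
    if (skipped > 0) {
      pos += skipped;
    }
    return skipped;
  }

  @Override
  public synchronized void mark(int readLimit) {
    super.mark(readLimit);
    markPos = pos;
  }

  @Override
  public synchronized void reset() throws IOException {
    if (markPos < 0) {
      throw new IOException("Stream was not marked");
    }
    super.reset();
    pos = markPos;
  }

  /** Offset of the next byte to be read, relative to where wrapping began. */
  long getPos() {
    return pos;
  }
}

class OffsetTrackingDemo {
  public static void main(String[] args) throws IOException {
    OffsetTrackingStream tracker =
        new OffsetTrackingStream(new ByteArrayInputStream(new byte[16]));
    DataInputStream in = new DataInputStream(tracker);
    in.readByte(); // consumes 1 byte
    in.readInt();  // consumes 4 bytes
    System.out.println("offset = " + tracker.getPos());
  }
}
```

The tracker has to sit underneath the DataInputStream, which is why the loader needs access to the raw InputStream rather than an already-wrapped DataInputStream.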

          Todd Lipcon made changes -
          Field Original Value New Value
          Attachment hdfs-1378-branch20.txt [ 12453953 ]
          Aaron T. Myers added a comment -

          Patch looks pretty solid, Todd, and very helpful. One comment:

          There are large classes of edit log corruption that result in an exception other than an IOE being thrown. But this debugging info is only printed when an IOE is thrown. I've twice now had to change this code to catch NPE and recompile to get it to print this info. Ideally I think we'd change things so that this stuff is in a "catch (Throwable t)" block, with the actual exception being re-thrown after printing.
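A rough sketch of the "catch (Throwable t), print, then re-throw" suggestion follows. The Op interface, the replay method, and the StringBuilder used as a log are all hypothetical stand-ins, not the loader's real types. It relies on Java 7+ precise rethrow, so the method can still declare only IOException.

```java
import java.io.IOException;

class ReplayDiagnostics {
  /** Hypothetical stand-in for applying one edit-log operation. */
  interface Op {
    void apply() throws IOException;
  }

  /**
   * Applies one operation. On any failure, records the offset first and
   * then re-throws the original exception unchanged.
   */
  static void replay(Op op, long offset, StringBuilder log) throws IOException {
    try {
      op.apply();
    } catch (Throwable t) {
      // Runs for IOException, NullPointerException, and anything else.
      log.append("Error replaying edit log at offset ").append(offset);
      throw t; // precise rethrow: compiler knows t is IOException or unchecked
    }
  }

  public static void main(String[] args) throws IOException {
    StringBuilder log = new StringBuilder();
    try {
      replay(() -> { throw new NullPointerException("corrupt record"); },
          1698779L, log);
    } catch (NullPointerException e) {
      System.out.println(log + " (" + e + ")");
    }
  }
}
```

Because the original exception is re-thrown as-is, callers that already handle IOException see no change in behavior; only the diagnostic output is added.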

          Aaron T. Myers made changes -
          Assignee Todd Lipcon [ tlipcon ] Aaron T. Myers [ atm ]
          Aaron T. Myers added a comment -

          The patch is a little difficult to review because I indented a big block of code and diff didn't figure that out very well. The only code I changed between the try and catch was to add these two lines:

          +          recentOpcodeOffsets[numEdits % recentOpcodeOffsets.length] =
          +              tracker.getPos();
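To illustrate what the two added lines do, here is a small standalone sketch of the modulo-indexed buffer. The names recentOpcodeOffsets and numEdits come from the diff; the record helper and the sample offsets are made up for illustration. Once the array is full, each new offset overwrites the oldest slot.

```java
import java.util.Arrays;

class RecentOffsetsDemo {
  /**
   * Stores each offset at index (numEdits % length), so the array always
   * holds the most recent `capacity` offsets, in rotated order.
   */
  static long[] record(long[] offsets, int capacity) {
    long[] recentOpcodeOffsets = new long[capacity];
    int numEdits = 0;
    for (long pos : offsets) {
      recentOpcodeOffsets[numEdits % recentOpcodeOffsets.length] = pos;
      numEdits++;
    }
    return recentOpcodeOffsets;
  }

  public static void main(String[] args) {
    long[] recent = record(new long[] {10, 20, 30, 40, 50, 60}, 4);
    // Slots 0 and 1 were overwritten by the 5th and 6th offsets.
    System.out.println(Arrays.toString(recent)); // [50, 60, 30, 40]
  }
}
```

Note that the slots are not in chronological order after wraparound, so anything that prints them needs to sort or rotate first.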
          
          Aaron T. Myers made changes -
          Attachment hdfs-1378.0.patch [ 12478673 ]
          Aaron T. Myers added a comment -

          I should've mentioned: this patch is for trunk.

          Aaron T. Myers made changes -
          Fix Version/s 0.23.0 [ 12315571 ]
          Todd Lipcon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Todd Lipcon added a comment -

          Hi Aaron. Would you mind contributing a unit test for PositionTrackingInputStream? I regret that I was lazy and didn't include one in the initial revision of this patch.

          Aaron T. Myers added a comment -

          Hi Todd,

          Thanks a lot for the very helpful and thorough review. In what way is the included test not sufficient?

          Best,
          Aaron

          Todd Lipcon added a comment -

          Oops. I missed the test that you so thoroughly included (was looking only at changed files and missed the new one). +1 pending Hudson results.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478673/hdfs-1378.0.patch
          against trunk revision 1101343.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.TestEditLog
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/474//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/474//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/474//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          Updated patch which fixes the TestEditLog test failure. The only differences between my original patch and this one are these lines in TestEditLog.java:

          -    } catch (ChecksumException e) {
          +    } catch (IOException e) {
                 // expected
          +      assertEquals("Cause of exception should be ChecksumException",
          +          e.getCause().getClass(), ChecksumException.class);
          

          I believe the other test failures are presently failing on trunk.

          Aaron T. Myers made changes -
          Attachment hdfs-1378.1.patch [ 12478776 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478776/hdfs-1378.1.patch
          against trunk revision 1101753.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.TestEditLogFileOutputStream
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/479//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/479//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/479//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          All of the test failures except for TestEditLogFileOutputStream are known to be failing on trunk. The TestEditLogFileOutputStream failure appears to be transient. It passes on my box, and this is the message it failed with in the Jenkins run:

          java.net.BindException: Port in use: 0.0.0.0:50070

          Todd Lipcon added a comment -

          When looking at the patch before commit I noticed one small problem – if the error occurs in the first 4 opcodes, it wouldn't print the offsets as part of the error message. Attached patch fixes this.
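One possible way to handle the "fewer than 4 opcodes" case is sketched below. It assumes unfilled slots are pre-filled with a -1 sentinel and skipped when printing. The attached patch may well do this differently, and OffsetReport and describe are hypothetical names. Sorting restores chronological order because offsets only increase during replay.

```java
import java.util.Arrays;

class OffsetReport {
  /** Builds the offsets line from however many opcodes were actually read. */
  static String describe(long[] offsetsSeen, int capacity) {
    long[] recent = new long[capacity];
    Arrays.fill(recent, -1L); // sentinel: slot never written
    for (int i = 0; i < offsetsSeen.length; i++) {
      recent[i % capacity] = offsetsSeen[i];
    }

    long[] sorted = recent.clone();
    Arrays.sort(sorted); // sentinels sort first; real offsets end up oldest-first

    StringBuilder sb = new StringBuilder("Recent opcode offsets:");
    for (long off : sorted) {
      if (off != -1L) {
        sb.append(' ').append(off);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Failure after only two opcodes: both offsets are still reported.
    System.out.println(describe(new long[] {100, 250}, 4));
    // Failure after six opcodes: only the last four are reported.
    System.out.println(describe(new long[] {10, 20, 30, 40, 50, 60}, 4));
  }
}
```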

          Todd Lipcon made changes -
          Attachment hdfs-1378.2.txt [ 12478890 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478890/hdfs-1378.2.txt
          against trunk revision 1102094.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/487//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/487//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/487//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          Updated patch looks good to me. Thanks for catching that, Todd.

          Todd Lipcon added a comment -

          Committed to trunk. Thanks Aaron!

          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #658 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/658/)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #673 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/673/)

          Colin Patrick McCabe added a comment -

          I'd like to port this to branch-1 so that we can have better error messages there. It should be a trivial port. Any objections?

          Colin Patrick McCabe made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Assignee Aaron T. Myers [ atm ] Colin Patrick McCabe [ cmccabe ]
          Colin Patrick McCabe added a comment -
          • port to branch-1
          Colin Patrick McCabe made changes -
          Attachment HDFS-1378-b1.002.patch [ 12521236 ]
          Colin Patrick McCabe added a comment -
          • include bug fix from revised patch
          • backport unit test as well
          Colin Patrick McCabe made changes -
          Attachment HDFS-1378-b1.003.patch [ 12521238 ]
          Colin Patrick McCabe made changes -
          Link This issue blocks HDFS-3055 [ HDFS-3055 ]
          Todd Lipcon added a comment -

          Looks like your uploaded patch is still missing the unit test. Also, can you run all the edit-log related tests in branch-1 and ensure that they still pass?

          Colin Patrick McCabe added a comment -
          • add unit test
          Colin Patrick McCabe made changes -
          Attachment HDFS-1378-b1.004.patch [ 12521246 ]
          Colin Patrick McCabe added a comment -

          Ran TestCheckpoint, TestEditLog, TestEditLogLoading, TestNameNodeMXBean, TestSaveNamespace, TestSecurityTokenEditLog, TestStorageDirectoryFailure, TestStorageRestore.

          Todd Lipcon added a comment -

          Committed backport to branch-1. Thanks, Colin!

          Todd Lipcon made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Fix Version/s 1.1.0 [ 12317959 ]
          Resolution Fixed [ 1 ]
          Matt Foley added a comment -

          Closed upon release of Hadoop-1.1.0.

          Matt Foley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Gavin made changes -
          Link This issue blocks HDFS-3055 [ HDFS-3055 ]
          Gavin made changes -
          Link This issue is depended upon by HDFS-3055 [ HDFS-3055 ]
          Transition                    Time In Source Status   Execution Times   Last Executer          Last Execution Date
          Open -> Patch Available       246d 3h 29m             1                 Todd Lipcon            10/May/11 22:29
          Patch Available -> Resolved   1d 2h 48m               1                 Todd Lipcon            12/May/11 01:18
          Resolved -> Reopened          327d 22h 40m            1                 Colin Patrick McCabe   03/Apr/12 23:59
          Reopened -> Resolved          1d 22m                  1                 Todd Lipcon            05/Apr/12 00:22
          Resolved -> Closed            195d 19h 5m             1                 Matt Foley             17/Oct/12 19:27

            People

            • Assignee: Colin Patrick McCabe
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 6
