Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-3335

check for edit log corruption at the end of the log

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 2.0.2-alpha
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Even after encountering an OP_INVALID, we should check the end of the edit log to make sure that it contains no more edits.

      This will catch things like rare race conditions or log corruptions that would otherwise remain undetected. They will got from being silent data loss scenarios to being cases that we can detect and fix.

      Using recovery mode, we can choose to ignore the end of the log if necessary.

      1. HDFS-3335-b1.001.patch
        10 kB
        Colin Patrick McCabe
      2. HDFS-3335-b1.002.patch
        10 kB
        Colin Patrick McCabe
      3. HDFS-3335-b1.003.patch
        10 kB
        Colin Patrick McCabe
      4. HDFS-3335.001.patch
        9 kB
        Colin Patrick McCabe
      5. HDFS-3335.002.patch
        23 kB
        Colin Patrick McCabe
      6. HDFS-3335.003.patch
        23 kB
        Colin Patrick McCabe
      7. HDFS-3335.004.patch
        24 kB
        Colin Patrick McCabe
      8. HDFS-3335.005.patch
        24 kB
        Colin Patrick McCabe
      9. HDFS-3335.006.patch
        25 kB
        Colin Patrick McCabe
      10. HDFS-3335.007.patch
        25 kB
        Colin Patrick McCabe
      11. HDFS-3335-b1.004.patch
        11 kB
        Colin Patrick McCabe
      12. HDFS-3335.008.patch
        23 kB
        Colin Patrick McCabe
      13. HDFS-3335.009.patch
        24 kB
        Colin Patrick McCabe
      14. HDFS-3335.010.patch
        24 kB
        Colin Patrick McCabe
      15. HDFS-3335.011.patch
        25 kB
        Colin Patrick McCabe

        Issue Links

          Activity

          Hide
          Colin Patrick McCabe added a comment -

          I ran the following unit tests on this one:

          TestCheckpoint
          TestEditLog
          TestNameNodeRecovery
          TestEditLogLoading
          TestNameNodeMXBean
          TestSaveNamespace
          TestSecurityTokenEditLog
          TestStorageDirectoryFailure
          TestStorageRestore
          
          Show
          Colin Patrick McCabe added a comment - I ran the following unit tests on this one: TestCheckpoint TestEditLog TestNameNodeRecovery TestEditLogLoading TestNameNodeMXBean TestSaveNamespace TestSecurityTokenEditLog TestStorageDirectoryFailure TestStorageRestore
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12524954/HDFS-3335-b1.001.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2348//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12524954/HDFS-3335-b1.001.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2348//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -
          • Fix 0xff / -1 confusion
          Show
          Colin Patrick McCabe added a comment - Fix 0xff / -1 confusion
          Hide
          Todd Lipcon added a comment -

          A few thoughts:

          • we need to do this in trunk before we can review/commit to branch-1
          • I'm afraid this may break the HA use case, where the previous active may crash in the middle of an fsync(). This could leave a partial edit at the end of the file. Let's complete HDFS-2766 before we enable this lest it break the HA case.
          Show
          Todd Lipcon added a comment - A few thoughts: we need to do this in trunk before we can review/commit to branch-1 I'm afraid this may break the HA use case, where the previous active may crash in the middle of an fsync(). This could leave a partial edit at the end of the file. Let's complete HDFS-2766 before we enable this lest it break the HA case.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525208/HDFS-3335-b1.002.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2355//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525208/HDFS-3335-b1.002.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2355//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -
          • added a test of padding with 0xff bytes, to ensure that we handle that case correctly.
          Show
          Colin Patrick McCabe added a comment - added a test of padding with 0xff bytes, to ensure that we handle that case correctly.
          Hide
          Colin Patrick McCabe added a comment -

          Hi Todd,

          I will post a patch for trunk shortly. I think for now we can disable this check for the HA caes, until we get a better solution for that.

          Show
          Colin Patrick McCabe added a comment - Hi Todd, I will post a patch for trunk shortly. I think for now we can disable this check for the HA caes, until we get a better solution for that.
          Hide
          Colin Patrick McCabe added a comment -

          Add patch for trunk.

          • Always check for corruption at the end of the log, unless we are reading the last megabyte of an unfinalized edit log.
          • Create constant for PREALLOCATION_LENGTH. Use it everywhere we were previously copying-and-pasting the integer.
          • FSEditLogLoader: log any exceptions that we get when reading the edit log.
          • FSEditLogOp: add verifyEndOfLog
          • FSEditLogOp: restrict serialized FSEditLogOp operations to a megabyte in size.
            Previously, I was trying to avoid having a hard limit on edit log operation size by doing
            in.mark(in.available())
            

          However, this is not valid. in.available() does NOT reflect how many bytes remain in the stream-- it just reflects the number that can be read without blocking, a number which may or may not be related to the amount left.

          Show
          Colin Patrick McCabe added a comment - Add patch for trunk. Always check for corruption at the end of the log, unless we are reading the last megabyte of an unfinalized edit log. Create constant for PREALLOCATION_LENGTH. Use it everywhere we were previously copying-and-pasting the integer. FSEditLogLoader: log any exceptions that we get when reading the edit log. FSEditLogOp: add verifyEndOfLog FSEditLogOp: restrict serialized FSEditLogOp operations to a megabyte in size. Previously, I was trying to avoid having a hard limit on edit log operation size by doing in.mark(in.available()) However, this is not valid. in.available() does NOT reflect how many bytes remain in the stream-- it just reflects the number that can be read without blocking, a number which may or may not be related to the amount left.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525247/HDFS-3335.001.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestPersistBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2358//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2358//artifact/trunk/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2358//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525247/HDFS-3335.001.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified test files. -1 javadoc. The javadoc tool appears to have generated 2 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestPersistBlocks +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2358//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2358//artifact/trunk/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2358//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -
          • fix the strategy a bit.
            Previously, the goal was to special-case in-progress logs. However, after recoverUnfinalizedSegments has run, there are no more in-progress logs. Any formerly in-progress logs are renamed and given a final transaction ID.

          So, to avoid choking on garbage at the end of the log, simply make sure that we don't read past the final transaction ID of the stream.

          • add unit tests
          • FSEditLogOp#readOp: must check for runtime exceptions in order to be robust.
          Show
          Colin Patrick McCabe added a comment - fix the strategy a bit. Previously, the goal was to special-case in-progress logs. However, after recoverUnfinalizedSegments has run, there are no more in-progress logs. Any formerly in-progress logs are renamed and given a final transaction ID. So, to avoid choking on garbage at the end of the log, simply make sure that we don't read past the final transaction ID of the stream. add unit tests FSEditLogOp#readOp: must check for runtime exceptions in order to be robust.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525355/HDFS-3335.002.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestDFSUpgradeFromImage
          org.apache.hadoop.hdfs.server.namenode.TestNameNodeRecovery
          org.apache.hadoop.hdfs.TestPersistBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2364//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2364//artifact/trunk/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2364//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525355/HDFS-3335.002.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. -1 javadoc. The javadoc tool appears to have generated 2 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestDFSUpgradeFromImage org.apache.hadoop.hdfs.server.namenode.TestNameNodeRecovery org.apache.hadoop.hdfs.TestPersistBlocks +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2364//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2364//artifact/trunk/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2364//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -
          • correctly handle pre-transactional edit logs
          • in FSEditLogOp#readOp, catch Throwable. Thi should be unecessary once HDFS-3134 is in, but it is not yet.
          Show
          Colin Patrick McCabe added a comment - correctly handle pre-transactional edit logs in FSEditLogOp#readOp, catch Throwable. Thi should be unecessary once HDFS-3134 is in, but it is not yet.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525541/HDFS-3335.003.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestPersistBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2372//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2372//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525541/HDFS-3335.003.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. -1 javadoc. The javadoc tool appears to have generated 2 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestPersistBlocks +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2372//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2372//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -
          • Don't call reset() once EOF has been declared
          Show
          Colin Patrick McCabe added a comment - Don't call reset() once EOF has been declared
          Hide
          Colin Patrick McCabe added a comment -
          • TestNameNodeRecovery: fix comment
          Show
          Colin Patrick McCabe added a comment - TestNameNodeRecovery: fix comment
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525692/HDFS-3335.005.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2379//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2379//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12525692/HDFS-3335.005.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. -1 javadoc. The javadoc tool appears to have generated 2 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2379//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2379//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -
          • rename verifyEndOfLog to verifyTerminator
          • add GarbageAfterTerminatorException. This exception stores the number of padding bytes we successfully read. This allows us to skip them after a resync(). This makes recovery work faster, which is important for unit tests.
          Show
          Colin Patrick McCabe added a comment - rename verifyEndOfLog to verifyTerminator add GarbageAfterTerminatorException. This exception stores the number of padding bytes we successfully read. This allows us to skip them after a resync(). This makes recovery work faster, which is important for unit tests.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526074/HDFS-3335.006.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2393//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2393//artifact/trunk/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2393//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526074/HDFS-3335.006.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2393//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/2393//artifact/trunk/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2393//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -
          • EditLogFileInputStream: check for txid >= highest txid, not just equal
          • FSEditLogOp: handle RuntimeExceptions thrown from readOp.
          Show
          Colin Patrick McCabe added a comment - EditLogFileInputStream: check for txid >= highest txid, not just equal FSEditLogOp: handle RuntimeExceptions thrown from readOp.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526396/HDFS-3335.007.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2411//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2411//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526396/HDFS-3335.007.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2411//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2411//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -

          For the branch-1 patch:

          Ignore corruption after the sentinel as long as it takes place in the last 2 megabytes of the log.

          Test this exception.

          Show
          Colin Patrick McCabe added a comment - For the branch-1 patch: Ignore corruption after the sentinel as long as it takes place in the last 2 megabytes of the log. Test this exception.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526442/HDFS-3335-b1.004.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2417//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526442/HDFS-3335-b1.004.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2417//console This message is automatically generated.
          Hide
          Todd Lipcon added a comment -

          In EditLogFileInputStream.nextOp, we should log a WARN message with the file name and data on how many bytes are skipped at the end of the file. This way, if there is an error replaying later, you might notice that in fact you did want to recover some of these edits. Having the warning in the log will make it easier to find where they went.

          In this place, it would also be nice to detect how many of those bytes were just 0xffffffff padding vs data that potentially looks like transactions.


          • Rename GarbageAfterTerminatorException.getOffset to something a little more clear – right now it's not obvious that this is a relative offset/length after the OP_INVALID, versus an offset since the beginning of the file, etc. Perhaps getPaddingLengthAfterEofMarker? I'm still not entirely clear what this length represents... by my reading of the javadoc, it is:
          <--- valid edits ---> < OP_INVALID > <-- N bytes of padding --> <-- non-padding data --> EOF
          

          where N above is what you're talking about?

          Maybe some ASCII art like the above in the javadoc would be helpful.

          Part of what is confusing me is this: does padding after OP_INVALID count as garbage or not?


          +  /** Testing hook */
          +  void setEditLog(FSEditLog newLog) {
          

          Can you add @VisibleForTesting and change to setEditLogForTesting so no one starts to use it in non-test code?


          • Lots of spurious whitespace changes in TestNameNodeRecovery
          • Can you add brief javadoc to the three implementations of Corruptor? eg "/** Truncate the last byte of the file /", "/ Add padding followed by some non-padding bytes to the end of the file /" and "/* Add only padding to the end of the file */"?

          Otherwise really nice tests.

          Show
          Todd Lipcon added a comment - In EditLogFileInputStream.nextOp , we should log a WARN message with the file name and data on how many bytes are skipped at the end of the file. This way, if there is an error replaying later, you might notice that in fact you did want to recover some of these edits. Having the warning in the log will make it easier to find where they went. In this place, it would also be nice to detect how many of those bytes were just 0xffffffff padding vs data that potentially looks like transactions. Rename GarbageAfterTerminatorException.getOffset to something a little more clear – right now it's not obvious that this is a relative offset/length after the OP_INVALID, versus an offset since the beginning of the file, etc. Perhaps getPaddingLengthAfterEofMarker ? I'm still not entirely clear what this length represents... by my reading of the javadoc, it is: <--- valid edits ---> < OP_INVALID > <-- N bytes of padding --> <-- non-padding data --> EOF where N above is what you're talking about? Maybe some ASCII art like the above in the javadoc would be helpful. Part of what is confusing me is this: does padding after OP_INVALID count as garbage or not? + /** Testing hook */ + void setEditLog(FSEditLog newLog) { Can you add @VisibleForTesting and change to setEditLogForTesting so no one starts to use it in non-test code? Lots of spurious whitespace changes in TestNameNodeRecovery Can you add brief javadoc to the three implementations of Corruptor? eg "/** Truncate the last byte of the file /", "/ Add padding followed by some non-padding bytes to the end of the file /" and "/ * Add only padding to the end of the file */"? Otherwise really nice tests.
          Hide
          Colin Patrick McCabe added a comment -
          • Yeah, your understanding of GarbageAfterTerminatorException#getOffset is correct. I'll rename it to something clearer.

          .bq Part of what is confusing me is this: does padding after OP_INVALID count as garbage or not?

          No, padding is just zeros or 0xffs. Garbage is something you wouldn't expect to be there, like more opcodes, random bytes, or something like that.

          • I'll see if I can remove the unecessary whitespace diffs...
          Show
          Colin Patrick McCabe added a comment - Yeah, your understanding of GarbageAfterTerminatorException#getOffset is correct. I'll rename it to something clearer. .bq Part of what is confusing me is this: does padding after OP_INVALID count as garbage or not? No, padding is just zeros or 0xffs. Garbage is something you wouldn't expect to be there, like more opcodes, random bytes, or something like that. I'll see if I can remove the unecessary whitespace diffs...
          Hide
          Colin Patrick McCabe added a comment -
          • EditLogFileInputStream: warn when skipping the last few bytes in a file.
          • rename GarbageAfterTerminatorException#offset to numAfterTerminator.
          • Rename FSImage#setEditLog to FSImage#setEditLogForTesting and add @VisibleForTesting annotation to it
          • avoid making some unecessary whitespace changes
          Show
          Colin Patrick McCabe added a comment - EditLogFileInputStream: warn when skipping the last few bytes in a file. rename GarbageAfterTerminatorException#offset to numAfterTerminator. Rename FSImage#setEditLog to FSImage#setEditLogForTesting and add @VisibleForTesting annotation to it avoid making some unecessary whitespace changes
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526470/HDFS-3335.008.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2422//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2422//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526470/HDFS-3335.008.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2422//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2422//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -
          • EditLogFileInputStream#nextOp: use tracker.skip rather than fStream.skip. Because of buffering, just calling skip on the underlying stream is not sufficient.
          • FSEditLogLoader#PositionTrackingInputStream: add a skip method. Also, overridden methods should be annotated with @Override.
          • TestNameNodeRecovery: when avoiding edit log finalization, stub out FSEditLog#endCurrentLogSegment rather than FSEditLog#close. Stubbing out close results in a file descriptor leak.
          • TestNameNodeRecovery#PaddingCorruptor: this form of corruption actually doesn't require recovery mode.
          Show
          Colin Patrick McCabe added a comment - EditLogFileInputStream#nextOp: use tracker.skip rather than fStream.skip. Because of buffering, just calling skip on the underlying stream is not sufficient. FSEditLogLoader#PositionTrackingInputStream: add a skip method. Also, overridden methods should be annotated with @Override. TestNameNodeRecovery: when avoiding edit log finalization, stub out FSEditLog#endCurrentLogSegment rather than FSEditLog#close. Stubbing out close results in a file descriptor leak. TestNameNodeRecovery#PaddingCorruptor: this form of corruption actually doesn't require recovery mode.
          Hide
          Colin Patrick McCabe added a comment -
          • update comments a bit
          Show
          Colin Patrick McCabe added a comment - update comments a bit
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526556/HDFS-3335.009.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2425//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2425//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526556/HDFS-3335.009.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2425//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2425//console This message is automatically generated.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526557/HDFS-3335.010.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2426//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2426//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526557/HDFS-3335.010.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2426//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2426//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -
          • increase MAX_OP_SIZE to 100 MB. Since we are limiting a bunch of strings to 1 MB in size, and we can have a few of them in an opcode, it only makes sense.
          • In the OfflineEditsViewer (oev), we have to initialize the EditLogFileInputStream with startTx = endTxId = INVALID_TXID. This is a special value that indicates that we don't know where the log ends. Previously, I had been passing -1 as the start and end txid, but the convention throughout the code is that only INVALID_TXID may be used for this purpose. This fixes the unit test failure.
          Show
          Colin Patrick McCabe added a comment - increase MAX_OP_SIZE to 100 MB. Since we are limiting a bunch of strings to 1 MB in size, and we can have a few of them in an opcode, it only makes sense. In the OfflineEditsViewer (oev), we have to initialize the EditLogFileInputStream with startTx = endTxId = INVALID_TXID. This is a special value that indicates that we don't know where the log ends. Previously, I had been passing -1 as the start and end txid, but the convention throughout the code is that only INVALID_TXID may be used for this purpose. This fixes the unit test failure.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526723/HDFS-3335.011.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2439//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2439//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12526723/HDFS-3335.011.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2439//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2439//console This message is automatically generated.
          Hide
          Todd Lipcon added a comment -
          +import org.mortbay.log.Log;
          +
          

          This is the wrong log - you should make a commons-logging Logger object (see elsewhere in the code for examples)


          +          if (in.skip(e.getNumAfterTerminator()) < e.getNumAfterTerminator()) {
          +            return null;
                     }
          

          Can you explain this condition to me? When would skip return fewer bytes than you ask for? It seems it would only be on EOF, and the fact that we got this value for getNumAfterTerminator would indicate that we shouldn't get an EOF there. So I think you should always be able to skip that many bytes, no? I would have expected to see an exception thrown inside the if statement.

          Also, given that we reset() the stream back to the marked position, then shouldn't we be skipping 1 + the indicated number of bytes, given that the marked position is before the OP_INVALID, not after it, whereas the counter of padding is after it?

          Show
          Todd Lipcon added a comment - + import org.mortbay.log.Log; + This is the wrong log - you should make a commons-logging Logger object (see elsewhere in the code for examples) + if (in.skip(e.getNumAfterTerminator()) < e.getNumAfterTerminator()) { + return null ; } Can you explain this condition to me? When would skip return fewer bytes than you ask for? It seems it would only be on EOF, and the fact that we got this value for getNumAfterTerminator would indicate that we shouldn't get an EOF there. So I think you should always be able to skip that many bytes, no? I would have expected to see an exception thrown inside the if statement. Also, given that we reset() the stream back to the marked position, then shouldn't we be skipping 1 + the indicated number of bytes, given that the marked position is before the OP_INVALID, not after it, whereas the counter of padding is after it?
          Hide
          Colin Patrick McCabe added a comment -

          This is the wrong log - you should make a commons-logging Logger object (see elsewhere in the code for examples)

          Yeah

          Can you explain this condition to me? When would skip return fewer bytes than you ask for? It seems it would only be on EOF, and the fact that we got this value for getNumAfterTerminator would indicate that we shouldn't get an EOF there. So I think you should always be able to skip that many bytes, no? I would have expected to see an exception thrown inside the if statement.

          I originally was not checking the return value of in.skip, for obvious reasons (it's supposed to work, since we just verified that there are that many bytes in the file). I got a findbugs warning so I added this check. I'll add an exception...

          Also, given that we reset() the stream back to the marked position, then shouldn't we be skipping 1 + the indicated number of bytes, given that the marked position is before the OP_INVALID, not after it, whereas the counter of padding is after it?

          There's a mandatory skip(1) at the bottom of the loop.

          Show
          Colin Patrick McCabe added a comment - This is the wrong log - you should make a commons-logging Logger object (see elsewhere in the code for examples) Yeah Can you explain this condition to me? When would skip return fewer bytes than you ask for? It seems it would only be on EOF, and the fact that we got this value for getNumAfterTerminator would indicate that we shouldn't get an EOF there. So I think you should always be able to skip that many bytes, no? I would have expected to see an exception thrown inside the if statement. I originally was not checking the return value of in.skip, for obvious reasons (it's supposed to work, since we just verified that there are that many bytes in the file). I got a findbugs warning so I added this check. I'll add an exception... Also, given that we reset() the stream back to the marked position, then shouldn't we be skipping 1 + the indicated number of bytes, given that the marked position is before the OP_INVALID, not after it, whereas the counter of padding is after it? There's a mandatory skip(1) at the bottom of the loop.
          Hide
          Colin Patrick McCabe added a comment -

          since attachments are broken, here's my latest patch as a diff against the previous one:

          --- /home/cmccabe/list/HDFS-3335_check_for_edit_log_corruption_at_the_end_of_the_log//HDFS-3335.011.patch	2012-05-14 00:51:23.999044041 -0700
          +++ /home/cmccabe/list/HDFS-3335_check_for_edit_log_corruption_at_the_end_of_the_log//HDFS-3335.012.patch	2012-05-14 15:12:51.081231207 -0700
          @@ -1,17 +1,8 @@
           diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
          -index 0b00187..c985ff9 100644
          +index 0b00187..29c90e9 100644
           --- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
           +++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
          -@@ -27,6 +27,8 @@ import java.io.DataInputStream;
          - import org.apache.hadoop.hdfs.protocol.HdfsConstants;
          - import org.apache.hadoop.hdfs.server.common.Storage;
          - import org.apache.hadoop.io.IOUtils;
          -+import org.mortbay.log.Log;
          -+
          - import com.google.common.annotations.VisibleForTesting;
          - 
          - /**
          -@@ -106,7 +108,34 @@ public class EditLogFileInputStream extends EditLogInputStream {
          +@@ -106,7 +106,35 @@ public class EditLogFileInputStream extends EditLogInputStream {
            
              @Override
              protected FSEditLogOp nextOp() throws IOException {
          @@ -37,8 +28,9 @@
           +        //
           +        long skipAmt = file.length() - tracker.getPos();
           +        if (skipAmt > 0) {
          -+          Log.warn("skipping " + skipAmt + " bytes at the end of edit log  '" +
          -+              getName() + "': reached txid " + txId + " out of " + lastTxId);
          ++          FSImage.LOG.warn("skipping " + skipAmt + " bytes at the end " +
          ++              "of edit log  '" + getName() + "': reached txid " + txId +
          ++              " out of " + lastTxId);
           +          tracker.skip(skipAmt);
           +        }
           +      }
          @@ -129,7 +121,7 @@
            
              public long getLastAppliedTxId() {
           diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
          -index 9f7742c..35b4557 100644
          +index 9f7742c..2eadd4b 100644
           --- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
           +++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
           @@ -75,6 +75,7 @@ import java.io.EOFException;
          @@ -140,8 +132,16 @@
            
            
              @SuppressWarnings("deprecation")
          -@@ -2267,26 +2268,66 @@ public abstract class FSEditLogOp {
          -      * @throws IOException on error.
          +@@ -2263,30 +2264,75 @@ public abstract class FSEditLogOp {
          +      * 
          +      * @param skipBrokenEdits    If true, attempt to skip over damaged parts of
          +      * the input stream, rather than throwing an IOException
          +-     * @return the operation read from the stream, or null at the end of the file
          +-     * @throws IOException on error.
          ++     * @return the operation read from the stream, or null at the end of the 
          ++     *         file
          ++     * @throws IOException on error.  This function should only throw an
          ++     *         exception when skipBrokenEdits is false.
                 */
                public FSEditLogOp readOp(boolean skipBrokenEdits) throws IOException {
           -      FSEditLogOp op = null;
          @@ -166,7 +166,10 @@
           +          // If we saw a terminator opcode followed by a long region of 0x00 or
           +          // 0xff, we want to skip over that region, because there's nothing
           +          // interesting there.
          -+          if (in.skip(e.getNumAfterTerminator()) < e.getNumAfterTerminator()) {
          ++          long numSkip = e.getNumAfterTerminator();
          ++          if (in.skip(numSkip) < numSkip) {
          ++            FSImage.LOG.error("Failed to skip " + numSkip + " bytes of " +
          ++              "garbage after an OP_INVALID.  Unexpected early EOF.");
           +            return null;
                      }
           -          return op;
          @@ -220,7 +223,7 @@
                      }
                    }
                  }
          -@@ -2306,8 +2347,10 @@ public abstract class FSEditLogOp {
          +@@ -2306,8 +2352,10 @@ public abstract class FSEditLogOp {
                  }
            
                  FSEditLogOpCodes opCode = FSEditLogOpCodes.fromByte(opCodeByte);
          @@ -232,7 +235,7 @@
            
                  FSEditLogOp op = cache.get(opCode);
                  if (op == null) {
          -@@ -2477,4 +2520,35 @@ public abstract class FSEditLogOp {
          +@@ -2477,4 +2525,35 @@ public abstract class FSEditLogOp {
                short mode = Short.valueOf(st.getValue("MODE"));
                return new PermissionStatus(username, groupname, new FsPermission(mode));
              }
          
          Show
          Colin Patrick McCabe added a comment - since attachments are broken, here's my latest patch as a diff against the previous one: --- /home/cmccabe/list/HDFS-3335_check_for_edit_log_corruption_at_the_end_of_the_log //HDFS-3335.011.patch 2012-05-14 00:51:23.999044041 -0700 +++ /home/cmccabe/list/HDFS-3335_check_for_edit_log_corruption_at_the_end_of_the_log //HDFS-3335.012.patch 2012-05-14 15:12:51.081231207 -0700 @@ -1,17 +1,8 @@ diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java -index 0b00187..c985ff9 100644 +index 0b00187..29c90e9 100644 --- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java +++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java -@@ -27,6 +27,8 @@ import java.io.DataInputStream; - import org.apache.hadoop.hdfs.protocol.HdfsConstants; - import org.apache.hadoop.hdfs.server.common.Storage; - import org.apache.hadoop.io.IOUtils; -+ import org.mortbay.log.Log; -+ - import com.google.common.annotations.VisibleForTesting; - - /** -@@ -106,7 +108,34 @@ public class EditLogFileInputStream extends EditLogInputStream { +@@ -106,7 +106,35 @@ public class EditLogFileInputStream extends EditLogInputStream { @Override protected FSEditLogOp nextOp() throws IOException { @@ -37,8 +28,9 @@ + // + long skipAmt = file.length() - tracker.getPos(); + if (skipAmt > 0) { -+ Log.warn( "skipping " + skipAmt + " bytes at the end of edit log '" + -+ getName() + "': reached txid " + txId + " out of " + lastTxId); ++ FSImage.LOG.warn( "skipping " + skipAmt + " bytes at the end " + ++ "of edit log '" + getName() + "': reached txid " + txId + ++ " out of " + lastTxId); + tracker.skip(skipAmt); + } + } @@ -129,7 +121,7 @@ public long getLastAppliedTxId() { diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java -index 9f7742c..35b4557 100644 +index 9f7742c..2eadd4b 100644 --- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java +++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java @@ -75,6 +75,7 @@ import java.io.EOFException; @@ -140,8 +132,16 @@ @SuppressWarnings( "deprecation" ) -@@ -2267,26 +2268,66 @@ public abstract class FSEditLogOp { - * @ throws IOException on error. +@@ -2263,30 +2264,75 @@ public abstract class FSEditLogOp { + * + * @param skipBrokenEdits If true , attempt to skip over damaged parts of + * the input stream, rather than throwing an IOException +- * @ return the operation read from the stream, or null at the end of the file +- * @ throws IOException on error. ++ * @ return the operation read from the stream, or null at the end of the ++ * file ++ * @ throws IOException on error. This function should only throw an ++ * exception when skipBrokenEdits is false . */ public FSEditLogOp readOp( boolean skipBrokenEdits) throws IOException { - FSEditLogOp op = null ; @@ -166,7 +166,10 @@ + // If we saw a terminator opcode followed by a long region of 0x00 or + // 0xff, we want to skip over that region, because there's nothing + // interesting there. -+ if (in.skip(e.getNumAfterTerminator()) < e.getNumAfterTerminator()) { ++ long numSkip = e.getNumAfterTerminator(); ++ if (in.skip(numSkip) < numSkip) { ++ FSImage.LOG.error( "Failed to skip " + numSkip + " bytes of " + ++ "garbage after an OP_INVALID. Unexpected early EOF." ); + return null ; } - return op; @@ -220,7 +223,7 @@ } } } -@@ -2306,8 +2347,10 @@ public abstract class FSEditLogOp { +@@ -2306,8 +2352,10 @@ public abstract class FSEditLogOp { } FSEditLogOpCodes opCode = FSEditLogOpCodes.fromByte(opCodeByte); @@ -232,7 +235,7 @@ FSEditLogOp op = cache.get(opCode); if (op == null ) { -@@ -2477,4 +2520,35 @@ public abstract class FSEditLogOp { +@@ -2477,4 +2525,35 @@ public abstract class FSEditLogOp { short mode = Short .valueOf(st.getValue( "MODE" )); return new PermissionStatus(username, groupname, new FsPermission(mode)); }
          Hide
          Todd Lipcon added a comment -

          Thanks, I'll run the tests locally for the changed version and commit it if they pass.

          Show
          Todd Lipcon added a comment - Thanks, I'll run the tests locally for the changed version and commit it if they pass.
          Hide
          Todd Lipcon added a comment -

          I ran all the tests locally as well as findbugs. Everything passed except for known things that always fail in my environment. I'll commit this momentarily including the delta pasted above.

          Show
          Todd Lipcon added a comment - I ran all the tests locally as well as findbugs. Everything passed except for known things that always fail in my environment. I'll commit this momentarily including the delta pasted above.
          Hide
          Todd Lipcon added a comment -

          I ran all the tests locally as well as findbugs. Everything passed except for known things that always fail in my environment. I'll commit this momentarily including the delta pasted above.

          Show
          Todd Lipcon added a comment - I ran all the tests locally as well as findbugs. Everything passed except for known things that always fail in my environment. I'll commit this momentarily including the delta pasted above.
          Hide
          Todd Lipcon added a comment -

          oh, I realized I didn't explicitly say +1. Committing momentarily to branch-2 and trunk. Thanks, Colin.

          Show
          Todd Lipcon added a comment - oh, I realized I didn't explicitly say +1. Committing momentarily to branch-2 and trunk. Thanks, Colin.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2318 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2318/)
          HDFS-3335. check for edit log corruption at the end of the log. Contributed by Colin Patrick McCabe. (Revision 1338492)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1338492
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/OfflineEditsLoader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #2318 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2318/ ) HDFS-3335 . check for edit log corruption at the end of the log. Contributed by Colin Patrick McCabe. (Revision 1338492) Result = SUCCESS todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1338492 Files : /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/OfflineEditsLoader.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #2244 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2244/)
          HDFS-3335. check for edit log corruption at the end of the log. Contributed by Colin Patrick McCabe. (Revision 1338492)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1338492
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/OfflineEditsLoader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java
          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #2244 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2244/ ) HDFS-3335 . check for edit log corruption at the end of the log. Contributed by Colin Patrick McCabe. (Revision 1338492) Result = SUCCESS todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1338492 Files : /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/OfflineEditsLoader.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #2260 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2260/)
          HDFS-3335. check for edit log corruption at the end of the log. Contributed by Colin Patrick McCabe. (Revision 1338492)

          Result = FAILURE
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1338492
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/OfflineEditsLoader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #2260 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2260/ ) HDFS-3335 . check for edit log corruption at the end of the log. Contributed by Colin Patrick McCabe. (Revision 1338492) Result = FAILURE todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1338492 Files : /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileInputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/OfflineEditsLoader.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeRecovery.java

            People

            • Assignee:
              Colin Patrick McCabe
              Reporter:
              Colin Patrick McCabe
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development