Hadoop HDFS / HDFS-1921

Save namespace can cause NN to be unable to come up on restart

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.22.0, 0.23.0
    • Fix Version/s: 0.22.0, 0.23.0
    • Component/s: None
    • Labels:
      None

      Description

      I discovered this in the course of trying to implement a fix for HDFS-1505.

      Per the comment for FSImage.saveNamespace(...), the algorithm for save namespace proceeds in the following order:

      1. rename current to lastcheckpoint.tmp for all of them,
      2. save image and recreate edits for all of them,
      3. rename lastcheckpoint.tmp to previous.checkpoint.

      The problem is that step 3 occurs regardless of whether or not an error occurs for all storage directories in step 2. Upon restart, the NN will see non-existent or corrupt current directories, and no lastcheckpoint.tmp directories, and so will conclude that the storage directories are not formatted.

      This issue appears to be present on both 0.22 and 0.23. This should arguably be a 0.22/0.23 blocker.
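
The three steps above, and the failure mode, can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not the FSImage code or the committed patch: the class `SaveNamespaceSketch`, the `StorageDir` type, the `failOnSave` switch, and the rollback in `saveGuarded` are all hypothetical names and choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the save-namespace sequence; not the real FSImage code. */
public class SaveNamespaceSketch {

    /** One storage directory: maps a subdirectory name to a label for its contents. */
    public static class StorageDir {
        public final Map<String, String> subdirs = new LinkedHashMap<>();
        public final boolean failOnSave; // hypothetical switch simulating an I/O error in step 2

        public StorageDir(boolean failOnSave) {
            this.failOnSave = failOnSave;
            subdirs.put("current", "old-image");
        }
    }

    /** Step 1: rename current to lastcheckpoint.tmp in every directory. */
    private static void moveCurrentAside(List<StorageDir> dirs) {
        for (StorageDir sd : dirs) {
            sd.subdirs.put("lastcheckpoint.tmp", sd.subdirs.remove("current"));
        }
    }

    /** Problematic order: step 3 runs even if step 2 failed everywhere. */
    public static void saveUnconditionally(List<StorageDir> dirs) {
        moveCurrentAside(dirs);
        for (StorageDir sd : dirs) { // step 2: save image (may fail)
            if (!sd.failOnSave) {
                sd.subdirs.put("current", "new-image");
            }
        }
        for (StorageDir sd : dirs) { // step 3: always executed
            sd.subdirs.put("previous.checkpoint", sd.subdirs.remove("lastcheckpoint.tmp"));
        }
    }

    /** Guarded order: if no directory was saved, restore the old image instead. */
    public static boolean saveGuarded(List<StorageDir> dirs) {
        moveCurrentAside(dirs);
        List<StorageDir> saved = new ArrayList<>();
        for (StorageDir sd : dirs) { // step 2
            if (!sd.failOnSave) {
                sd.subdirs.put("current", "new-image");
                saved.add(sd);
            }
        }
        if (saved.isEmpty()) { // total failure: roll back step 1
            for (StorageDir sd : dirs) {
                sd.subdirs.put("current", sd.subdirs.remove("lastcheckpoint.tmp"));
            }
            return false;
        }
        for (StorageDir sd : saved) { // step 3 only where step 2 succeeded
            sd.subdirs.put("previous.checkpoint", sd.subdirs.remove("lastcheckpoint.tmp"));
        }
        return true;
    }

    /** Per the description: a restart needs a current or a lastcheckpoint.tmp directory. */
    public static boolean restartable(StorageDir sd) {
        return sd.subdirs.containsKey("current") || sd.subdirs.containsKey("lastcheckpoint.tmp");
    }

    public static void main(String[] args) {
        List<StorageDir> dirs = Arrays.asList(new StorageDir(true), new StorageDir(true));
        saveUnconditionally(dirs);
        System.out.println("unconditional, restartable: " + restartable(dirs.get(0)));

        dirs = Arrays.asList(new StorageDir(true), new StorageDir(true));
        boolean ok = saveGuarded(dirs);
        System.out.println("guarded, saved=" + ok + ", restartable: " + restartable(dirs.get(0)));
    }
}
```

In this model, a total failure in step 2 leaves the unconditional variant with only previous.checkpoint in each directory, which matches the "not formatted" symptom described above. The guarded variant keeps either the old image or a recoverable lastcheckpoint.tmp in every directory.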

      Attachments

      1. hdfs1921_v23.patch (3 kB, Matt Foley)
      2. hdfs-1505-1-test.txt (3 kB, Matt Foley)
      3. hdfs1921_v23.patch (3 kB, Matt Foley)
      4. hdfs-1921.txt (5 kB, Todd Lipcon)
      5. hdfs-1921-2.patch (5 kB, Matt Foley)
      6. hdfs-1921-2_v22.patch (5 kB, Matt Foley)

          Activity

          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-22-branch #61 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/61/)

          Matt Foley made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Matt Foley added a comment -

          Committed to v22. Thanks, Aaron and Todd, for the reviews and help!

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #673 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/673/)

          Matt Foley added a comment -

          That's the theory. I'll try it out.

          Todd Lipcon added a comment -

          +1. Matt, want to commit this? (your account's all set, now, right?)

          Matt Foley added a comment -

          Thank you, Aaron!

          If someone would like to +1, we can commit...

          Aaron T. Myers added a comment -

          Sure, Matt. Here's the output from test-patch on branch-0.22:

          +1 overall.  
          
              +1 @author.  The patch does not contain any @author tags.
          
              +1 tests included.  The patch appears to include 3 new or modified tests.
          
              +1 javadoc.  The javadoc tool did not generate any warning messages.
          
              +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          
              +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          
              +1 release audit.  The applied patch does not increase the total number of release audit warnings.
          
              +1 system test framework.  The patch passed system test framework compile.
          
          Matt Foley made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479653/hdfs-1921-2_v22.patch
          against trunk revision 1124364.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/564//console

          This message is automatically generated.

          Matt Foley made changes -
          Attachment hdfs-1921-2_v22.patch [ 12479653 ]
          Matt Foley added a comment -

          Here's the version for v22.
          Aaron and Todd, can one of you please run test-patch against this? I'm having trouble with my v22 build environment. Thanks.

          Turning off Hudson auto-build, since not for trunk.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #667 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/667/)
          HDFS-1921. saveNamespace can cause NN to be unable to come up on restart. Contributed by Matt Foley.

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1124364
          Files :

          • /hadoop/hdfs/trunk/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestSaveNamespace.java
          • /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          • /hadoop/hdfs/trunk/CHANGES.txt
          Matt Foley added a comment -

          Underway.

          Todd Lipcon added a comment -

          Committed to trunk. Matt, can you provide a patch for 0.22?

          Todd Lipcon added a comment -

          +1. failing tests are unrelated

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479614/hdfs-1921-2.patch
          against trunk revision 1104649.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/557//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/557//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/557//console

          This message is automatically generated.

          Matt Foley made changes -
          Attachment hdfs-1921-2.patch [ 12479614 ]
          Matt Foley added a comment -

          @Todd: nice tweak to the unit test. I changed the name of the subroutine to "doTestFailedSaveNamespace", since it isn't a test case in its own right.

          @Suresh:

          "Code of thread starting logic is duplicated. It could be added to a separate method."

          Sounded right, so I implemented the suggestion, and then concluded it made the code more complex instead of better, because of the way it worked out with the try/catch context and the management of the errorSDs list.

          "Also continue in catch block is redundant."

          The "continue"s are there for defensive coding: If someone adds statements after the catch context, but within the loop, I believe the catch context should go to the next loop iteration immediately.
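
To illustrate the defensive-coding point, here is a minimal sketch. It is not code from the patch: the class `ContinueInCatchSketch`, the `save` helper, and the `finalized` list are hypothetical, and `errorDirs` merely stands in for the errorSDs list mentioned above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a "continue" in a catch block inside a loop. */
public class ContinueInCatchSketch {

    /** Hypothetical per-directory operation; fails for names starting with "bad". */
    static void save(String dir) throws IOException {
        if (dir.startsWith("bad")) {
            throw new IOException("cannot save " + dir);
        }
    }

    /** Returns the directories that failed; successful ones are appended to finalized. */
    public static List<String> saveAll(List<String> dirs, List<String> finalized) {
        List<String> errorDirs = new ArrayList<>();
        for (String dir : dirs) {
            try {
                save(dir);
            } catch (IOException e) {
                errorDirs.add(dir);
                // Redundant while the catch is the last statement in the loop body,
                // but it protects any statement later added after the try/catch.
                continue;
            }
            // A later addition: thanks to the continue above, this never runs
            // for a directory whose save failed.
            finalized.add(dir);
        }
        return errorDirs;
    }
}
```

Without the `continue`, the statement added after the try/catch would also run for the failed directory.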

          "Minor: per the coding guidelines please add { } after if statements."

          Done, thanks.

          One more time

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479546/hdfs-1921.txt
          against trunk revision 1104649.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/553//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/553//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/553//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          +1. The latest patch, with the merged test case, looks good to me.

          Todd Lipcon made changes -
          Attachment hdfs-1921.txt [ 12479546 ]
          Todd Lipcon added a comment -

          Merged the patch that has the fix with the old test case referenced from HDFS-1505. Verified that the test fails without the fix and passes with it. Matt, can you sanity check this?

          Matt Foley added a comment -

          Dmytro, since this is a mod of HDFS-1071, would you like to review it?
          It's short. Thanks, if you have time.

          Suresh Srinivas added a comment -

          Looks good.

          Comments:

          1. Code of thread starting logic is duplicated. It could be added to a separate method. Also continue in catch block is redundant.
          2. Minor: per the coding guidelines please add { } after if statements.
          Todd Lipcon added a comment -

          "Is there any serious clash with patches already done for HDFS-1073? Thanks"

          Yes, this will most likely clash with 1073. If you want to commit with trunk, that's fine - the next time I merge 1073 with trunk, I'll probably just add a TODO task for that branch to make sure that we haven't regressed this behavior.

          Matt Foley added a comment -

          None of the test errors are related to this patch (all four are recurring; see HDFS-1852).
          I agree with Aaron that his new unit test for HDFS-1505 is a good test for this patch too, so no additional unit tests needed (but the core of that unit test is attached to this Jira, and passes local testing).

          Matt Foley added a comment -

          Todd and Aaron, regarding v23: I understand this may be modified by work in HDFS-1073, but until HDFS-1073 is ready to come out I'd like to keep trunk as clean as possible. So I think this patch should go into both v22 and v23. Is there any serious clash with patches already done for HDFS-1073? Thanks.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479025/hdfs1921_v23.patch
          against trunk revision 1102513.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/512//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/512//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/512//console

          This message is automatically generated.

          Matt Foley made changes -
          Attachment hdfs1921_v23.patch [ 12479025 ]
          Matt Foley added a comment -

          resubmitting the patch file in case Hudson got confused by the ordering.

          Matt Foley made changes -
          Attachment hdfs-1505-1-test.txt [ 12479024 ]
          Matt Foley added a comment -

          Here's the modified form of the test that works - there was a glitch in spy storage setup. The test passes.

          Matt Foley made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Matt Foley made changes -
          Attachment hdfs1921_v23.patch [ 12479019 ]
          Matt Foley added a comment -

          Here's a patch for trunk, so it will run under auto-test. I'll post the v22 version when it passes.

          The HDFS-1505 test case should work if this patch is added. Can you please try it, as I was getting a failure to unlock the storage dir upon FSNamesystem.close().

          Aaron T. Myers added a comment -

          Also, I should mention that there's a test case posted on HDFS-1505 which will illustrate this case.

          Aaron T. Myers made changes -
          Link This issue relates to HDFS-1896 [ HDFS-1896 ]
          Aaron T. Myers added a comment -

          Hey Matt, that's great news. Thanks for picking this up.

          I just talked to Todd, and he agrees that this code will be superseded in 0.23 by the work that's going on in HDFS-1073. So, I think it's reasonable to only work on a patch for 0.22 as part of this JIRA.

          Matt Foley added a comment -

          I will propose a patch for this, unless Dmytro wants it.

          Matt Foley made changes -
          Assignee Matt Foley [ mattf ]
          Todd Lipcon made changes -
          Field Original Value New Value
          Priority Critical [ 2 ] Blocker [ 1 ]
          Todd Lipcon added a comment -

          I agree this is a blocker

          Aaron T. Myers created issue -

            People

            • Assignee: Matt Foley
            • Reporter: Aaron T. Myers
            • Votes: 0
            • Watchers: 4
