Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.1, 0.21.0, 0.22.0
    • Fix Version/s: 0.20.3
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      This is a continuation of a discussion from HDFS-909. The FSImage.saveFSImage function (implementing dfsadmin -saveNamespace) can corrupt the NN storage such that all current edits are lost.

      Attachments

      1. saveNamespace-0.21.patch
        34 kB
        Konstantin Shvachko
      2. saveNamespace-0.21.patch
        41 kB
        Konstantin Shvachko
      3. saveNamespace-0.20.patch
        30 kB
        Konstantin Shvachko
      4. saveNamespace-0.20.patch
        37 kB
        Konstantin Shvachko
      5. saveNamespace.txt
        33 kB
        Konstantin Shvachko
      6. saveNamespace.patch
        34 kB
        Konstantin Shvachko
      7. saveNamespace.patch
        38 kB
        Konstantin Shvachko
      8. saveNamespace.patch
        41 kB
        Konstantin Shvachko
      9. saveNamespace.patch
        41 kB
        Konstantin Shvachko
      10. PurgeEditsBeforeImageSave.patch
        3 kB
        Konstantin Shvachko
      11. hdfs-955-unittest.txt
        5 kB
        Todd Lipcon
      12. hdfs-955-moretests.txt
        16 kB
        Todd Lipcon
      13. FSStateTransition7.htm
        119 kB
        Konstantin Shvachko

          Activity

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #231 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/231/)
          HDFS-955. New implementation of saveNamespace() to avoid loss of edits when name-node fails during saving. Contributed by Konstantin Shvachko.

          Konstantin Shvachko added a comment -

          I just committed this.

          Konstantin Shvachko added a comment -

          Here are the new patches for 0.21 and 0.20 branches.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12440629/saveNamespace.patch
          against trunk revision 930967.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 11 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/148/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/148/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/148/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/148/console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12440629/saveNamespace.patch
          against trunk revision 929406.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 11 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/297/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/297/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/297/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/297/console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          +1 for the patch. BTW, thanks for documenting the various directories under the storage directories. It adds more clarity to the code and to the save-namespace and upgrade mechanisms.

          Konstantin Shvachko added a comment -

          > TestSaveNamespace - should the switch cases for MOVE_CURRENT and MOVE_LAST_CHECKPOINT be switched?

          Got it, this is what you meant by switching.

          Konstantin Shvachko added a comment -

          > loadFSImage() - why do we need a check for the directory type while comparing the latest checkpoint times?

          The IMAGE latest checkpoint time can be larger than the EDITS latest checkpoint time only if the two directories are image-only and edits-only, respectively. If this is one directory of type IMAGE_AND_EDITS, then the condition should be treated as an error. This could be an assert in the current code, but being overprotective seems appropriate here.
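          A hedged Java sketch of the check being defended here; the class, enum, and parameter names are illustrative, not the actual FSImage code:

          import java.io.IOException;

          /** Sketch of the overprotective checkpoint-time check described above. */
          class CheckpointTimeCheckSketch {
            enum NameNodeDirType { IMAGE, EDITS, IMAGE_AND_EDITS }

            void check(long imageTime, long editsTime,
                       NameNodeDirType imageDirType, NameNodeDirType editsDirType)
                throws IOException {
              if (imageTime > editsTime) {
                // Legitimate only when the image and edits live in separate,
                // single-purpose directories; otherwise treat it as corruption.
                boolean separateDirs = imageDirType == NameNodeDirType.IMAGE
                    && editsDirType == NameNodeDirType.EDITS;
                if (!separateDirs)
                  throw new IOException("Image checkpoint time " + imageTime
                      + " is newer than edits checkpoint time " + editsTime);
              }
            }
          }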

          > saveCurrent(), moveCurrent(), moveLastCheckpoint() could all be static. Better still, they should all be methods in StorageDirectory.

          saveCurrent() cannot be static; it uses the editLog member. All three methods are very much name-node specific, while StorageDirectory is intended to hold methods common to all storage types, data-node or name-node. So I decided to keep them in FSImage.

          > TestSaveNamespace - we should add other tests

          I believe these cases are covered in the updated TestDFSStorageStateRecovery. The test emulates failures at different stages by creating the corresponding storage directory configurations and verifies that the nodes recover or fail correctly.

          The rest is corrected / updated.
          Suresh, thank you for the meticulous review.

          Suresh Srinivas added a comment -
          1. FSImage.java
            • saveNameSpace() - in method comments, instead of "descrepancy between directory states", use something like "in order to help recovery in case of failure during saveNamespace"
            • typo: "onle"
            • loadFSImage() - why do we need a check for the directory type while comparing the latest checkpoint times?
            • loadFSImage() - instead of "after saving some of the images", use "after saving images in some of the storage directories" to avoid the interpretation that a partial image is saved.
            • saveCurrent(), moveCurrent(), moveLastCheckpoint() could all be static. Better still, they should all be methods in StorageDirectory.
          2. Storage.java
            • if (!hasCheckpointTmp) will always be true.
            • This might be a good time to document all the get***Dir() methods
            • analyzeStorage() - optional - can you add IOException after throws to fix the javadoc warning?
          3. TestSaveNamespace - should the switch cases for MOVE_CURRENT and MOVE_LAST_CHECKPOINT be switched?
          4. TestSaveNamespace - we should add other tests where saving the image could fail for the first directory as well, and tests that fail at different iterations of moveLastCheckpoint() and moveCurrent().
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12440371/saveNamespace.patch
          against trunk revision 929406.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 11 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/289/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/289/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/289/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/289/console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          New patch addresses Suresh's comments.

          Suresh Srinivas added a comment -

          With the new solution, I see the following issues:

          1. The order in which the namespace is saved is still important: it has to be the image followed by the edits. If the edits are emptied first, then during recovery either the edits are lost or the system will not start because the image time does not match that of the edits.
          2. On a failure after the image is saved but before the edits are emptied, the system may not start during recovery because of a mismatch between the fsimage and edits times. However, there is no loss of data in this case.
          Konstantin Shvachko added a comment -

          A patch for 0.20 branch.

          Konstantin Shvachko added a comment -

          This patch fixes the JavaDoc warning.
          I don't know why Hudson complains about test failures; I don't see any failed tests on the Test results page.
          The patch for 0.21 is mostly the same as for trunk, except for a couple of lines.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12439850/saveNamespace.txt
          against trunk revision 927999.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 11 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/136/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/136/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/136/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/136/console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          Also, I did not change LAYOUT_VERSION because there is no change in any layout. I reused the lastcheckpoint.tmp and previous.checkpoint directories, which already existed for making checkpoints.

          Konstantin Shvachko added a comment -

          This patch fixes the problem. The algorithm is as follows:

          1. if current exists, then rename current to lastcheckpoint.tmp
          2. create new current directory, and save new image into it
          3. remove previous.checkpoint if exists
          4. rename lastcheckpoint.tmp to previous.checkpoint

          I added the new recovery cases to TestDFSStorageStateRecovery.
          I modified Todd's test to reflect the new code. I renamed saveFSImage() to saveNamespace(), because it saves both fsimage and edits.
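          A minimal Java sketch of this rename sequence, run per storage directory; saveCurrent() here is a stand-in for writing the image, the empty edits, and fstime, not the actual patch code:

          import java.io.File;
          import java.io.IOException;

          /** Sketch of the per-directory save sequence described above. */
          class SaveNamespaceSketch {
            void saveNamespace(File root) throws IOException {
              File current = new File(root, "current");
              File lastCkptTmp = new File(root, "lastcheckpoint.tmp");
              File prevCkpt = new File(root, "previous.checkpoint");

              // 1. if current exists, rename it to lastcheckpoint.tmp
              if (current.exists() && !current.renameTo(lastCkptTmp))
                throw new IOException("Cannot rename " + current + " to " + lastCkptTmp);
              // 2. create a new current directory and save the new image into it
              if (!current.mkdir())
                throw new IOException("Cannot create " + current);
              saveCurrent(current);
              // 3. remove previous.checkpoint if it exists
              deleteRecursively(prevCkpt);
              // 4. rename lastcheckpoint.tmp to previous.checkpoint
              if (lastCkptTmp.exists() && !lastCkptTmp.renameTo(prevCkpt))
                throw new IOException("Cannot rename " + lastCkptTmp + " to " + prevCkpt);
            }

            /** Hypothetical stand-in: writes fsimage, an empty edits, and fstime. */
            void saveCurrent(File dir) throws IOException { }

            void deleteRecursively(File f) throws IOException {
              if (!f.exists()) return;
              File[] children = f.listFiles();
              if (children != null)
                for (File c : children) deleteRecursively(c);
              if (!f.delete())
                throw new IOException("Cannot delete " + f);
            }
          }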

          Konstantin Shvachko added a comment -

          The highlighted (yellow) text explains how saveNamespace() works and what the recovery procedure is. The recovery is a common procedure, which includes recovery from failed upgrades (taking snapshots); that is why I updated this document under HADOOP-702 and placed it here.

          Todd Lipcon added a comment -

          Sounds good to me. Looking forward to your patch. Thanks Konstantin.

          Konstantin Shvachko added a comment -

          Unfortunately, this solution does not work either. The problem is that it assumes all files are in the same directory, while in our model the edits and image directories may be independent of each other. This means that we cannot rely on the presence or absence of EDITS_NEW (and EDITS) to decide whether to remove or promote IMAGE_NEW, because the system can die when EDITS_NEW has been renamed to EDITS in one directory but not in another. We are trying to reconstruct the stage of the NN storage transformation sequence at which it crashed by examining the remaining files. This is error-prone and introduces unnecessary complexity. We should instead apply the technique used in the BackupNode and for the upgrade.

          A Better Solution

          The idea is to create a temporary directory and accumulate all necessary changes to the persistent data in it, and then rename it to current once the new data is ready. The rename is two-step, not atomic, but it minimizes the recovery effort. Here is how saveFSImage() should work.

          1. Create prospective_current.tmp, and write necessary files in it.
            • Save new image into prospective_current.tmp/IMAGE
            • Create empty prospective_current.tmp/EDITS
            • Create VERSION and fstime files in prospective_current.tmp and write new checkpointTime.
          2. Rename current to removed_current.tmp
          3. Rename prospective_current.tmp to current
          4. Remove removed_current.tmp

          And the recovery procedure is very simple (see the code sketch at the end of this comment):

          • if current.exists && prospective_current.tmp.exists then remove prospective_current.tmp
          • if ! current.exists && prospective_current.tmp.exists then rename prospective_current.tmp to current and remove removed_current.tmp

          It is important that the image and edits directories are operated on (created and recovered) independently of each other but maintain the same meta-data state.
          I plan to implement this algorithm and will try to reuse some code from the BackupNode.
          I will not change the checkpoint procedure for the SNN, since it is deprecated, and it should not cause problems, because:

          • Checkpoint cannot start when saveFSImage is in progress.
          • If the checkpoint image upload started before saveFSImage, then the upload will continue into current, and the subsequent rollFSImage will fail either because the NN is in safe mode (saveFSImage is still in progress) or because EDITS_NEW no longer exists (saveFSImage already completed).
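          A minimal Java sketch of the recovery rule above, applied independently to each storage directory (helper names are illustrative):

          import java.io.File;
          import java.io.IOException;

          /** Sketch of the per-directory recovery for the proposed save sequence. */
          class DirectoryRecoverySketch {
            void recover(File root) throws IOException {
              File current = new File(root, "current");
              File prospective = new File(root, "prospective_current.tmp");
              File removed = new File(root, "removed_current.tmp");

              if (current.exists() && prospective.exists()) {
                // Crashed before current was renamed away: the new data may be
                // incomplete, so discard it.
                deleteRecursively(prospective);
              } else if (!current.exists() && prospective.exists()) {
                // Crashed between the two renames: the new data is complete,
                // so promote it and drop the old state.
                if (!prospective.renameTo(current))
                  throw new IOException("Cannot rename " + prospective + " to " + current);
                deleteRecursively(removed);
              }
            }

            void deleteRecursively(File f) throws IOException {
              if (!f.exists()) return;
              File[] children = f.listFiles();
              if (children != null)
                for (File c : children) deleteRecursively(c);
              if (!f.delete())
                throw new IOException("Cannot delete " + f);
            }
          }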
          Todd Lipcon added a comment -

          If I'm reading correctly, your solution is the same as my Option #4 above, right?

          Are you already working on this or should I start work on a patch that implements this option?

          Konstantin Shvachko added a comment -

          The Problem

          Our recovery logic for the IMAGE_NEW file was originally intended for checkpoint recovery, and it works in that case. But it does not work for recovery from a saveFSImage() failure.

          The storage directory may contain four files: IMAGE, EDITS, EDITS_NEW, and IMAGE_NEW.
          Here are the steps we perform during checkpoint:
          0. Initially the storage directory has IMAGE and EDITS files only.
          1. Start checkpoint. NN creates EDITS_NEW, and starts streaming edits into it.
          2. Upload IMAGE_NEW from SNN to NN storage directory.
          3. When upload is done, rename EDITS_NEW -> EDITS.
          4. Rename IMAGE_NEW -> IMAGE. Back to the initial state.

          Here is the time-line of which combination of files represent the current state of the file system relative to the events above.

          IMAGE + EDITS — (1) — IMAGE + EDITS + EDITS_NEW — (2) — IMAGE + EDITS + EDITS_NEW — (3) — IMAGE_NEW + EDITS — (4) — IMAGE + EDITS

          The recovery procedure:

          • If EDITS_NEW.exists, then we know NN failed after 1 or 2, but before 3, and our recovery strategy is to discard IMAGE_NEW.
          • If ! EDITS_NEW.exists && IMAGE_NEW.exists, then NN failed after 3, but before 4, and we recover by upgrading IMAGE_NEW to IMAGE.

          Now let's see what happens when we save the image during startup or saveNamespace.
          Here are the steps we perform when we call saveFSImage():
          0. Initially the storage directory has IMAGE, EDITS, and potentially EDITS_NEW, which have been all loaded and digested in NN RAM.
          1. Create EDITS_NEW if missing.
          2. Save IMAGE_NEW.
          3. Empty EDITS and EDITS_NEW.
          4. Rename EDITS_NEW -> EDITS.
          5. Rename IMAGE_NEW -> IMAGE.

          We use the same recovery procedure here as in checkpointing, which leads to data loss in the following failure scenario.
          If we fail after 3 but before 4, then we will discard IMAGE_NEW, because EDITS_NEW.exists.
          But the latest updates in EDITS and/or EDITS_NEW have already been wiped out, and we lose these edits forever.

          The main reason the checkpointing logic does not work for saving is that IMAGE_NEW has different semantics in these two cases.

          • In checkpoint IMAGE_NEW = IMAGE + EDITS
          • In saveFSImage IMAGE_NEW = IMAGE + EDITS + EDITS_NEW

          The Solution

          Different images should be represented by separate files and treated differently. I'll denote them

          • IMAGE_CKPT = IMAGE + EDITS, the checkpoint image
          • IMAGE_LAST = IMAGE + EDITS + EDITS_NEW, the last saved image

          So the checkpoint process will create IMAGE_CKPT and will work with it exactly as before; no changes here.

          saveFSImage will save NN's memory state into IMAGE_LAST, and should consist of the following steps:
          0. Initially the storage directory has IMAGE, EDITS, and potentially EDITS_NEW and IMAGE_CKPT.
          1. Save image into IMAGE_LAST.
          2. Remove EDITS, IMAGE_CKPT, and EDITS_NEW - in the order listed.
          3. Rename IMAGE_LAST -> IMAGE.
          4. Create empty EDITS.

          It is important to note that a checkpoint cannot start once saveFSImage has started, because the NN is in safe mode and because it holds the NN lock. If the upload of IMAGE_CKPT has started (stage c-2), it will proceed concurrently with the save. But rollEdits() (stage c-3) will fail if called during saveFSImage.

          Here is the time-line of which combination of files represent the current state of the file system relative to the events above.

          IMAGE + EDITS + EDITS_NEW — (1) — IMAGE + EDITS + EDITS_NEW — (2) — IMAGE_LAST — (3) — IMAGE — (4) — IMAGE + EDITS

          The recovery procedure for saving image is:

          • If EDITS.exists && IMAGE_LAST.exists, then we know NN failed after 1 but before 2, and we recover by discarding IMAGE_LAST.
          • If ! EDITS.exists && IMAGE_LAST.exists, then NN failed during or after 2, and we recover by applying 2, 3, and 4.
          • If ! EDITS.exists && ! IMAGE_LAST.exists, then NN failed after 3, and we recover by applying 4.

          There is a slight complication on Windows file systems, which will not let us remove IMAGE_CKPT at step 2 if the checkpointer is still writing into it. In this case we will ignore the failure and quit the procedure, delaying the rest of the steps until later. The correct state (rename IMAGE_LAST to IMAGE) will be restored either when the checkpoint finishes or when the NN restarts due to a failure.
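          The three recovery cases read naturally as code. A sketch, with illustrative file names (fsimage.last standing in for the proposed IMAGE_LAST, since this scheme is a proposal):

          import java.io.File;
          import java.io.IOException;

          /** Sketch of the proposed recovery after a saveFSImage() failure. */
          class SaveImageRecoverySketch {
            void recover(File dir) throws IOException {
              File image     = new File(dir, "fsimage");       // IMAGE
              File edits     = new File(dir, "edits");         // EDITS
              File editsNew  = new File(dir, "edits.new");     // EDITS_NEW
              File imageCkpt = new File(dir, "fsimage.ckpt");  // IMAGE_CKPT
              File imageLast = new File(dir, "fsimage.last");  // IMAGE_LAST (proposed)

              if (edits.exists() && imageLast.exists()) {
                // Failed after 1 but before 2: IMAGE_LAST cannot be trusted.
                delete(imageLast);
              } else if (!edits.exists() && imageLast.exists()) {
                // Failed during or after 2: finish steps 2, 3, and 4.
                delete(imageCkpt);
                delete(editsNew);
                delete(image);                      // step 3 replaces the old image
                if (!imageLast.renameTo(image))
                  throw new IOException("Cannot rename " + imageLast + " to " + image);
                createEmpty(edits);
              } else if (!edits.exists()) {
                // Failed after 3: only step 4 remains.
                createEmpty(edits);
              }
            }

            void delete(File f) throws IOException {
              if (f.exists() && !f.delete())
                throw new IOException("Cannot delete " + f);
            }

            void createEmpty(File f) throws IOException {
              if (!f.createNewFile())
                throw new IOException("Cannot create " + f);
            }
          }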

          Todd Lipcon added a comment -

          If EDITS_NEW is not present, but IMAGE_NEW is, this means that the NN failure occurred after IMAGE_NEW was successfully written

          This is not necessarily the case on a lot of filesystems. As I noted in HDFS-970, delayed allocation combined with the default journaling modes in many commonly deployed filesystems means that you cannot use the existence of one file to determine whether data has been flushed in another. That is to say, some filesystems will recover the metadata operations on the EDITS files even though the data operations on IMAGE_NEW are incomplete.

          The only way we can know that IMAGE_NEW is really on disk across a variety of filesystems is to fsync it. Otherwise, when the filesystem is recovered, we could roll back to a state where the file is empty but EDITS_NEW has been removed.
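          For reference, a minimal sketch of that fsync in plain Java; writeAndSync() is an illustrative name, not an existing NN method:

          import java.io.File;
          import java.io.FileOutputStream;
          import java.io.IOException;

          /** Sketch: force a freshly written image file to disk before any rename. */
          class FsyncSketch {
            static void writeAndSync(File imageNew, byte[] contents) throws IOException {
              FileOutputStream out = new FileOutputStream(imageNew);
              try {
                out.write(contents);
                out.flush();         // drain user-space buffers
                out.getFD().sync();  // fsync: the data must reach the device before
                                     // recovery logic may rely on the file's existence
              } finally {
                out.close();
              }
            }
          }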

          During start up the NN decides on whether to discard or to keep IMAGE_NEW (and rename it to IMAGE) based on the existence of EDITS_NEW

          I agree this is what it does. But I don't think there is then any valid rolling order that tolerates arbitrary crashes. See my discussion above

          The contents may be corrupted during failures; it is not safe to rely on reading the data from the image or edits files

          It is safe if we fsync. Metadata can also be corrupted (rolled back to indeterminate states) in failures. Especially with the broken way in which we currently do image replacement, I don't want to take chances here. This is best explained by the presentation linked from this post by Theodore Ts'o: http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/don%E2%80%99t-fear-fsync

          Konstantin Shvachko added a comment -

          >> The criteria that IMAGE_NEW was written completely and successfully is the existence of EDITS_NEW
          > I think you misspoke here - EDITS_NEW exists before IMAGE_NEW is saved.

          This is what I meant. During start up the NN decides on whether to discard or to keep IMAGE_NEW (and rename it to IMAGE) based on the existence of EDITS_NEW. If EDITS_NEW exists then it simply removes IMAGE_NEW. This means that the NN failure occurred before IMAGE_NEW was completed. If EDITS_NEW is not present, but IMAGE_NEW is, this means that the NN failure occurred after IMAGE_NEW was successfully written, and therefore the NN just needs to complete the checkpoint by renaming IMAGE_NEW to IMAGE and purging the edits.

          > You may be able to know that info from the state of some other files, but why not be explicit about it to avoid some classes of errors?

          We want to be able to know about the failure without reading the contents of the image file. The contents may be corrupted during failures; it is not safe to rely on reading the data from the image or edits files.

          Todd Lipcon added a comment -

          I don't understand why you mixed the code from HDFS-957 into this patch. It does not help the test cases, right?

          I was just working on the two in the same tree - with a bit of modification to recoverInterruptedCheckpoint I think the second test can be fixed using the functionality from HDFS-957. I can demonstrate this with a patch if you would like.

          The criteria that IMAGE_NEW was written completely and successfully is the existence of EDITS_NEW

          I think you misspoke here - EDITS_NEW exists before IMAGE_NEW is saved. In my opinion the cleanest way of knowing that IMAGE_NEW is complete is the HDFS-957 patch. You may be able to know that info from the state of some other files, but why not be explicit about it to avoid some classes of errors?

          I don't know what you were trying to achieve with this. I don't see the expected exception thrown.

          Ah, sloppy copy-paste there on my part. I don't expect an exception to actually be caught there. The failed restart with corrupted edits is indeed the failure I expected to provoke with that test.

          Konstantin Shvachko added a comment -

          Very good test cases.
          I don't understand why you mixed the code from HDFS-957 into this patch. It does not help the test cases, right?

          1. testCrashWhileSavingSecondImage() passes with my fix.
          2. testCrashBeforeRollingFSImage() does not pass. The reason is that we
            • save the image into IMAGE_NEW
            • then empty EDITS and EDITS_NEW
            • then crash
              The criteria that IMAGE_NEW was written completely and successfully is the existence of EDITS_NEW. So if we try to restart the NN at this point, IMAGE_NEW will be discarded (since EDITS_NEW is present) and we end up with the old IMAGE.
          3. testSaveWhileEditsRolled(). I don't know what you were trying to achieve with this. I don't see the expected exception thrown. But it fails, and the reason here is that after saveNamespace() we get a corrupted edits file. It has the version bytes but does not have the end-mark byte (OP_INVALID) at the end. I think it was not closed properly.

          Will try to fix this.
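          A small sketch of the end-mark check implied here, assuming OP_INVALID is a single -1 byte that terminates a cleanly closed log (as in FSEditLog):

          import java.io.File;
          import java.io.IOException;
          import java.io.RandomAccessFile;

          /** Sketch: was this edits file closed with the OP_INVALID end mark? */
          class EditsEndMarkSketch {
            static final byte OP_INVALID = -1; // assumed end-of-log marker value

            static boolean isProperlyClosed(File edits) throws IOException {
              RandomAccessFile raf = new RandomAccessFile(edits, "r");
              try {
                if (raf.length() == 0)
                  return false;                    // not even the version bytes
                raf.seek(raf.length() - 1);
                return raf.readByte() == OP_INVALID;
              } finally {
                raf.close();
              }
            }
          }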

          Konstantin Shvachko added a comment -

          I'll take a look at your patch, thanks.

          Todd Lipcon added a comment -

          Here's a patch which includes your patch from before, plus two new unit tests. Your patch fixes the original test, but not the two new ones.

          I think that, unless we implement one of the proposals from my comment above, there is no way to rearrange the current code so that it does this safely. I'd love to be proven wrong here - I really hope there is a simple fix as you suggest.

          Todd Lipcon added a comment -

          because the problem stated here has a simple solution, and because it is important to fix it

          I'm still waiting for that simple solution. I don't believe it does have one.

          Your solution of reordering the loops shrinks the time window in which the problem can occur, but it still occurs. Note that the "state 5" problem in my comment above occurs regardless of the number of storage directories.

          I'll upload a patch momentarily which includes another unit test that shows the issue.

          Konstantin Shvachko added a comment -

          Todd, I did not get a clear idea of whether what you are trying to explain belongs to HDFS-957 or HDFS-909 or yet another new jira, but it does not seem to belong here. I thought this issue was about fixing a rather simple bug: correcting the order in which directories are scanned when we save the fsimage and purge the edits. I don't think the issue should be converted into a project at this point, because the problem stated here has a simple solution, and because it is important to fix it. Could you please indicate whether you still intend to fix it?

          Todd Lipcon added a comment -

          One thing I'd also like to consider more is the interaction of these processes with filesystem journaling

          I read up a bit on journaling implementations in ext3, ext4, and XFS. It looks like the current "write file, close, then rename to replace" method is considered an antipattern by the kernel folks, since with delayed allocation it is possible that on a crash the new file will end up with length 0 (or zeroed blocks in XFS). We definitely need an fsync after writing the new fsimage. I will open a separate JIRA for this.

          Todd Lipcon added a comment -

          I worked through this a bit last night. Here are some options for a solution.

          1. Add "undo log" file to storage directory

          In this solution, we add a new file called "undolog" in each storage directory. Whenever we're in the midst of a transition, we write some bit of data in this file that explains what the proper rollback procedure is. Thus, for the checkpoint from the checkpoint node, we'd write a file that says "if IMAGE_NEW is complete, use IMAGE_NEW + EDITS_NEW. Otherwise use IMAGE + EDITS + EDITS_NEW". For the saveNamespace operation, we'd write "If IMAGE_NEW is complete, use IMAGE_NEW. Otherwise use IMAGE + EDITS + EDITS_NEW".

          This has the advantage of making the recovery choices explicit during all state transitions - we're forced to think carefully after each step of the operation in order to maintain the undo instructions.

          On the downside, it's more complexity.

          2. Don't allow -saveNamespace when the logs are in ROLLED state

          I don't like this one at all, but it would allow us to always use the IMAGE_NEW + EDITS_NEW recovery.

          3. Redesign rolling to not reuse filenames

          This is a much bigger change, but I think it would also help simplify a lot of the code. The proposal here is to manage edit logs in a way that's similar to what MySQL does. Specifically, instead of IMAGE and IMAGE_NEW plus EDITS and EDITS_NEW, we simply have a monotonically increasing identifier on each log file. So, the state of the system starts with image_0 and edits_0. Logs may be rolled at any point, which increments edits_N. So in a normal operation we'd see:

          image_0
          edits_0 <- writing here

          [roll edits]
          image_0
          edits_0
          edits_1 <- writing here

          [checkpoint node fetches image_0 and edits_0, and uploads image_1]

          image_0 <- this is now "stale" and can be garbage collected later
          image_1 <- this contains image_0 + edits_0
          edits_0 <- this is also stale
          edits_1 <- still being written

          This design has many plusses in my view:

          1. Files never change names, and thus race conditions like HDFS-909 are less likely, so long as the current number is synchronized.
          2. Recovery is much simpler - you can always recover from image_n + edits_n through edits_max, so long as image_n is complete. Any incomplete or corrupt images can always be safely ignored so long as there is an earlier one, plus all the edit logs going back to that point.
          3. the fstime checking logic is simplified - an image made from image_N plus edits_N through edits_(M-1) is always going to be called image_M. Any image_M from any storage directory should be identical regardless of any ongoing rolls.
          4. edit logs and images can both be kept for some time window, simplifying backup and recovery a bit while also providing an easy mechanism for point-in-time recovery of the namespace. Although PITR is less than useful if the data blocks are gone, this mechanism would make it impossible for a bug like HDFS-909 or HDFS-955 to lose edits, since files are never truncated or removed until after they are "stale".
          5. We no longer have to be careful about the NN's "rolled" vs "upload_done" vs "start" state - the logs are looked at as constantly rolling, and it's always clear where to apply a checkpoint image.

          The downside, of course, is that it's a very big change, definitely not a candidate for backport, and could take a while.
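
          To make the "recovery is much simpler" point above concrete, the selection rule could be sketched as below. Every helper named here (listImageIds, isComplete, imageFile, editsFile, maxEditsId, loadImage, replayEdits) is hypothetical shorthand for this proposal, not existing code:

          // Hypothetical sketch of option-3 recovery: load the newest complete
          // image_N, then replay edits_N .. edits_max in order.
          long best = -1;
          for (long n : listImageIds(dir)) {
            if (n > best && isComplete(imageFile(dir, n))) {
              best = n;                       // newest image passing its integrity check
            }
          }
          loadImage(imageFile(dir, best));
          for (long n = best; n <= maxEditsId(dir); n++) {
            replayEdits(editsFile(dir, n));   // only the last log can be partially written
          }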

          4. Distinguish IMAGE_NEW_CKPT vs IMAGE_NEW_SAVED

          Rather than having a single IMAGE_NEW filename like we do now, we could split it into IMAGE_NEW_CKPT and IMAGE_NEW_SAVED. The recovery mechanism for these would differ in that, if there is a completed IMAGE_NEW_CKPT, then it will recover IMAGE_NEW_CKPT + EDITS_NEW. If there is a completed IMAGE_NEW_SAVED, then it can truncate both EDITS and EDITS_NEW during recovery, since a saved namespace encompasses both.

          Unfortunately, not one of these is a simple fix. If you have any proposals that are both simple and correct, I'd be very interested to hear them.

          One thing I'd also like to consider more is the interaction of these processes with filesystem journaling. I'm not sure if ext3's data=ordered journaling mode (probably the most common deployment configuration) guarantees quite enough ordering between different files that all of the above will work correctly in the event of host failures. I need to learn more about that and report back.
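
          For reference, the durable variant of the write-then-rename pattern looks roughly like this. It's a generic java.io sketch (atomicReplace is a made-up name), not code from any patch on this issue:

          // Write the new image under a temporary name, force it to disk, then
          // rename it over the old file. Without the sync(), a crash can leave
          // a zero-length or partially written file behind.
          void atomicReplace(File tmp, File dst, byte[] data) throws IOException {
            FileOutputStream out = new FileOutputStream(tmp);
            try {
              out.write(data);
              out.getFD().sync();             // contents durable before the rename
            } finally {
              out.close();
            }
            if (!tmp.renameTo(dst)) {         // atomic on POSIX filesystems
              throw new IOException("rename failed: " + tmp + " -> " + dst);
            }
            // Strictly the parent directory entry should be synced as well,
            // which plain Java of this vintage cannot do directly.
          }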

          Todd Lipcon added a comment -

          I think HDFS-957 provides a fix for both of the examples above, where the correct recovery in state 5 is IMAGE_NEW. Unfortunately, in the case where we've just received an actual checkpoint but haven't rolled yet, the correct recovery is IMAGE_NEW + EDITS_NEW, and I don't have a good way of distinguishing that situation from a manual saveNamespace, where we don't want to recover EDITS_NEW.

          I don't know that we can actually fix this with the current set of files and determinable state. I'll continue to think.

          Todd Lipcon added a comment -

          Without concurrent checkpoint

          Looking at this as a series of state transitions on the storage directory:

          State 1: normal operation
            Valid: IMAGE + EDITS
            (there is nothing special happening)
          
          State 2: createNewIfNotExists
            Valid a: IMAGE + EDITS
            Valid b: IMAGE + EDITS + EDITS_NEW (since EDITS_NEW is empty)
            Current recovery: b
          
          State 3: saving IMAGE_NEW
            Same validity as State 2
            Current recovery: b
          
          State 4: save IMAGE_NEW complete
            Valid a: IMAGE + EDITS
            Valid b: IMAGE + EDITS + EDITS_NEW (since EDITS_NEW is empty)
            Valid c: IMAGE_NEW + EDITS_NEW (since EDITS_NEW is empty)
            Current recovery: b
          
          State 5: truncate EDITS and EDITS_NEW
            (a) and (b) are no longer valid
            Valid c: IMAGE_NEW + EDITS_NEW
            Current recovery: *b (this is the error we're seeing)
          
          State 6: rollFSImage -> purgeEditLog: moves EDITS_NEW to EDITS
            Valid: IMAGE_NEW + EDITS
            Current recovery: rename IMAGE_NEW to IMAGE (correct)
          
          State 7: rollFSImage -> renameCheckpoint: moves IMAGE_NEW to IMAGE
            Valid: IMAGE + EDITS
            Current recovery: no recovery necessary (correct)
          

          The problem here is in State 5. The question is how to detect that we are in this state during recovery so we can do the right thing. This is where HDFS-957 comes in. With 957, the recovery logic can easily determine that IMAGE_NEW is correct, and choose the same recovery mechanism as state 6.

          With ongoing checkpoint (logs start rolled)

          State 1: normal operation
            Valid: IMAGE + EDITS + EDITS_NEW
            Recovery: IMAGE + EDITS + EDITS_NEW
          
          State 2: createNewIfNotExists
            no effect - NEW already exists
          
          State 3: saving IMAGE_NEW
            Same validity as State 2
            Current recovery: IMAGE + EDITS + EDITS_NEW
          
          State 4: save IMAGE_NEW complete
            Valid a: IMAGE + EDITS + EDITS_NEW
            Valid b: IMAGE_NEW only
            Current recovery: a
          
          State 5: truncate EDITS and EDITS_NEW
            Valid: IMAGE_NEW (any other recovery is incorrect)
            Current recovery: IMAGE + EDITS + EDITS_NEW (incorrect, loses data)
          
          State 6: rollFSImage -> purgeEditLog: moves EDITS_NEW to EDITS
            Valid: IMAGE_NEW + EDITS
            Current recovery: rename IMAGE_NEW to IMAGE (correct)
          
          State 7: rollFSImage -> renameCheckpoint: moves IMAGE_NEW to IMAGE
            Valid: IMAGE + EDITS
            Current recovery: no recovery necessary (correct)
          

          So the issue in both cases is essentially the same, and both can be solved if we use HDFS-957.

          I'll work on a patch for this.

          On a side note, I think there's another race where a checkpoint upload from the SNN can overlap with this operation and really screw things up. That's a separate JIRA though.
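
          For illustration, the state-5 check could be sketched like this; isComplete() stands in for whatever integrity marker HDFS-957 adds, and all names here are illustrative rather than actual FSImage code:

          // Hypothetical recovery check. If IMAGE_NEW exists and is provably
          // complete, we are at state 5 or later: EDITS/EDITS_NEW have been
          // (or were about to be) truncated, so finish the roll rather than
          // replaying them.
          if (imageNew.exists() && isComplete(imageNew)) {
            rename(imageNew, image);          // same recovery as states 6 and 7
          } else {
            // IMAGE_NEW is absent or partial: states 1-4, where
            // IMAGE + EDITS (+ EDITS_NEW) is still a valid recovery.
            deleteIfExists(imageNew);
          }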

          Todd Lipcon added a comment -

          HADOOP-6554 is the NPE on close. Sorry, I should have mentioned that you need to fix it by adding the appropriate if (thread != null) check. After fixing that, the test fails on the reopen. Also, the test doesn't fail if you switch to only one dfs.name.dir.
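
          For the record, the guard is just the following (sketch only; the field name is approximate, see HADOOP-6554 for the real fix):

          // In close(), only touch the checkpointing thread if it was
          // actually started - otherwise we hit the NPE described above.
          if (thread != null) {
            thread.interrupt();
          }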

          I'll take a look at your patch, thanks.

          Konstantin Shvachko added a comment -

          I checked whether the test passes if we change the order of writing in directories. It does, although it still throws an NPE on close. I am attaching the changes I made to FSImage while running your test, in case it saves you time fixing this. If not, please feel free to disregard it.

          Konstantin Shvachko added a comment -

          It is good to have the test. I ran it on current trunk. It successfully gets the expected exception, but then fails to close FSNamesystem with an NPE. So yes, it fails, but not where you wanted it to.

          Todd Lipcon added a comment -

          Here is a test that shows the issue.

          Todd Lipcon added a comment -

          Adding 0.20.1 as an Affects Version since I can reproduce this there. As a data-loss bug, I think this should eventually be backported once we have a fix.

          Todd Lipcon added a comment -

          This issue also occurs if there are multiple name/edit dirs and an exception is thrown while saving any fsimage other than the first.

          To reproduce, I added:

              // Fault injection for testing: every save after the first
              // fails, but only while the trigger file exists.
              if (savedCount++ > 0 && new File("/tmp/inject").exists()) {
                throw new IOException("Injected fault");
              } else {
                LOG.warn("Not injecting", new IOException());
              }
          

          along with a static int savedCount = 0 in FSImage.

          I started a NN, created some directories, shut it down. I then created /tmp/inject and started the NN again. It failed while saving the second image, as planned. This left the storage directories in such a state that starting the NN up again recovered from the first name dir, where the edits had already been blown away.

          This should be unit testable with mockito. I'll try to form such a test.
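
          A rough sketch of what the mockito version might look like; the saveFSImage(File) signature is an assumption about the 0.20-era FSImage, and obtaining realImage from the running NN is elided:

          // Hypothetical spy-based fault injection (assumes
          // import static org.mockito.Mockito.*;): every image save throws,
          // after which the test restarts the NN and asserts no edits are lost.
          FSImage spyImage = spy(realImage);
          doThrow(new IOException("Injected fault"))
              .when(spyImage).saveFSImage(any(File.class));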

          Todd Lipcon added a comment -

          I've verified the behavior from HDFS-909 on 0.20 (though I'm pretty certain it also exists on trunk).

          To reproduce, I did a little manual "fault injection" - I added

          if (new File("/tmp/savefsimage.die").exists()) System.exit(1);

          after saving IMAGE_NEW in saveFSImage. I then did the following sequence:

          • start NN
          • hadoop fs -mkdir test1
          • hadoop dfsadmin -safemode enter
          • touch /tmp/savefsimage.die
          • hadoop dfsadmin -saveNamespace
          • (NN "crashes")

          This leaves dfs.name.dir/current as:

          -rw-r--r-- 1 todd todd   4 2010-02-08 21:24 edits
          -rw-r--r-- 1 todd todd   4 2010-02-08 21:24 edits.new
          -rw-r--r-- 1 todd todd  94 2010-02-08 21:24 fsimage
          -rw-r--r-- 1 todd todd 323 2010-02-08 21:24 fsimage.ckpt
          -rw-r--r-- 1 todd todd   8 2010-02-08 21:24 fstime
          -rw-r--r-- 1 todd todd 100 2010-02-08 21:24 VERSION
          

          (fsimage.ckpt has the proper image including my directory)

          If I now remove the fault injection file and start the NN, it "recovers" to:

          -rw-r--r-- 1 todd todd   4 2010-02-08 21:25 edits
          -rw-r--r-- 1 todd todd  94 2010-02-08 21:25 fsimage
          -rw-r--r-- 1 todd todd   8 2010-02-08 21:25 fstime
          -rw-r--r-- 1 todd todd 100 2010-02-08 21:25 VERSION
          

          (i.e. all edits since the last successful checkpoint were lost)

          Todd Lipcon added a comment -

          Woops, somehow managed to mangle my edit buffer there... repasting, sorry for the spam.

          loadFSImage:
            - find a pair of EDITS and IMAGE that have the same checkpoint time and are from the latest checkpointTime
              (this ignores *_NEW)
            - recoverInterruptedCheckpoint:
              if there is an IMAGE_NEW:
                if there is EDITS_NEW:
                  delete IMAGE_NEW (since we assume we can replay from IMAGE + EDITS + EDITS_NEW?)
                else:
                  replace IMAGE with IMAGE_NEW, delete IMAGE_NEW
            - load IMAGE
            - load EDITS
            - load EDITS_NEW
            - if need to save:
              saveFSImage:
                save IMAGE_NEW
                truncate EDITS
                if EDITS_NEW exists:
                  truncate EDITS_NEW
                rollFSImage:
                  purgeEditLog:
                    replace EDITS with EDITS_NEW
                  renameCheckpoint:
                    replace IMAGE with IMAGE_NEW
          
          Todd Lipcon added a comment -

          I took some pseudocode notes on what's currently going on in the load/save code:

          loadFSImage:
            - find a pair of EDITS and IMAGE that have the same checkpoint time and are from the latest checkpointTime
              (this ignores *_NEW)
            - recoverInterruptedCheckpoint:
              if there is an IMAGE_NEW:
                if there is EDITS_NEW:
                  delete IMAGE_NEW (since we assume we can replay from IMAGE + EDITS + EDITS_NEW?)
                else:
                  replace IMAGE with IMAGE_NEW, delete IMAGE_NEW
            - load IMAGE
            - load EDITS
            - load EDITS_NEW
            - if need to save:
              saveFSImage:
                save IMAGE_NEW
                truncate EDITS
                if EDITS_NEW exists:
                  truncate EDITS_NEW
                rollFSImage:
                  purgeEditLog:
                    replace EDITS with EDITS_NEW
                  renameCheckpoint:
                    replace IMAGE with IMAGE_NEW

          Next I'll look at a failure at each point and see if recovery works. Longer term we should also figure out how to either use the FI test framework or some clever mockito spies to inject these failures for unit tests.

          Todd Lipcon added a comment -

          Konstantin: by any chance, do you have a document that describes the NN's startup protocol with regard to image loading? To make sure we've got the failure scenarios covered, I think we need to match the recovery protocol to all of the failure points (e.g. what happens with a half-written IMAGE_NEW, what happens if some dirs have _NEW and others don't, etc.).

          If no such document exists I'll go through the code to work on creating it, or at least a thorough JIRA comment we can reference from the code.

          Todd Lipcon added a comment -

          Reproducing the comment from HDFS-909:

          FSImage.saveFSImage has this code:

          1240       if (dirType.isOfType(NameNodeDirType.IMAGE))
          1241         saveFSImage(getImageFile(sd, NameNodeFile.IMAGE_NEW));
          1242       if (dirType.isOfType(NameNodeDirType.EDITS)) {
          1243         editLog.createEditLogFile(getImageFile(sd, NameNodeFile.EDITS));
          1244         File editsNew = getImageFile(sd, NameNodeFile.EDITS_NEW);
          1245         if (editsNew.exists())
          1246           editLog.createEditLogFile(editsNew);
          1247       }
          

          On line 1243 we truncate EDITS. Then if EDITS_NEW exists, we truncate it on 1246. All of this happens when the NN is in safe mode, so there shouldn't be any new edits coming in in the first place.

          I'm contending that lines 1243 and 1245 should both be deleted. We should always create the image as IMAGE_NEW (line 1241). Touching EDITS seems incorrect - what if the order of storage dirs is EDITS then IMAGE, so we run line 1243, kill our current edit log, and then crash before saving the current image?
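
          In other words, one reading of that proposal leaves the block as just the sketch below (an illustration of the suggestion, not a committed patch):

          if (dirType.isOfType(NameNodeDirType.IMAGE))
            saveFSImage(getImageFile(sd, NameNodeFile.IMAGE_NEW));
          // The edit-log truncation is gone: EDITS and EDITS_NEW are left
          // untouched here and only replaced later, during rollFSImage().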


            People

            • Assignee: Konstantin Shvachko
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 12
