Hadoop Common / HADOOP-4885

Try to restore failed replicas of Name Node storage (at checkpoint time)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Patch introduces a new configuration switch, dfs.name.dir.restore (boolean), enabling this functionality. Documentation needs to be updated.

      UPDATE: Config key is now "dfs.namenode.name.dir.restore" for 1.x and 2.x+ versions of HDFS
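
      For reference, enabling the switch in hdfs-site.xml would look roughly like the snippet below (use the key that matches your Hadoop version; the original key introduced by this patch was dfs.name.dir.restore):

        <property>
          <name>dfs.namenode.name.dir.restore</name>
          <value>true</value>
        </property>
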
    1. HADOOP-4885-3.patch
      20 kB
      Boris Shkolnik
    2. HADOOP-4885-3.patch
      20 kB
      Boris Shkolnik
    3. HADOOP-4885-1.patch
      19 kB
      Boris Shkolnik
    4. HADOOP-4885.patch
      18 kB
      Boris Shkolnik
    5. HADOOP-4885.patch
      18 kB
      Boris Shkolnik
    6. HADOOP-4885.branch-1.patch.3
      20 kB
      Brandon Li
    7. HADOOP-4885.branch-1.patch.2
      20 kB
      Brandon Li
    8. HADOOP-4885.branch-1.patch
      20 kB
      Brandon Li

      Issue Links

        Activity

        Boris Shkolnik added a comment -

        Problem:
        If one of the replicas of the NameNode storage fails for whatever reason (for example, a temporary failure of NFS), this Storage object is removed from the list of storage objects forever. It can be added back only on restart of the NameNode. We propose to check the status of a failed storage on every checkpoint and, if it becomes valid, try to restore the edits and fsimage.

        Boris Shkolnik added a comment -

        Current implementation:
        There is a list of StorageDir objects associated with each FSImage.
        There is also a list of EditLogs with each FSImage. One edit log has a
        corresponding StorageDir (same directory). When an IO error happens, the
        corresponding EditLog and StorageDir are removed from their respective
        lists.

        Suggested solution:
        When a StorageDir is removed, instead of discarding it we will put it into
        a separate list (the removedDir list).
        The edit log is removed and discarded.
        When the secondary node starts a checkpoint it first "rolls" the edit logs
        (rollEditLogs). This function verifies that there is no edits.new in any of
        the currently active directories and then creates them.

        Before it actually creates edits.new, we can go over the list of all the
        removed dirs and check whether they have become writable. If so, we can put
        them back into the list, so edits.new will be created there. We will also
        create the EditLogs object. And when a later checkpoint "puts" (putFSImage)
        the fsimage there, the directory will become active.
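
        A minimal sketch of that restore check (the field and method names below are illustrative assumptions, not the committed code):

        ------------
        // Called at the start of a checkpoint, before edits.new is created.
        // removedStorageDirs holds the directories that failed earlier.
        synchronized void attemptRestoreRemovedStorage() {
          Iterator<StorageDirectory> it = removedStorageDirs.iterator();
          while (it.hasNext()) {
            StorageDirectory sd = it.next();
            File root = sd.getRoot();
            // Only re-activate a replica whose directory is writable again.
            if (root.exists() && root.canWrite()) {
              storageDirs.add(sd);   // edits.new and, later, the fsimage go here
              it.remove();
            }
          }
        }
        ------------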

        Boris Shkolnik added a comment -

        Konstantin Shvachko commented:

        > We should not place them into removeDir, but rather re-read configuration
        > at this moment. This would also help if a new directory was configured
        > as a new storage.

        If we go this route - we should probably only allow addition of the new (or restoration of the failed) directories. No removal of existing/working ones.

        Boris Shkolnik added a comment -

        Using the configuration to figure out new or failed storage replicas may have some undesirable side effects. We do not keep the original configuration around, so the only option would be to reload a "default" configuration. This "default" configuration may or may not be the same as the one used when the cluster was started. One example of such a case would be any test that creates its configuration on the fly by populating some of its values in the setUp or init functions. In that case the first checkpoint would try to use all the settings from the default configuration instead of fixing the failed ones.

        Taking this into account, I suggest we go back to the original proposal, namely remembering failed replicas and trying to restore them on every checkpoint.

        Boris Shkolnik added a comment -

        Another issue is testing. To test this feature we would need to simulate a failure to write into the edit logs. For manual testing I used mounting,
        but that may not be portable across systems, so I need another solution.
        My suggestion is to create a "mock" of the stream class. It will extend EditLogFileOutputStream and override the write() methods.
        It will also introduce a boolean member flag to specify whether it is in "failure" mode. When the flag is set, the write methods will throw IOException; otherwise they will call the corresponding methods of the superclass.

        Class will look something like this:
        ------------
        class EditLogFileErrorTestOutputStream extends EditLogFileOutputStream {
          private boolean throwException = false;

          public void setThrowException(boolean val) {
            throwException = val;
          }

          public void write(int b) throws IOException {
            if (throwException) {
              throw new IOException("Simulated IOException in write()");
            } else {
              super.write(b);
            }
          }
        }
        ------------

        The configuration will have a setting "dfs.name.dir.simulateError" to specify that edit log streams of the EditLogFileErrorTestOutputStream class should be used.
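
        A rough sketch of how that switch could be consulted where the edit log output streams are created (the class and constructor usage below are assumptions for illustration, not the actual patch):

        ------------
        // Illustrative only: pick the stream implementation based on the test switch.
        boolean simulateErrors = conf.getBoolean("dfs.name.dir.simulateError", false);
        EditLogOutputStream stream = simulateErrors
            ? new EditLogFileErrorTestOutputStream(editsFile)  // test-only subclass above
            : new EditLogFileOutputStream(editsFile);          // normal stream
        ------------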

        Boris Shkolnik added a comment -

        Other suggestions for simulating storage failures in the test:
        1. Using the Security Manager.
        Unfortunately, the security manager is consulted when opening a file for writing, but not when writing into an already open stream.

        2. Deleting directories and calling doCheckpoint to force creation of a new edits file in these directories (which will fail and cause the storage to be moved to removedStorages).
        This looks promising. I am going to investigate this approach.
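
        A rough sketch of approach 2 (the MiniDFSCluster/SecondaryNameNode setup is omitted and the helper names are illustrative assumptions):

        ------------
        File nameDir = new File(BASE_DIR, "name1");   // one of the dfs.name.dir replicas

        // 1. Simulate a failed replica by deleting its directory out from under the NN.
        FileUtil.fullyDelete(nameDir);

        // 2. Checkpoint: creating edits.new in the missing directory fails, so the
        //    directory is moved to the removed-storage list.
        secondary.doCheckpoint();

        // 3. Bring the directory back and checkpoint again; the restore logic should
        //    re-activate it and repopulate its image and edits.
        assertTrue(nameDir.mkdirs());
        secondary.doCheckpoint();
        assertTrue(new File(nameDir, "current").exists());
        ------------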

        Boris Shkolnik added a comment -

        running patch-test on trunk

        ANT_HOME=/home/hadoopqa/tools/ant/apache-ant-1.7.1 ant -Dpatch.file=../patches/HADOOP-4885.patch -Dfindbugs.home=/home/ndaley/tools/findbugs/latest -Dforrest.home=/home/ndaley/tools/forrest/latest -Djava5.home=/usr/releng/tools/java/jdk1.5.0_06 -Dscratch.dir=../scratch_dir/ test-patch

        +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
        [exec]

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12399026/HADOOP-4885.patch
        against trunk revision 739416.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3781/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3781/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3781/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3781/console

        This message is automatically generated.

        Konstantin Shvachko added a comment -
        1. storageList() should be called listStorageDirectories(), it should be a member of the Storage class, and then you can use a simple loop like
             for (StorageDirectory sd : storageDirs) {
           Also, it would be better to use sd instead of st.
        2. There are some trivial changes like an empty line and adding { } in a one-line if statement. It would be better to remove those.
        3. Long lines in tryToRestoreRemovedStorage() should be split.
        4. In the line
             if (root.exists() && root.canWrite() && FileUtil.fullyDelete(root) && root.mkdir()) {
               format(sd);
           you do all operations twice, because format() does fullyDelete and mkDirs inside.
        5. In FSEditLog.logEdit() you should use warn(Object arg, Throwable arg) rather than printing ie.getLocalizedMessage().
        6. If possible please keep line length within the 80-column margin according to the Hadoop coding style.
        7. "failed" vs "faild" in getEditLogSize()
        8. In rollEditLog():
           1. // check if any of failed storage is now available and put it back
           2. You replaced processIOError(sd) by simply removing edit streams from the list of streams. That does not seem right. The name-node should fail if the last stream is being removed.
        Boris Shkolnik added a comment -

        Implemented the review comments.

        Konstantin Shvachko added a comment -

        +1 This looks good.
        Need test results. Resubmitting.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12399585/HADOOP-4885-1.patch
        against trunk revision 742698.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3820/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3820/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3820/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3820/console

        This message is automatically generated.

        Boris Shkolnik added a comment -

        The failed test is reported in HADOOP-5172 as failing regularly on Hudson builds.

        Boris Shkolnik added a comment -

        added license information to the new file.

        Konstantin Shvachko added a comment -

        I just committed this. Thank you Boris.

        Hudson added a comment -

        Integrated in Hadoop-trunk #758 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/758/)
        HADOOP-4885. Committing additional file TestStorageRestore.java.
        HADOOP-4885. Try to restore failed name-node storage directories at checkpoint time. Contributed by Boris Shkolnik.

        Konstantin Boudnik added a comment -

        This feature does not seem to be completely working: adding back a once-removed storage volume doesn't happen as expected (see HDFS-1496). I'd suggest disabling this new feature for now.

        Konstantin Boudnik added a comment -

        After a bit more investigation I noticed (dah!) the new config parameter dfs.name.dir.restore, which triggers removed storage restoration. The fsimage files for both volumes (NFS-mounted and non-NFS), as well as the secondary NN's checkpoints, have the same md5sums. So it seems (as Hairong pointed out elsewhere) that even without HDFS-903 this feature kinda works.

        Boris Shkolnik added a comment -

        fix submitted in HDFS-1602.

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #539 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/539/)

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-22-branch #35 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/35/)

        Brandon Li added a comment -

        Backport this fix along with the test to branch-1.

        Tsz Wo Nicholas Sze added a comment -

        The getRestoreRemovedDirs() below should be removed.

        +  boolean getRestoreRemovedDirs() {
        +    return this.restoreRemovedDirs;
        +  }
        
        Brandon Li added a comment -

        Removed the unused method in the new patch.

        Eli Collins added a comment -
        • In the trunk patch we're also adding directories to removedStorageDirs, seems like we'll need those additions here right?
        • The trunk version uses addStorageDir, any reason that it's done differently here?
        • Testing?
        Brandon Li added a comment -

        Hi Eli,
        Thanks for the comments!
        The code base in branch-1 is slightly different from 0.21.
        Adding directories to removedStorageDirs, as in the original patch, is already in branch-1.
        I didn't get your second question: my patch uses addStorageDir too.
        The same test, with minor modifications (e.g., comparing md5 instead of length for edits files), is included in the backport patch.

        Thanks.

        Jitendra Nath Pandey added a comment -

        The patch looks good to me. It would be great to add a few lines to test that namenode can restart with just a restored edits/image directory.

        Brandon Li added a comment -

        The new patch has the restart test. Thanks!

        Jitendra Nath Pandey added a comment -

        +1 the patch for branch-1 looks good to me.

        Tsz Wo Nicholas Sze added a comment -

        +1 The patch also looks good to me. I will commit this in HDFS-3075.

        Eli Collins added a comment -

        I didn't get your second question: my patch uses addStorageDir too.

        What I meant was the trunk patch does the following, which is much shorter:

        sd.clearDirectory();
        addStorageDir(sd);
        

        and leverages the fact that checkpoint populates the directory. Why not use the same approach here?

        • I'd test with a real NFS mount and disconnect/reconnect the network. I found some bugs that way when backporting this a while back. Also discovered HDFS-2701, HDFS-2702, HDFS-2703 via testing with a real build instead of the unit tests.
        • Nit: s/"may should be mounted"/"may be a network mount"/
        Tsz Wo Nicholas Sze added a comment -

        Hi Eli,

        Brandon addressed all your earlier comments last night. I did not see your further comments before I committed the patch.

        You made some good points in your previous comment. As always, we could file a JIRA for them.

        Does it sound good?

        Brandon Li added a comment -

        The format-addStorageDir solution makes the failed directory "active" immediately, even though it is not in a real active state. The state is visible from the NN UI and JMX. If the checkpoint fails, the fake "Active" state can be misleading.

        The copy-over solution may do some extra work, but it puts the recovered storage directories in a real active state.

        I agree those 3 JIRA issues you mentioned should be backported to branch 1.0.2 as well (the backport patch here is for branch-1, not 1.0.2).

        It's a good point about the network mount problem.
        It's also a problem with the original patch: the "format-addStorageDir" approach creates the storage directory if it doesn't exist. However, if this storage directory is a mount point, it shouldn't be created automatically. HDFS-3095 is filed for this issue.

        Eli Collins added a comment -

        The format-addStorageDir solution makes the failed directory "active" immediately, even though it is not in a real active state. The state is visible from the NN UI and JMX. If the checkpoint fails, the fake "Active" state can be misleading.

        Not sure I'm following... when you roll the log and it restores the storage directory, it creates a new empty storage dir, the directory is added to the list of storage dirs, and a new edit log is immediately created in it (see FSEditLog#rollEditLog), i.e. it is immediately "active", right?

        Have you done any testing of this patch aside from running the unit tests?

        Brandon Li added a comment -

        I did manual tests before including the unit tests in the backport patch.

        The new edit log is immediately created in the new storage directory, but the rolled edit log doesn't exist in the recovered storage directory. From the NN UI, the healthy storage directory and the recovered directory both have "Active" status. This is why I said it's "misleading".

        It would be an even more obvious problem when the storage directory is an fsimage-only directory. From the NN UI/JMX, the administrator can't tell which "Active" storage directory has an fsimage inside and which doesn't. The same "Active" state means different things at different times for different directories.

        Eli Collins added a comment -

        Ah, that makes sense. Thanks for the explanation! +1 to the latest patch

        Nathan Roberts added a comment -

        Quick question on this patch. Are there any negative effects if the images being restored are very large or the restore is otherwise very slow? Just wondering because at first glance it looks like the restoration is being done after closing the current edits log and before starting edits.new.

        Brandon Li added a comment -

        Hi Nathan,
        It could be slow if the image is very large, though currently the image size is limited by the memory size.

        Thanks,
        Brandon

        Nathan Roberts added a comment -

        Thanks for the response, Brandon. My real concern is whether or not the namenode can continue completely normal operation during a long-running restoration (several minutes for an image of tens of GB). Or, since we haven't started the edits.new file yet, something may actually block?

        Kihwal Lee added a comment -
        • It would be better if the resyncing up to the last closed edit log is done asynchronously. That way, the NN only needs to sync one or two edits while rolling the log.
        • It seems that if a restore fails, rollEditLog() also fails even if there are healthy directories. Ideally any exceptions from dealing with the removed dirs should be ignored.
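
        For the second point, a minimal sketch of the idea (the restore method name below is an illustrative assumption):

        ------------
        // Failures while restoring previously removed directories should not
        // abort rollEditLog() as long as healthy directories remain.
        try {
          attemptRestoreRemovedStorage();
        } catch (IOException e) {
          LOG.warn("Could not restore a removed storage directory; continuing", e);
        }
        ------------
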
        Brandon Li added a comment -

        since we haven't started the edits.new file yet, something may actually block??

        Filesystem modification can be blocked. This step should be optimized. By default, the automatic restore is disabled.

        It would be better if the resyncing up to the last closed edit log is done asynchronously.

        This could be a way to optimize the operation. Another way is not to copy over the files but to wait for checkpoint processing to populate the new image and edit logs. For the second approach, the storage directories being restored should have a new state (e.g., formatted or restoring) rather than "active".

        Ideally any exceptions from dealing with the removed dirs should be ignored.

        Agree.

        Harsh J added a comment -

        Looks like this made it to branch-1 at some point but no one edited the fix versions field yet. Setting it to 1.1.0 (as CHANGES seems to suggest 1.0.2 was unreleased then), but correct as necessary.

        Harsh J added a comment -

        Correction: 1.0.3 has it.

        Bertrand Dechoux added a comment -

        I did a
        grep -R "dfs.name.dir.restore" src
        on a downloaded version of Hadoop 1.0.3 and found no match.
        Maybe the fix version should be updated.

        Harsh J added a comment -

        Lets visit HDFS-3075 for the backport. I removed the versioning from here as it was erroneous.

        Brandon Li added a comment -

        I did a grep -R "dfs.name.dir.restore" src on a downloaded version of Hadoop 1.0.3 and found no match.

        The property name is dfs.namenode.name.dir.restore.

        Bertrand Dechoux added a comment -

        grep -R "dfs.namenode.name.dir.restore" *

        src/hdfs/org/apache/hadoop/hdfs/DFSConfigKeys.java: public static final String DFS_NAMENODE_NAME_DIR_RESTORE_KEY = "dfs.namenode.name.dir.restore";

        Great! I will test it. The documentation does not seem to be updated, but that's a detail. (Same for the description of the JIRA...)

        Bertrand Dechoux added a comment -

        Sorry for the misunderstanding, but it seems it was indeed added (in 1.0.2, as the backport says), so maybe this JIRA could once again be updated to reflect that. (I can't modify the fix versions.) I will probably test this feature out of curiosity soon.


          People

          • Assignee:
            Boris Shkolnik
            Reporter:
            Boris Shkolnik
          • Votes:
            0
            Watchers:
            8