Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-3540

Further improvement on recovery mode and edit log toleration in branch-1

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Recovery Mode: HDFS-3479 backported HDFS-3335 to branch-1. However, the recovery mode feature in branch-1 is dramatically different from the recovery mode in trunk since the edit log implementations in these two branch are different. For example, there is UNCHECKED_REGION_LENGTH in branch-1 but not in trunk.

      Edit Log Toleration: HDFS-3521 added this feature to branch-1 to remedy UNCHECKED_REGION_LENGTH and to tolerate edit log corruption.

      There are overlaps between these two features. We study potential further improvement in this issue.

      1. h3540_20120925.patch
        2 kB
        Tsz Wo Nicholas Sze
      2. h3540_20120926.patch
        4 kB
        Tsz Wo Nicholas Sze
      3. h3540_20120927.patch
        5 kB
        Tsz Wo Nicholas Sze
      4. h3540_20121009.patch
        6 kB
        Tsz Wo Nicholas Sze
      5. HDFS-3540-b1.004.patch
        31 kB
        Colin Patrick McCabe

        Issue Links

          Activity

          Hide
          Matt Foley added a comment -

          Closed upon release of Hadoop 1.2.0.

          Show
          Matt Foley added a comment - Closed upon release of Hadoop 1.2.0.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > After the default is changed, all existing tests except TestEditLogToleration ...

          I should say "except for a few" instead since some Recovery Mode tests run with DFS_NAMENODE_EDITS_TOLERATION_LENGTH_KEY == -1.

          Show
          Tsz Wo Nicholas Sze added a comment - > After the default is changed, all existing tests except TestEditLogToleration ... I should say "except for a few" instead since some Recovery Mode tests run with DFS_NAMENODE_EDITS_TOLERATION_LENGTH_KEY == -1.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I have committed this.

          Show
          Tsz Wo Nicholas Sze added a comment - I have committed this.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          After the default is changed, all existing tests except TestEditLogToleration run with DFS_NAMENODE_EDITS_TOLERATION_LENGTH_KEY==0.

          Show
          Tsz Wo Nicholas Sze added a comment - After the default is changed, all existing tests except TestEditLogToleration run with DFS_NAMENODE_EDITS_TOLERATION_LENGTH_KEY==0.
          Hide
          Suresh Srinivas added a comment -

          +1 for the patch. Are there any tests for DFS_NAMENODE_EDITS_TOLERATION_LENGTH_KEY set to 0?

          Show
          Suresh Srinivas added a comment - +1 for the patch. Are there any tests for DFS_NAMENODE_EDITS_TOLERATION_LENGTH_KEY set to 0?
          Hide
          Tsz Wo Nicholas Sze added a comment -
               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     -1 findbugs.  The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.
          

          I also have run the unit tests.

          Show
          Tsz Wo Nicholas Sze added a comment - [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings. I also have run the unit tests.
          Hide
          Colin Patrick McCabe added a comment -

          Looks good to me.

          Show
          Colin Patrick McCabe added a comment - Looks good to me.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          h3540_20121009.patch: updates TestEditLogLoading.

          Show
          Tsz Wo Nicholas Sze added a comment - h3540_20121009.patch: updates TestEditLogLoading.
          Hide
          Colin Patrick McCabe added a comment -

          It looks reasonable to me. I think you need to update TestEditLogLoading to look for the right error message, since changing the default value of edit.log.toleration changes the IOException message that will be thrown.

          Show
          Colin Patrick McCabe added a comment - It looks reasonable to me. I think you need to update TestEditLogLoading to look for the right error message, since changing the default value of edit.log.toleration changes the IOException message that will be thrown.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Oops, forgot to upload the patch earlier.

          h3540_20120927.patch: updates TestNameNodeRecovery.

          Show
          Tsz Wo Nicholas Sze added a comment - Oops, forgot to upload the patch earlier. h3540_20120927.patch: updates TestNameNodeRecovery.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Colin, thanks for the hint. Will update the tests.

          Show
          Tsz Wo Nicholas Sze added a comment - Colin, thanks for the hint. Will update the tests.
          Hide
          Colin Patrick McCabe added a comment -

          The approach seems reasonable. However, some changes may be necessary to get the unit tests running, especially TestNameNodeRecovery, which does not alter the default value for edit log toleration, and will not work if there are prompts during the startup process.

          Show
          Colin Patrick McCabe added a comment - The approach seems reasonable. However, some changes may be necessary to get the unit tests running, especially TestNameNodeRecovery, which does not alter the default value for edit log toleration, and will not work if there are prompts during the startup process.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          "Cancel Patch" since Jenkins could not test branch-1.

          Show
          Tsz Wo Nicholas Sze added a comment - "Cancel Patch" since Jenkins could not test branch-1.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          h3540_20120926.patch: let's prompt the user.

          Show
          Tsz Wo Nicholas Sze added a comment - h3540_20120926.patch: let's prompt the user.
          Hide
          Colin Patrick McCabe added a comment -

          Hi Colin, thanks for posting a patch. As you mentioned, branch-1 is a stable branch so that it is better to minimize the change. Let's change the default toleration length to 0. Then, user could see all the length information before using recovery mode. Does it sound good to you?

          This seems reasonable to me.

          It seems like if we're going to go this route, specifying -recover should also cause edit.log.toleration to be set to -1 automatically?. Otherwise Recovery Mode will not work, and the user will probably be confused as to why. Perhaps a change like this?

          diff --git a/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java b/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          index e4e5045..82d70b4 100644
          --- a/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          +++ b/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          @@ -1031,7 +1031,8 @@ public class FSImage extends Storage {
               int numEdits = 0;
               EditLogFileInputStream edits = 
                 new EditLogFileInputStream(getImageFile(sd, NameNodeFile.EDITS));
          -    numEdits = FSEditLog.loadFSEdits(edits, editsTolerationLength, recovery);
          +    numEdits = FSEditLog.loadFSEdits(edits, 
          +        recovery == null ? editsTolerationLength : -1, recovery);
               edits.close();
               File editsNew = getImageFile(sd, NameNodeFile.EDITS_NEW);
               if (editsNew.exists() && editsNew.length() > 0) {
          
          Show
          Colin Patrick McCabe added a comment - Hi Colin, thanks for posting a patch. As you mentioned, branch-1 is a stable branch so that it is better to minimize the change. Let's change the default toleration length to 0. Then, user could see all the length information before using recovery mode. Does it sound good to you? This seems reasonable to me. It seems like if we're going to go this route, specifying -recover should also cause edit.log.toleration to be set to -1 automatically?. Otherwise Recovery Mode will not work, and the user will probably be confused as to why. Perhaps a change like this? diff --git a/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java b/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java index e4e5045..82d70b4 100644 --- a/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java +++ b/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java @@ -1031,7 +1031,8 @@ public class FSImage extends Storage { int numEdits = 0; EditLogFileInputStream edits = new EditLogFileInputStream(getImageFile(sd, NameNodeFile.EDITS)); - numEdits = FSEditLog.loadFSEdits(edits, editsTolerationLength, recovery); + numEdits = FSEditLog.loadFSEdits(edits, + recovery == null ? editsTolerationLength : -1, recovery); edits.close(); File editsNew = getImageFile(sd, NameNodeFile.EDITS_NEW); if (editsNew.exists() && editsNew.length() > 0) {
          Hide
          Colin Patrick McCabe added a comment -

          Hi Colin, thanks for posting a patch. As you mentioned, branch-1 is a stable branch so that it is better to minimize the change. Let's change the default toleration length to 0. Then, user could see all the length information before using recovery mode. Does it sound good to you?

          This seems reasonable to me.

          It seems like if we're going to go this route, specifying -recover should also cause edit.log.toleration to be set to -1 automatically?. Otherwise Recovery Mode will not work, and the user will probably be confused as to why. Perhaps a change like this?

          diff --git a/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java b/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          index e4e5045..82d70b4 100644
          --- a/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          +++ b/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java
          @@ -1031,7 +1031,8 @@ public class FSImage extends Storage {
               int numEdits = 0;
               EditLogFileInputStream edits = 
                 new EditLogFileInputStream(getImageFile(sd, NameNodeFile.EDITS));
          -    numEdits = FSEditLog.loadFSEdits(edits, editsTolerationLength, recovery);
          +    numEdits = FSEditLog.loadFSEdits(edits, 
          +        recovery == null ? editsTolerationLength : -1, recovery);
               edits.close();
               File editsNew = getImageFile(sd, NameNodeFile.EDITS_NEW);
               if (editsNew.exists() && editsNew.length() > 0) {
          
          Show
          Colin Patrick McCabe added a comment - Hi Colin, thanks for posting a patch. As you mentioned, branch-1 is a stable branch so that it is better to minimize the change. Let's change the default toleration length to 0. Then, user could see all the length information before using recovery mode. Does it sound good to you? This seems reasonable to me. It seems like if we're going to go this route, specifying -recover should also cause edit.log.toleration to be set to -1 automatically?. Otherwise Recovery Mode will not work, and the user will probably be confused as to why. Perhaps a change like this? diff --git a/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java b/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java index e4e5045..82d70b4 100644 --- a/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java +++ b/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSImage.java @@ -1031,7 +1031,8 @@ public class FSImage extends Storage { int numEdits = 0; EditLogFileInputStream edits = new EditLogFileInputStream(getImageFile(sd, NameNodeFile.EDITS)); - numEdits = FSEditLog.loadFSEdits(edits, editsTolerationLength, recovery); + numEdits = FSEditLog.loadFSEdits(edits, + recovery == null ? editsTolerationLength : -1, recovery); edits.close(); File editsNew = getImageFile(sd, NameNodeFile.EDITS_NEW); if (editsNew.exists() && editsNew.length() > 0) {
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12546530/h3540_20120925.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3230//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12546530/h3540_20120925.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3230//console This message is automatically generated.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Hi Colin, thanks for posting a patch. As you mentioned, branch-1 is a stable branch so that it is better to minimize the change. Let's change the default toleration length to 0. Then, user could see all the length information before using recovery mode. Does it sound good to you?

          h3540_20120925.patch: changes the default toleration length to 0.

          Show
          Tsz Wo Nicholas Sze added a comment - Hi Colin, thanks for posting a patch. As you mentioned, branch-1 is a stable branch so that it is better to minimize the change. Let's change the default toleration length to 0. Then, user could see all the length information before using recovery mode. Does it sound good to you? h3540_20120925.patch: changes the default toleration length to 0.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12545467/HDFS-3540-b1.004.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3199//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12545467/HDFS-3540-b1.004.patch against trunk revision . -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3199//console This message is automatically generated.
          Hide
          Colin Patrick McCabe added a comment -

          This patch integrates edit log toleration with Recovery Mode, so that both can be turned on at the same time without interfering with one another. It also fixes a bug where we would not do HDFS-3335-style checking of the end of the log unless edit log toleration was enabled.

          This patch also contains a small improvement to Recovery Mode: RM now prints out the approximate remaining length of the edit log file. By this, I mean the length of the file that remains, minus the length of the padding bytes. This should help the system administrator get a better idea of how many bytes of edits he is discarding if he chooses "stop reading the edit log here."

          Show
          Colin Patrick McCabe added a comment - This patch integrates edit log toleration with Recovery Mode, so that both can be turned on at the same time without interfering with one another. It also fixes a bug where we would not do HDFS-3335 -style checking of the end of the log unless edit log toleration was enabled. This patch also contains a small improvement to Recovery Mode: RM now prints out the approximate remaining length of the edit log file. By this, I mean the length of the file that remains, minus the length of the padding bytes. This should help the system administrator get a better idea of how many bytes of edits he is discarding if he chooses "stop reading the edit log here."
          Hide
          Colin Patrick McCabe added a comment -

          Well, we do have access to the FileChannel, so seeking is possible.

          Show
          Colin Patrick McCabe added a comment - Well, we do have access to the FileChannel, so seeking is possible.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          For the suggestion on printing the corruption/padding lengths, it may not be possible in recovery mode unless it has two passes.

          Show
          Tsz Wo Nicholas Sze added a comment - For the suggestion on printing the corruption/padding lengths, it may not be possible in recovery mode unless it has two passes.
          Hide
          Colin Patrick McCabe added a comment -

          From what I understand based on previous comments, it allows an operator to continue with corrupt editlog or abort. Not sure if abort is really a choice. What would one do after abort?

          The first thing to try is moving aside the edit log directory that had the problem and seeing if you can reload with another one of the directories. If it's a random I/O corruption, normally only one of the copies of the edit log stored on disk will be bad. Since there's no edit log failover in branch-1, you have to do it yourself. If all the copies are corrupt, it may be necessary to use a hex editor on the edit log, or a similar technique. The offset of the failure is provided so you can check it out manually.

          Perhaps we should consider printing more information during recovery to help an admin understand the state of the editlog. Is that possible?

          Nicholas mentioned earlier that it might be helpful to print out how many bytes are left in the log-- even though this can be computed from the information provided, it could be helpful to be more explicit about it.

          There may be other information that can be printed out too-- I'll take a look.

          Show
          Colin Patrick McCabe added a comment - From what I understand based on previous comments, it allows an operator to continue with corrupt editlog or abort. Not sure if abort is really a choice. What would one do after abort? The first thing to try is moving aside the edit log directory that had the problem and seeing if you can reload with another one of the directories. If it's a random I/O corruption, normally only one of the copies of the edit log stored on disk will be bad. Since there's no edit log failover in branch-1, you have to do it yourself. If all the copies are corrupt, it may be necessary to use a hex editor on the edit log, or a similar technique. The offset of the failure is provided so you can check it out manually. Perhaps we should consider printing more information during recovery to help an admin understand the state of the editlog. Is that possible? Nicholas mentioned earlier that it might be helpful to print out how many bytes are left in the log-- even though this can be computed from the information provided, it could be helpful to be more explicit about it. There may be other information that can be printed out too-- I'll take a look.
          Hide
          Suresh Srinivas added a comment -

          Error toleration is a useful feature when namenode continues to function without manual intervention. This helps in HA setup. That said, I think recovery mode may be useful if a cluster admin choses to make it manual.

          From what I understand based on previous comments, it allows an operator to continue with corrupt editlog or abort. Not sure if abort is really a choice. What would one do after abort? To that end, Nicholas, some of the information you print such as the editlog length, corruption lenght and padding length etc. should be printed in recovery mode. This information will be useful when one wants to continue ignoring the corrupt part of the editlog.

          Given this, I would leave recovery mode alone and not remove it. Perhaps we should consider printing more information during recovery to help an admin understand the state of the editlog. Is that possible?

          Show
          Suresh Srinivas added a comment - Error toleration is a useful feature when namenode continues to function without manual intervention. This helps in HA setup. That said, I think recovery mode may be useful if a cluster admin choses to make it manual. From what I understand based on previous comments, it allows an operator to continue with corrupt editlog or abort. Not sure if abort is really a choice. What would one do after abort? To that end, Nicholas, some of the information you print such as the editlog length, corruption lenght and padding length etc. should be printed in recovery mode. This information will be useful when one wants to continue ignoring the corrupt part of the editlog. Given this, I would leave recovery mode alone and not remove it. Perhaps we should consider printing more information during recovery to help an admin understand the state of the editlog. Is that possible?
          Hide
          Colin Patrick McCabe added a comment -

          Hi Colin, you keep mentioning HDFS-3004 or the recovery mode feature in trunk. However, we are talking about branch-1 recovery mode here.

          The reason why I mentioned HDFS-3004 is because the original design doc contains a good explanation of why recovery mode should not be enabled in normal operation:

          Why can't we simply do recovery as part of normal NameNode operation?  Well,
          recovery may involve destructive changes to the NameNode metadata.  Since the
          metadata is corrupt, we will have to use guesswork to get back to a valid
          state.
          

          This issue is the same in both branch-1 and later branches: if you have to guess, you shouldn't make the process automatic.

          The branch-1 recovery mode feature is not yet released. If the new feature has problems, we should remove it. It is not a point if people already know how to use it. If there are people using development code, they have to get prepared that the un-released new feature may be changed or removed.

          It would be inconvenient for us to remove RM for branch-1. I am willing to consider it, but I just don't think the arguments presented here so far have been convincing.

          I think the first thing we need to answer is what is the use case for edit log toleration? What are your guidelines for when edit log toleration should be turned on? This has never been clear to me. It seems to me if you wanted to get higher availability, you would be better off implementing edit log failover in branch-1.

          At the very least, it would be nice to have a document explaining who the intended users are for edit log toleration, why they would use it rather than something else, and what the risks are. At that point we could start to consider what the best resolution for this is-- whatever that may be.

          Show
          Colin Patrick McCabe added a comment - Hi Colin, you keep mentioning HDFS-3004 or the recovery mode feature in trunk. However, we are talking about branch-1 recovery mode here. The reason why I mentioned HDFS-3004 is because the original design doc contains a good explanation of why recovery mode should not be enabled in normal operation: Why can't we simply do recovery as part of normal NameNode operation? Well, recovery may involve destructive changes to the NameNode metadata. Since the metadata is corrupt, we will have to use guesswork to get back to a valid state. This issue is the same in both branch-1 and later branches: if you have to guess, you shouldn't make the process automatic. The branch-1 recovery mode feature is not yet released. If the new feature has problems, we should remove it. It is not a point if people already know how to use it. If there are people using development code, they have to get prepared that the un-released new feature may be changed or removed. It would be inconvenient for us to remove RM for branch-1. I am willing to consider it, but I just don't think the arguments presented here so far have been convincing. I think the first thing we need to answer is what is the use case for edit log toleration? What are your guidelines for when edit log toleration should be turned on? This has never been clear to me. It seems to me if you wanted to get higher availability, you would be better off implementing edit log failover in branch-1. At the very least, it would be nice to have a document explaining who the intended users are for edit log toleration, why they would use it rather than something else, and what the risks are. At that point we could start to consider what the best resolution for this is-- whatever that may be.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Hi Colin, you keep mentioning HDFS-3004 or the recovery mode feature in trunk. However, we are talking about branch-1 recovery mode here.

          > ... There's many more convincing process-based arguments. branch-1 is a stable branch. We should be fixing bugs, not making major changes. We should be trying to minimize the divergence between branch-1 and branch-2, not amplify it. People already know how to use recovery mode. We're not going to retrain people to use an (in my opinion more error-prone) system that does the same thing.

          The branch-1 recovery mode feature is not yet released. If the new feature has problems, we should remove it. It is not a point if people already know how to use it. If there are people using development code, they have to get prepared that the un-released new feature may be changed or removed.

          Also, we don't not have to minimize the divergence between branch-1 and branch-2. We usually add new features to branch-2 but not branch-1.

          Show
          Tsz Wo Nicholas Sze added a comment - Hi Colin, you keep mentioning HDFS-3004 or the recovery mode feature in trunk. However, we are talking about branch-1 recovery mode here. > ... There's many more convincing process-based arguments. branch-1 is a stable branch. We should be fixing bugs, not making major changes. We should be trying to minimize the divergence between branch-1 and branch-2, not amplify it. People already know how to use recovery mode. We're not going to retrain people to use an (in my opinion more error-prone) system that does the same thing. The branch-1 recovery mode feature is not yet released. If the new feature has problems, we should remove it. It is not a point if people already know how to use it. If there are people using development code, they have to get prepared that the un-released new feature may be changed or removed. Also, we don't not have to minimize the divergence between branch-1 and branch-2. We usually add new features to branch-2 but not branch-1.
          Hide
          Colin Patrick McCabe added a comment -

          Let me go into a little more detail here.

          When we were originally talking about Recovery Mode, one big concern we had was that system administrators would overuse Recovery Mode to fix issues that might be better addressed in a different way. Of course, it's impossible to prevent all misuse-- human beings are not perfect, and any tool can be misused. That's the reason why we made recovery mode a startup option, rather than a configuration. It would be too easy for people to set the configuration and then leave it set even after the problem was gone. That's also the reason why an NameNode in RM exits as soon as it has loaded the edit log and written a new FSImage. This was all discussed in HDFS-3004.

          Obviously edit log toleration goes against those assumptions, and in a way that frankly, I think is very dangerous.

          Recovery Mode is generally an extensible concept. Since it has nothing to do with the physical structure of the edit log on-disk, it can be extended to handle arbitrary types of corruption. For example, what if you encounter an edit that relies on a directory that doesn't exist (because of corruption earlier in the log)? This is something that recovery mode could conceivably handle by displaying a prompt and asking "would you like to create the parent directory for the directory this edit references?"

          Edit Log Toleration is not extensible. It can only ever handle one type of corruption: tail corruption. But we rarely see tail corruption any more, since FSEditLog preallocation was improved in branch-1 (HDFS-3596). I can't think of a single case of tail corruption we've seen in the past few months. Many of the cases of corruption we've seen have been HDFS-3652, and edit log toleration is inherently useless for this purpose. Missing features can be fixed; inherent uselessness cannot.

          And these are just the technical arguments. There's many more convincing process-based arguments. branch-1 is a stable branch. We should be fixing bugs, not making major changes. We should be trying to minimize the divergence between branch-1 and branch-2, not amplify it. People already know how to use recovery mode. We're not going to retrain people to use an (in my opinion more error-prone) system that does the same thing.

          Let's just fix the bugs we have (I have pointed out some in this thread), get stuff working, and focus our efforts on the future not the past.

          Show
          Colin Patrick McCabe added a comment - Let me go into a little more detail here. When we were originally talking about Recovery Mode, one big concern we had was that system administrators would overuse Recovery Mode to fix issues that might be better addressed in a different way. Of course, it's impossible to prevent all misuse-- human beings are not perfect, and any tool can be misused. That's the reason why we made recovery mode a startup option, rather than a configuration. It would be too easy for people to set the configuration and then leave it set even after the problem was gone. That's also the reason why an NameNode in RM exits as soon as it has loaded the edit log and written a new FSImage. This was all discussed in HDFS-3004 . Obviously edit log toleration goes against those assumptions, and in a way that frankly, I think is very dangerous. Recovery Mode is generally an extensible concept. Since it has nothing to do with the physical structure of the edit log on-disk, it can be extended to handle arbitrary types of corruption. For example, what if you encounter an edit that relies on a directory that doesn't exist (because of corruption earlier in the log)? This is something that recovery mode could conceivably handle by displaying a prompt and asking "would you like to create the parent directory for the directory this edit references?" Edit Log Toleration is not extensible. It can only ever handle one type of corruption: tail corruption. But we rarely see tail corruption any more, since FSEditLog preallocation was improved in branch-1 ( HDFS-3596 ). I can't think of a single case of tail corruption we've seen in the past few months. Many of the cases of corruption we've seen have been HDFS-3652 , and edit log toleration is inherently useless for this purpose. Missing features can be fixed; inherent uselessness cannot. And these are just the technical arguments. There's many more convincing process-based arguments. branch-1 is a stable branch. We should be fixing bugs, not making major changes. We should be trying to minimize the divergence between branch-1 and branch-2, not amplify it. People already know how to use recovery mode. We're not going to retrain people to use an (in my opinion more error-prone) system that does the same thing. Let's just fix the bugs we have (I have pointed out some in this thread), get stuff working, and focus our efforts on the future not the past.
          Hide
          Colin Patrick McCabe added a comment -

          The main issue of branch-1 recovery mode is silent data loss.

          Recovery Mode prompts the sysadmin before doing anything. Edit Log Toleration never prompts.

          Recovery Mode can't be turned on in the configuration. Edit Log Toleration can only be turned on in the configuration.

          It is Edit Log Toleration that can lead to silent data loss. Recovery Mode never does.

          A minor issue is that it does not provide useful features: it only let admin to choose "stop reading" or "quit without saving". "Stop reading" can be done by setting toleration length and "quit without saving" simply means that NN cannot start up.

          If we need to add more functionality, then let's add more functionality, not remove it.

          Show
          Colin Patrick McCabe added a comment - The main issue of branch-1 recovery mode is silent data loss. Recovery Mode prompts the sysadmin before doing anything. Edit Log Toleration never prompts. Recovery Mode can't be turned on in the configuration. Edit Log Toleration can only be turned on in the configuration. It is Edit Log Toleration that can lead to silent data loss. Recovery Mode never does. A minor issue is that it does not provide useful features: it only let admin to choose "stop reading" or "quit without saving". "Stop reading" can be done by setting toleration length and "quit without saving" simply means that NN cannot start up. If we need to add more functionality, then let's add more functionality, not remove it.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          First of all, let's focus the discussion in branch-1 since edit log recovery mode feature in branch-1 and >= branch-2 are quite different.

          The main issue of branch-1 recovery mode is silent data loss. A minor issue is that it does not provide useful features: it only let admin to choose "stop reading" or "quit without saving". "Stop reading" can be done by setting toleration length and "quit without saving" simply means that NN cannot start up.

          For edit log toleration, admin could first set toleration length to zero. If corruption is found, the admin could change the toleration length accordingly.

          Show
          Tsz Wo Nicholas Sze added a comment - First of all, let's focus the discussion in branch-1 since edit log recovery mode feature in branch-1 and >= branch-2 are quite different. The main issue of branch-1 recovery mode is silent data loss. A minor issue is that it does not provide useful features: it only let admin to choose "stop reading" or "quit without saving". "Stop reading" can be done by setting toleration length and "quit without saving" simply means that NN cannot start up. For edit log toleration, admin could first set toleration length to zero. If corruption is found, the admin could change the toleration length accordingly.
          Hide
          Colin Patrick McCabe added a comment -

          Here's a table of the abilities of the two features (Recovery mode and edit log toleration):

            skip over edits discard the end of the edit log requires administrator intervention
          edit log toleration no yes no
          edit log recovery mode branch-2 and later yes yes

          We have considered enhancing RM in branch-1 so that it can skip over certain edits. When handling corruptions caused by HDFS-3652, it would have been nice to have this capability.

          Show
          Colin Patrick McCabe added a comment - Here's a table of the abilities of the two features (Recovery mode and edit log toleration):   skip over edits discard the end of the edit log requires administrator intervention edit log toleration no yes no edit log recovery mode branch-2 and later yes yes We have considered enhancing RM in branch-1 so that it can skip over certain edits. When handling corruptions caused by HDFS-3652 , it would have been nice to have this capability.
          Hide
          Colin Patrick McCabe added a comment -

          I should be more specific: For Recovery Mode without HDFS-3479, if there is corruption in the middle of the edit log and the user chooses "stop reading" in recovery mode, then the remaining data of the edit log will not be checked. Is it correct?

          Yes, if the user chooses "stop reading" in Recovery Mode, the rest of the edit log will be discarded.

          Show
          Colin Patrick McCabe added a comment - I should be more specific: For Recovery Mode without HDFS-3479 , if there is corruption in the middle of the edit log and the user chooses "stop reading" in recovery mode, then the remaining data of the edit log will not be checked. Is it correct? Yes, if the user chooses "stop reading" in Recovery Mode, the rest of the edit log will be discarded.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > Correct me if I am wrong: Recovery Mode without HDFS-3479 means the entire end-of-log is not checked and, therefore, the silent data loss length is not limited. It is even worst.

          No, that is incorrect.

          Recovery mode has always read up to the end of the log, and it always will. The confusion arises because sometimes we are not very good at determining where the "end of the log" is.

          I should be more specific: For Recovery Mode without HDFS-3479, if there is corruption in the middle of the edit log and the user chooses "stop reading" in recovery mode, then the remaining data of the edit log will not be checked. Is it correct?

          Show
          Tsz Wo Nicholas Sze added a comment - > Correct me if I am wrong: Recovery Mode without HDFS-3479 means the entire end-of-log is not checked and, therefore, the silent data loss length is not limited. It is even worst. No, that is incorrect. Recovery mode has always read up to the end of the log, and it always will. The confusion arises because sometimes we are not very good at determining where the "end of the log" is. I should be more specific: For Recovery Mode without HDFS-3479 , if there is corruption in the middle of the edit log and the user chooses "stop reading" in recovery mode, then the remaining data of the edit log will not be checked. Is it correct?
          Hide
          Colin Patrick McCabe added a comment -

          Correct me if I am wrong: Recovery Mode without HDFS-3479 means the entire end-of-log is not checked and, therefore, the silent data loss length is not limited. It is even worst.

          No, that is incorrect.

          Recovery mode has always read up to the end of the log, and it always will. The confusion arises because sometimes we are not very good at determining where the "end of the log" is.

          I filed and implemented HDFS-3479 because I noticed that in certain scenarios we would decide that the edit log ended before it really did because we spotted an OP_INVALID.

          The unchecked region which we've been discussing only applied to HDFS-3479 corruption, not to any other type of corruption. Frankly, the unchecked region was a mistake.

          However, none of this has anything to do with recovery mode. HDFS-3004 and HDFS-3479 were separate JIRAs, that implemented separate features.

          Show
          Colin Patrick McCabe added a comment - Correct me if I am wrong: Recovery Mode without HDFS-3479 means the entire end-of-log is not checked and, therefore, the silent data loss length is not limited. It is even worst. No, that is incorrect. Recovery mode has always read up to the end of the log, and it always will. The confusion arises because sometimes we are not very good at determining where the "end of the log" is. I filed and implemented HDFS-3479 because I noticed that in certain scenarios we would decide that the edit log ended before it really did because we spotted an OP_INVALID . The unchecked region which we've been discussing only applied to HDFS-3479 corruption, not to any other type of corruption. Frankly, the unchecked region was a mistake. However, none of this has anything to do with recovery mode. HDFS-3004 and HDFS-3479 were separate JIRAs, that implemented separate features.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I want to emphasize one thing here: UNCHECKED_REGION_LENGTH is not part of Recovery Mode. If you look at the history FSEditLog.java, you'll see that change 1325075 (HDFS-3055) introduced Recovery mode, but not UNCHECKED_REGION_LENGTH. This was introduced in HDFS-3479 (the backport of HDFS_3335 to branch-1). ...

          Correct me if I am wrong: Recovery Mode without HDFS-3479 means the entire end-of-log is not checked and, therefore, the silent data loss length is not limited. It is even worst.

          The only feature in branch-1 Recovery Mode is to let user to choose "stop reading" or "quit without saving". "Stop reading" may lead to silent data loss. "Quit without saving" is the same as NN failing to start up. Thus, I suggest to remove branch-1 Recovery Mode

          The padding length is going to be a megabyte at most. Since the edit log files are fairly large, you should have a good idea of what percentage through the file you are. ...

          A megabyte could contain hundreds or thousands of edit log transactions. It is catastrophic in production clusters. Also, the percentage here may not be relevant since the transactions at the end are the latest transactions and, very likely, are the most important transactions.

          Show
          Tsz Wo Nicholas Sze added a comment - I want to emphasize one thing here: UNCHECKED_REGION_LENGTH is not part of Recovery Mode. If you look at the history FSEditLog.java, you'll see that change 1325075 ( HDFS-3055 ) introduced Recovery mode, but not UNCHECKED_REGION_LENGTH. This was introduced in HDFS-3479 (the backport of HDFS_3335 to branch-1). ... Correct me if I am wrong: Recovery Mode without HDFS-3479 means the entire end-of-log is not checked and, therefore, the silent data loss length is not limited. It is even worst. The only feature in branch-1 Recovery Mode is to let user to choose "stop reading" or "quit without saving". "Stop reading" may lead to silent data loss. "Quit without saving" is the same as NN failing to start up. Thus, I suggest to remove branch-1 Recovery Mode The padding length is going to be a megabyte at most. Since the edit log files are fairly large, you should have a good idea of what percentage through the file you are. ... A megabyte could contain hundreds or thousands of edit log transactions. It is catastrophic in production clusters. Also, the percentage here may not be relevant since the transactions at the end are the latest transactions and, very likely, are the most important transactions.
          Hide
          Colin Patrick McCabe added a comment -

          Before HDFS-3521, there is a UNCHECKED_REGION_LENGTH in Recovery Mode. If a stray OP_INVALID byte is within the unchecked region, it will cause silent data loss.

          Nicholas, you didn't address the main point of my comment, which is that after HDFS-3521, if a stray OP_INVALID byte is found anywhere in the log, it will cause silent data loss-- unless the sysadmin configures dfs.namenode.edits.toleration.length to something other than the default. Based on your earlier comments, I think we both agree that this should not be the default. Let's fix this (independently of everything else were discussing here.)

          You still do not know the corruption length since there may be padding at the end. System admins won't know the padding length and so they won't be able to know the corruption length.

          The padding length is going to be a megabyte at most. Since the edit log files are fairly large, you should have a good idea of what percentage through the file you are. If you have an idea for improving the error messages of FSEditLog.java, perhaps you should file a JIRA for that? It's not directly relevant here, though, since all methods of manual recovery will face the same issues.

          Before HDFS-3521, there is a UNCHECKED_REGION_LENGTH in Recovery Mode...

          I want to emphasize one thing here: UNCHECKED_REGION_LENGTH is not part of Recovery Mode. If you look at the history FSEditLog.java, you'll see that change 1325075 (HDFS-3055) introduced Recovery mode, but not UNCHECKED_REGION_LENGTH. This was introduced in HDFS-3479 (the backport of HDFS_3335 to branch-1). Please see this comment, introduced by HDFS-3479:

          +    /** The end of the edit log should contain only 0x00 or 0xff bytes.
          +     * If it contains other bytes, the log itself may be corrupt.
          +     * It is important to check this; if we don't, a stray OP_INVALID byte
          +     * could make us stop reading the edit log halfway through, and we'd never
          +     * know that we had lost data.
          +     *
          +     * We don't check the very last part of the edit log, in case the
          +     * NameNode crashed while writing to the edit log.
          +     */
          

          I encourage anyone interested in this to check out the history of FSEditLog.java. It's a very good guide and it will make understanding this discussion much easier.

          As I said before, I still think we should get rid of the unchecked region altogether. But this has nothing to do with Recovery Mode, it has to do with HDFS-3479.

          Show
          Colin Patrick McCabe added a comment - Before HDFS-3521 , there is a UNCHECKED_REGION_LENGTH in Recovery Mode. If a stray OP_INVALID byte is within the unchecked region, it will cause silent data loss. Nicholas, you didn't address the main point of my comment, which is that after HDFS-3521 , if a stray OP_INVALID byte is found anywhere in the log, it will cause silent data loss-- unless the sysadmin configures dfs.namenode.edits.toleration.length to something other than the default. Based on your earlier comments, I think we both agree that this should not be the default. Let's fix this (independently of everything else were discussing here.) You still do not know the corruption length since there may be padding at the end. System admins won't know the padding length and so they won't be able to know the corruption length. The padding length is going to be a megabyte at most. Since the edit log files are fairly large, you should have a good idea of what percentage through the file you are. If you have an idea for improving the error messages of FSEditLog.java , perhaps you should file a JIRA for that? It's not directly relevant here, though, since all methods of manual recovery will face the same issues. Before HDFS-3521 , there is a UNCHECKED_REGION_LENGTH in Recovery Mode... I want to emphasize one thing here: UNCHECKED_REGION_LENGTH is not part of Recovery Mode. If you look at the history FSEditLog.java , you'll see that change 1325075 ( HDFS-3055 ) introduced Recovery mode, but not UNCHECKED_REGION_LENGTH . This was introduced in HDFS-3479 (the backport of HDFS_3335 to branch-1). Please see this comment, introduced by HDFS-3479 : + /** The end of the edit log should contain only 0x00 or 0xff bytes. + * If it contains other bytes, the log itself may be corrupt. + * It is important to check this ; if we don't, a stray OP_INVALID byte + * could make us stop reading the edit log halfway through, and we'd never + * know that we had lost data. + * + * We don't check the very last part of the edit log, in case the + * NameNode crashed while writing to the edit log. + */ I encourage anyone interested in this to check out the history of FSEditLog.java . It's a very good guide and it will make understanding this discussion much easier. As I said before, I still think we should get rid of the unchecked region altogether. But this has nothing to do with Recovery Mode, it has to do with HDFS-3479 .
          Hide
          Colin Patrick McCabe added a comment -

          It seems to me that recovery mode and edit log toleration serve different purposes. The latter is necessary for an HA setup, where admin explicitly set a small toleration length for tail corruption. The former is useless in an HA setup and suitable for manual recovery.

          "Edit log toleration" is not necessary for an HA setup. In fact, it is impossible to configure "edit log toleration" together with an HA setup, because "edit log toleration" is only available in branch-1 (but not later branches), and HA is only available in branch-2 and later.

          Edit log toleration is adequate as is. Recovery mode needs more patches (more details of errors etc.) to serve the interactive recovery use case better.

          Patches are welcome. Check out the design doc for HDFS-3004, which gives an overview:
          https://issues.apache.org/jira/secure/attachment/12542798/recovery-mode.pdf

          Show
          Colin Patrick McCabe added a comment - It seems to me that recovery mode and edit log toleration serve different purposes. The latter is necessary for an HA setup, where admin explicitly set a small toleration length for tail corruption. The former is useless in an HA setup and suitable for manual recovery. "Edit log toleration" is not necessary for an HA setup. In fact, it is impossible to configure "edit log toleration" together with an HA setup, because "edit log toleration" is only available in branch-1 (but not later branches), and HA is only available in branch-2 and later. Edit log toleration is adequate as is. Recovery mode needs more patches (more details of errors etc.) to serve the interactive recovery use case better. Patches are welcome. Check out the design doc for HDFS-3004 , which gives an overview: https://issues.apache.org/jira/secure/attachment/12542798/recovery-mode.pdf
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Recovery mode will always prompt before doing anything which could lead to data loss. So no, stray OP_INVALID bytes will not lead to silent data loss.

          Actually, looking at change 1349086, which was introduced by HDFS-3521, I see that it broke end-of-file checking by default. Since dfs.namenode.edits.toleration.length is -1 by default, FSEditLog#checkEndOfLog is never invoked. However, this is not a problem with Recovery Mode; it's a problem with change 1349086.

          Before HDFS-3521, there is a UNCHECKED_REGION_LENGTH in Recovery Mode. If a stray OP_INVALID byte is within the unchecked region, it will cause silent data loss.

          Recovery Mode does consider the corruption length. The location at which the problem occurred is printed out. This is the message "Failed to parse edit log (<file name>) at position <position>, edit log length is <length>..." This information is provided to allow the system administrator to make an informed decision.

          You still do not know the corruption length since there may be padding at the end. System admins won't know the padding length and so they won't be able to know the corruption length.

          Show
          Tsz Wo Nicholas Sze added a comment - Recovery mode will always prompt before doing anything which could lead to data loss. So no, stray OP_INVALID bytes will not lead to silent data loss. Actually, looking at change 1349086, which was introduced by HDFS-3521 , I see that it broke end-of-file checking by default. Since dfs.namenode.edits.toleration.length is -1 by default, FSEditLog#checkEndOfLog is never invoked. However, this is not a problem with Recovery Mode; it's a problem with change 1349086. Before HDFS-3521 , there is a UNCHECKED_REGION_LENGTH in Recovery Mode. If a stray OP_INVALID byte is within the unchecked region, it will cause silent data loss. Recovery Mode does consider the corruption length. The location at which the problem occurred is printed out. This is the message "Failed to parse edit log (<file name>) at position <position>, edit log length is <length>..." This information is provided to allow the system administrator to make an informed decision. You still do not know the corruption length since there may be padding at the end. System admins won't know the padding length and so they won't be able to know the corruption length.
          Hide
          Luke Lu added a comment -

          It seems to me that recovery mode and edit log toleration serve different purposes. The latter is necessary for an HA setup, where admin explicitly set a small toleration length for tail corruption. The former is useless in an HA setup and suitable for manual recovery.

          Edit log toleration is adequate as is. Recovery mode needs more patches (more details of errors etc.) to serve the interactive recovery use case better.

          Show
          Luke Lu added a comment - It seems to me that recovery mode and edit log toleration serve different purposes. The latter is necessary for an HA setup, where admin explicitly set a small toleration length for tail corruption. The former is useless in an HA setup and suitable for manual recovery. Edit log toleration is adequate as is. Recovery mode needs more patches (more details of errors etc.) to serve the interactive recovery use case better.
          Hide
          Colin Patrick McCabe added a comment -

          If I have not missed anything, there are two risks in the branch-1 Recovery Mode feature: If there is a stray OP_INVALID byte, it could be misinterpreted as an end-of-log and lead to silent data loss.

          Recovery mode will always prompt before doing anything which could lead to data loss. So no, stray OP_INVALID bytes will not lead to silent data loss.

          Actually, looking at change 1349086, which was introduced by HDFS-3521, I see that it broke end-of-file checking by default. Since dfs.namenode.edits.toleration.length is -1 by default, FSEditLog#checkEndOfLog is never invoked. However, this is not a problem with Recovery Mode; it's a problem with change 1349086.

          Recovery Mode does not consider the corruption length.

          Recovery Mode does consider the corruption length. The location at which the problem occurred is printed out. This is the message "Failed to parse edit log (<file name>) at position <position>, edit log length is <length>..." This information is provided to allow the system administrator to make an informed decision.

          Therefore, I suggest to remove Recovery Mode from branch-1 and change the default toleration length to 0.

          Recovery mode has already proven itself useful in the field in code lines derived from branch-1. I don't see any reason to remove it.

          I agree that dfs.namenode.edits.toleration.length should be 0 by default.

          At the end of the day, both edit log toleration and Recovery Mode can cause data loss. The difference is that Recovery Mode will prompt the system administrator before hand, and edit log toleration will not. This is the reason why I opposed edit log toleration originally, and it's the reason why I believe it should be off by default now. Silent data loss is not a feature-- not one that we want, anyway.

          Show
          Colin Patrick McCabe added a comment - If I have not missed anything, there are two risks in the branch-1 Recovery Mode feature: If there is a stray OP_INVALID byte, it could be misinterpreted as an end-of-log and lead to silent data loss. Recovery mode will always prompt before doing anything which could lead to data loss. So no, stray OP_INVALID bytes will not lead to silent data loss. Actually, looking at change 1349086, which was introduced by HDFS-3521 , I see that it broke end-of-file checking by default. Since dfs.namenode.edits.toleration.length is -1 by default, FSEditLog#checkEndOfLog is never invoked. However, this is not a problem with Recovery Mode; it's a problem with change 1349086. Recovery Mode does not consider the corruption length. Recovery Mode does consider the corruption length. The location at which the problem occurred is printed out. This is the message "Failed to parse edit log (<file name>) at position <position>, edit log length is <length>..." This information is provided to allow the system administrator to make an informed decision. Therefore, I suggest to remove Recovery Mode from branch-1 and change the default toleration length to 0. Recovery mode has already proven itself useful in the field in code lines derived from branch-1. I don't see any reason to remove it. I agree that dfs.namenode.edits.toleration.length should be 0 by default. At the end of the day, both edit log toleration and Recovery Mode can cause data loss. The difference is that Recovery Mode will prompt the system administrator before hand, and edit log toleration will not. This is the reason why I opposed edit log toleration originally, and it's the reason why I believe it should be off by default now. Silent data loss is not a feature-- not one that we want, anyway.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          If I have not missed anything, there are two risks in the branch-1 Recovery Mode feature:

          1. If there is a stray OP_INVALID byte, it could be misinterpreted as an end-of-log and lead to silent data loss.
          2. Recovery Mode does not consider the corruption length. If an edit log is corrupted in the beginning and the admin mistakenly selects "stop reading" in Recovery Mode, then a large portion of the edit log is ignored. It could cause unnecessary data loss even if the edit log has been backed up since datanodes will delete data. In many cases, such data loss could be prevented or reduced because the edit log could possibly be recovered by other means. This case arguably is an operation mistake. However, Recovery Mode enables such mistake.

          The Edit Log Toleration feature does not have these two risks if the toleration length is set to 0 (or a small number). Edit Log Toleration always checks all bytes in the edit log, so #1 won't happen. For #2, the length of corrupted data being tolerated is limited by the toleration length. If an edit log is corrupted in the beginning and the corrupted length is large, then it will throw an exception.

          Therefore, I suggest to remove Recovery Mode from branch-1 and change the default toleration length to 0.

          Show
          Tsz Wo Nicholas Sze added a comment - If I have not missed anything, there are two risks in the branch-1 Recovery Mode feature: If there is a stray OP_INVALID byte, it could be misinterpreted as an end-of-log and lead to silent data loss. Recovery Mode does not consider the corruption length. If an edit log is corrupted in the beginning and the admin mistakenly selects "stop reading" in Recovery Mode, then a large portion of the edit log is ignored. It could cause unnecessary data loss even if the edit log has been backed up since datanodes will delete data. In many cases, such data loss could be prevented or reduced because the edit log could possibly be recovered by other means. This case arguably is an operation mistake. However, Recovery Mode enables such mistake. The Edit Log Toleration feature does not have these two risks if the toleration length is set to 0 (or a small number). Edit Log Toleration always checks all bytes in the edit log, so #1 won't happen. For #2, the length of corrupted data being tolerated is limited by the toleration length. If an edit log is corrupted in the beginning and the corrupted length is large, then it will throw an exception. Therefore, I suggest to remove Recovery Mode from branch-1 and change the default toleration length to 0.
          Hide
          Colin Patrick McCabe added a comment -

          Hi Nicholas,

          Your summary seems reasonable to me overall. I agree with you that the recommended setting for edit log toleration should be disabled. Is there anything left to do for this JIRA?

          Show
          Colin Patrick McCabe added a comment - Hi Nicholas, Your summary seems reasonable to me overall. I agree with you that the recommended setting for edit log toleration should be disabled. Is there anything left to do for this JIRA?
          Hide
          Tsz Wo Nicholas Sze added a comment -

          If the edit log is not corrupted, both recovery mode and edit log toleration are not useful. Note that recovery mode here means recovery mode in branch-1 but not the one in trunk.

          When an edit log is corrupted, NN cannot start up normally. We compare recovery mode and edit log toleration below.

          Recovery Mode

          • "Recovery" here means starting NN with a corrupted edit log. It is unable to recover the corrupted edit log or transactions.
          • There is a namenode command option "hadoop namenode -recover" to enter recovery mode.
          • When reading the first corrupted transaction in the edit log, it prompts the admin to either "stop reading" or "quit without saving".
          • If "stop reading" is selected, NN ignores the remaining edit log (from the first corrupted transaction to the end of the edit log) and then starts up as usual.
          • There is a "-force" option to FORCE_FIRST_CHOICE, i.e. it is a non-interactive mode.
          • If there is a stray OP_INVALID byte, it could be misinterpreted as an end-of-log and lead to silent data loss. Recovery Mode does not help.
            (Please help out if I have missed anything.)

          Edit Log Toleration

          • It has a conf property "dfs.namenode.edits.toleration.length" for setting the toleration length.
          • The default toleration length is -1, i.e. disable it. The feature is enabled when the value >= 0.
          • When the feature is enabled, it always reads the entire edit log, computes read length, corruption length and padding length and shows the following summary
            2012-08-27 22:04:38,625 INFO  - Checked the bytes after the end of edit log
              (/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits):
            2012-08-27 22:04:38,625 INFO  -   Padding position  = 876 (-1 means padding not found)
            2012-08-27 22:04:38,625 INFO  -   Edit log length   = 1065
            2012-08-27 22:04:38,625 INFO  -   Read length       = 168
            2012-08-27 22:04:38,625 INFO  -   Corruption length = 708
            2012-08-27 22:04:38,625 INFO  -   Toleration length = 1024 (= dfs.namenode.edits.toleration.length)
            2012-08-27 22:04:38,626 INFO  - Summary: |---------- Read=168 ----------|-- Corrupt=708 --|-- Pad=189 --|
            2012-08-27 22:04:38,626 WARN  - Edit log corruption detected:
              corruption length = 708 <= toleration length = 1024; the corruption is tolerable.
            
          • When toleration length is set to 0, it makes sure that there is no corruption in the entire log, including the padding. A stray OP_INVALID byte won't be misinterpreted as an end-of-log.
          • When toleration length is set to >0, NN starts up only if corruption length <= toleration length. If corruption length > toleration length, it throws an exception as below
            2012-08-27 22:04:39,123 INFO  - Start checking end of edit log (/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits) ...
            2012-08-27 22:04:39,123 DEBUG - found: bytes[0]=0xFF=pad, firstPadPos=169
            2012-08-27 22:04:39,123 DEBUG - reset: bytes[1410]=0xAB, pad=0xFF
            2012-08-27 22:04:39,124 DEBUG - found: bytes[1411]=0xFF=pad, firstPadPos=1580
            2012-08-27 22:04:39,124 INFO  - Checked the bytes after the end of edit log (/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits):
            2012-08-27 22:04:39,124 INFO  -   Padding position  = 1580 (-1 means padding not found)
            2012-08-27 22:04:39,124 INFO  -   Edit log length   = 2638
            2012-08-27 22:04:39,124 INFO  -   Read length       = 169
            2012-08-27 22:04:39,124 INFO  -   Corruption length = 1411
            2012-08-27 22:04:39,124 INFO  -   Toleration length = 1024 (= dfs.namenode.edits.toleration.length)
            2012-08-27 22:04:39,125 INFO  - Summary: |---------- Read=169 ----------|-- Corrupt=1411 --|-- Pad=1058 --|
            2012-08-27 22:04:39,125 ERROR - FSNamesystem initialization failed.
            java.io.IOException: Edit log corruption detected:
              corruption length = 1411 > toleration length = 1024; the corruption is intolerable.
            	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.checkEndOfLog(FSEditLog.java:609)
            	...
            
          • Therefore, the recommanded setting is to set the conf to 0 (or a small number). When corruption is detected (i.e. NN cannot start up), the corruption length can be read from the log. Then, admin could decide whether they are willing to tolerate the corruption or they could try to recover the edit log by other means.
          Show
          Tsz Wo Nicholas Sze added a comment - If the edit log is not corrupted, both recovery mode and edit log toleration are not useful. Note that recovery mode here means recovery mode in branch-1 but not the one in trunk. When an edit log is corrupted, NN cannot start up normally. We compare recovery mode and edit log toleration below. Recovery Mode "Recovery" here means starting NN with a corrupted edit log. It is unable to recover the corrupted edit log or transactions. There is a namenode command option "hadoop namenode -recover" to enter recovery mode. When reading the first corrupted transaction in the edit log, it prompts the admin to either "stop reading" or "quit without saving". If "stop reading" is selected, NN ignores the remaining edit log (from the first corrupted transaction to the end of the edit log) and then starts up as usual. There is a "-force" option to FORCE_FIRST_CHOICE, i.e. it is a non-interactive mode. If there is a stray OP_INVALID byte, it could be misinterpreted as an end-of-log and lead to silent data loss. Recovery Mode does not help. (Please help out if I have missed anything.) Edit Log Toleration It has a conf property "dfs.namenode.edits.toleration.length" for setting the toleration length. The default toleration length is -1, i.e. disable it. The feature is enabled when the value >= 0. When the feature is enabled, it always reads the entire edit log, computes read length, corruption length and padding length and shows the following summary 2012-08-27 22:04:38,625 INFO - Checked the bytes after the end of edit log (/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits): 2012-08-27 22:04:38,625 INFO - Padding position = 876 (-1 means padding not found) 2012-08-27 22:04:38,625 INFO - Edit log length = 1065 2012-08-27 22:04:38,625 INFO - Read length = 168 2012-08-27 22:04:38,625 INFO - Corruption length = 708 2012-08-27 22:04:38,625 INFO - Toleration length = 1024 (= dfs.namenode.edits.toleration.length) 2012-08-27 22:04:38,626 INFO - Summary: |---------- Read=168 ----------|-- Corrupt=708 --|-- Pad=189 --| 2012-08-27 22:04:38,626 WARN - Edit log corruption detected: corruption length = 708 <= toleration length = 1024; the corruption is tolerable. When toleration length is set to 0, it makes sure that there is no corruption in the entire log, including the padding. A stray OP_INVALID byte won't be misinterpreted as an end-of-log. When toleration length is set to >0, NN starts up only if corruption length <= toleration length. If corruption length > toleration length, it throws an exception as below 2012-08-27 22:04:39,123 INFO - Start checking end of edit log (/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits) ... 2012-08-27 22:04:39,123 DEBUG - found: bytes[0]=0xFF=pad, firstPadPos=169 2012-08-27 22:04:39,123 DEBUG - reset: bytes[1410]=0xAB, pad=0xFF 2012-08-27 22:04:39,124 DEBUG - found: bytes[1411]=0xFF=pad, firstPadPos=1580 2012-08-27 22:04:39,124 INFO - Checked the bytes after the end of edit log (/Users/szetszwo/hadoop/b-1/build/test/data/dfs/name1/current/edits): 2012-08-27 22:04:39,124 INFO - Padding position = 1580 (-1 means padding not found) 2012-08-27 22:04:39,124 INFO - Edit log length = 2638 2012-08-27 22:04:39,124 INFO - Read length = 169 2012-08-27 22:04:39,124 INFO - Corruption length = 1411 2012-08-27 22:04:39,124 INFO - Toleration length = 1024 (= dfs.namenode.edits.toleration.length) 2012-08-27 22:04:39,125 INFO - Summary: |---------- Read=169 ----------|-- Corrupt=1411 --|-- Pad=1058 --| 2012-08-27 22:04:39,125 ERROR - FSNamesystem initialization failed. java.io.IOException: Edit log corruption detected: corruption length = 1411 > toleration length = 1024; the corruption is intolerable. at org.apache.hadoop.hdfs.server.namenode.FSEditLog.checkEndOfLog(FSEditLog.java:609) ... Therefore, the recommanded setting is to set the conf to 0 (or a small number). When corruption is detected (i.e. NN cannot start up), the corruption length can be read from the log. Then, admin could decide whether they are willing to tolerate the corruption or they could try to recover the edit log by other means.
          Hide
          Colin Patrick McCabe added a comment -

          Why and How does one recover from such condition?

          Using recovery mode.

          We've used this in the field with customer problems, and it has worked very well.

          Show
          Colin Patrick McCabe added a comment - Why and How does one recover from such condition? Using recovery mode. We've used this in the field with customer problems, and it has worked very well.
          Hide
          Suresh Srinivas added a comment -

          I think rather than tolerating a configurable amount of edit log corruption, we should just never tolerate any corruption.

          Why and How does one recover from such condition?

          Show
          Suresh Srinivas added a comment - I think rather than tolerating a configurable amount of edit log corruption, we should just never tolerate any corruption. Why and How does one recover from such condition?
          Hide
          Colin Patrick McCabe added a comment -

          Colin: ... I do not think any people like that exist!

          Nicholas: As mentioned by Suresh, edit log toleration is a very useful [feature].

          Under what circumstances would you turn this on?

          Show
          Colin Patrick McCabe added a comment - Colin: ... I do not think any people like that exist! Nicholas: As mentioned by Suresh, edit log toleration is a very useful [feature] . Under what circumstances would you turn this on?
          Hide
          Colin Patrick McCabe added a comment -

          Let's keep our discussion on JIRA so that other people could comment on it. I have filed HDFS-3540 for this purpose.

          Ok. Since I noticed that there are a bunch of people watching HDFS-3521, but nobody watching HDFS-3540 (yet?), I hope you don't mind if I repost this comment twice. I do apologize for the potential inbox spam.

          I think rather than tolerating a configurable amount of edit log corruption, we should just never tolerate any corruption.

          Basically, what I am saying is that UNCHECKED_REGION_LENGTH was a mistake. It was my mistake, and I don't blame you for trying to fix it. However, the current fix still assumes that there is someone who would want UNCHECKED_REGION_LENGTH. I do not think any people like that exist!

          Does that make sense? I can post a patch if that would be helpful.

          Show
          Colin Patrick McCabe added a comment - Let's keep our discussion on JIRA so that other people could comment on it. I have filed HDFS-3540 for this purpose. Ok. Since I noticed that there are a bunch of people watching HDFS-3521 , but nobody watching HDFS-3540 (yet?), I hope you don't mind if I repost this comment twice. I do apologize for the potential inbox spam. I think rather than tolerating a configurable amount of edit log corruption, we should just never tolerate any corruption. Basically, what I am saying is that UNCHECKED_REGION_LENGTH was a mistake. It was my mistake, and I don't blame you for trying to fix it. However, the current fix still assumes that there is someone who would want UNCHECKED_REGION_LENGTH. I do not think any people like that exist! Does that make sense? I can post a patch if that would be helpful.

            People

            • Assignee:
              Tsz Wo Nicholas Sze
              Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development