Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-3521

Allow namenode to tolerate edit log corruption

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.3
    • Fix Version/s: 1.2.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      HDFS-3479 adds checking for edit log corruption. It uses a fixed UNCHECKED_REGION_LENGTH (=PREALLOCATION_LENGTH) so that the bytes at the end within the length is not checked. Instead of not checking the bytes, we should check everything and allow toleration.

      1. h3521_20120611_b-1.patch
        41 kB
        Tsz Wo Nicholas Sze
      2. h3521_20120611_b-1.0.patch
        38 kB
        Tsz Wo Nicholas Sze
      3. h3521_20120610_b-1.patch
        37 kB
        Tsz Wo Nicholas Sze

        Issue Links

          Activity

          Hide
          Tsz Wo Nicholas Sze added a comment -

          h3521_20120610_b-1.patch: 1st patch.

          Show
          Tsz Wo Nicholas Sze added a comment - h3521_20120610_b-1.patch: 1st patch.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Ran all unit tests. TestFileCreation failed with and without the patch; filed HDFS-3523.

          Show
          Tsz Wo Nicholas Sze added a comment - Ran all unit tests. TestFileCreation failed with and without the patch; filed HDFS-3523 .
          Hide
          Tsz Wo Nicholas Sze added a comment -

          h3521_20120611_b-1.patch: SecondaryNameNode should not have any toleration. Thanks Suresh for pointing it out.

          Show
          Tsz Wo Nicholas Sze added a comment - h3521_20120611_b-1.patch: SecondaryNameNode should not have any toleration. Thanks Suresh for pointing it out.
          Hide
          Suresh Srinivas added a comment -

          +1 for the patch. Thanks for fixing the secondary namenode issue.

          Show
          Suresh Srinivas added a comment - +1 for the patch. Thanks for fixing the secondary namenode issue.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Ran all the tests again. The result was the same. Also checked it with test-patch.

          I have committed this to branch-1.

          Show
          Tsz Wo Nicholas Sze added a comment - Ran all the tests again. The result was the same. Also checked it with test-patch. I have committed this to branch-1.
          Hide
          Colin Patrick McCabe added a comment -

          In general, this is an extremely large patch that has at least a few different major components:
          1. add in.mark(TRANSACTION_LENGTH_LIMIT = 4GB).
          2. add a configuration option for "edits toleration length."
          3. some generic code cleanups (add a LOG object to FSEditLog, re-arrange import directives in FSEditLog to avoid import *, rename EditLogOutputStream#fill to EditLogOutputStream#PREALLOCATION_BUFFER)
          4. disable edit log toleration for the SecondaryNameNode

          I think any one of these could have been a separate patch. Putting them all together doesn't seem wise. But let me go step-by-step:

          1. add in.mark(TRANSACTION_LENGTH_LIMIT = 4GB).
          What's the rationale behind this change? It seems to break recovery mode, unless "edit log toleration" is explicitly turned off. Also, because of the way Java streams are implemented, this potentially means that we could have up to a 4GB buffer.

          This diverges from the way we handle a similar problem in upstream. Check out the StreamLimiter interface if you want to see how upstream handles a similar problem (but again, I'm not sure what problem this is solving, so this is just a guess on my part.)

          2. add a configuration option for "edits toleration length."

          The UNCHECKED_REGION_LENGTH was carefully selected to be the same as the preallocation length. The issues is that if the preallocation length is 1MB, you know that you will never encounter a partial write more than 1MB from the end of the file.

          Reasoning: if there was a partial write, either:
          1. It is in a region which you previously pre-allocated. That means it's in the last 1MB of the file, since we never allocate more than 1MB.
          2. It is NOT in a region which you previously pre-allocated. This means it almost certainly is at the very end of the file.

          Case #2 should almost never happen in practice, unless the NameNode writes an extremely huge batch of edit log operations. In both cases, the partial write should be in the last 1MB. That's the reason why the constant was chosen. It is not something that the user can or should configure.

          3. code cleanups
          Some of these are ok in principle, but they should be in a separate change. Also, this patch introduces a few misspellings.

          4. disable edit log toleration for the SecondaryNameNode
          I'm ok with this, but we don't need the rest of this patch.

          This patch is seriously huge, combines unrelated things, breaks recovery mode, and diverges from upstream. I think we need to revisit this.

          Show
          Colin Patrick McCabe added a comment - In general, this is an extremely large patch that has at least a few different major components: 1. add in.mark(TRANSACTION_LENGTH_LIMIT = 4GB). 2. add a configuration option for "edits toleration length." 3. some generic code cleanups (add a LOG object to FSEditLog, re-arrange import directives in FSEditLog to avoid import *, rename EditLogOutputStream#fill to EditLogOutputStream#PREALLOCATION_BUFFER) 4. disable edit log toleration for the SecondaryNameNode I think any one of these could have been a separate patch. Putting them all together doesn't seem wise. But let me go step-by-step: 1. add in.mark(TRANSACTION_LENGTH_LIMIT = 4GB). What's the rationale behind this change? It seems to break recovery mode, unless "edit log toleration" is explicitly turned off. Also, because of the way Java streams are implemented, this potentially means that we could have up to a 4GB buffer. This diverges from the way we handle a similar problem in upstream. Check out the StreamLimiter interface if you want to see how upstream handles a similar problem (but again, I'm not sure what problem this is solving, so this is just a guess on my part.) 2. add a configuration option for "edits toleration length." The UNCHECKED_REGION_LENGTH was carefully selected to be the same as the preallocation length. The issues is that if the preallocation length is 1MB, you know that you will never encounter a partial write more than 1MB from the end of the file. Reasoning: if there was a partial write, either: 1. It is in a region which you previously pre-allocated. That means it's in the last 1MB of the file, since we never allocate more than 1MB. 2. It is NOT in a region which you previously pre-allocated. This means it almost certainly is at the very end of the file. Case #2 should almost never happen in practice, unless the NameNode writes an extremely huge batch of edit log operations. In both cases, the partial write should be in the last 1MB. That's the reason why the constant was chosen. It is not something that the user can or should configure. 3. code cleanups Some of these are ok in principle, but they should be in a separate change. Also, this patch introduces a few misspellings. 4. disable edit log toleration for the SecondaryNameNode I'm ok with this, but we don't need the rest of this patch. This patch is seriously huge, combines unrelated things, breaks recovery mode, and diverges from upstream. I think we need to revisit this.
          Hide
          Aaron T. Myers added a comment -

          I also don't understand why this patch was only committed to branch-1. Seems like whatever issue(s) this addresses should also be present in trunk/branch-2, unless I'm missing something.

          Show
          Aaron T. Myers added a comment - I also don't understand why this patch was only committed to branch-1. Seems like whatever issue(s) this addresses should also be present in trunk/branch-2, unless I'm missing something.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > I also don't understand why this patch was only committed to branch-1. ...

          It is because only branch-1 has the UNCHECKED_REGION_LENGTH stuff as Colin mentioned earlier.

          Show
          Tsz Wo Nicholas Sze added a comment - > I also don't understand why this patch was only committed to branch-1. ... It is because only branch-1 has the UNCHECKED_REGION_LENGTH stuff as Colin mentioned earlier.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > In general, this is an extremely large patch ...

          41kB is extremely large? Really?

          > ... this potentially means that we could have up to a 4GB buffer.

          It happens only if there is a single 4GB transaction.

          > ... you will never encounter a partial write more than 1MB from the end of the file.

          The edit log may be corrupted.

          > ..., but they should be in a separate change. Also, this patch introduces a few misspellings.

          Why they should be separated? They are only simple cleanup. For the misspellings, it would be great if you can point out.

          Show
          Tsz Wo Nicholas Sze added a comment - > In general, this is an extremely large patch ... 41kB is extremely large? Really? > ... this potentially means that we could have up to a 4GB buffer. It happens only if there is a single 4GB transaction. > ... you will never encounter a partial write more than 1MB from the end of the file. The edit log may be corrupted. > ..., but they should be in a separate change. Also, this patch introduces a few misspellings. Why they should be separated? They are only simple cleanup. For the misspellings, it would be great if you can point out.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          h3521_20120611_b-1.0.patch: for branch-1.0 (may not be committed.)

          Show
          Tsz Wo Nicholas Sze added a comment - h3521_20120611_b-1.0.patch: for branch-1.0 (may not be committed.)
          Hide
          Colin Patrick McCabe added a comment -

          cmccabe: ... you will never encounter a partial write more than 1MB from the end of the file.

          szetswo: The edit log may be corrupted.

          If the edit log is corrupted, we never want to ignore the corruption.

          UNCHECKED_REGION_LENGTH was never about ignoring corruption in general. It was about ignoring sentinel bytes followed by non-sentinel data.

          I think we need to revert this and have a longer discussion about this. I'm really scared by what this change could do. I know for sure that it breaks recovery mode, and it also appears to degrade edit log integrity too.

          Show
          Colin Patrick McCabe added a comment - cmccabe: ... you will never encounter a partial write more than 1MB from the end of the file. szetswo: The edit log may be corrupted. If the edit log is corrupted, we never want to ignore the corruption. UNCHECKED_REGION_LENGTH was never about ignoring corruption in general. It was about ignoring sentinel bytes followed by non-sentinel data. I think we need to revert this and have a longer discussion about this. I'm really scared by what this change could do. I know for sure that it breaks recovery mode, and it also appears to degrade edit log integrity too.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          If corruption toleration is undesirable, the feature could be turned off.

          UNCHECKED_REGION_LENGTH is to skip checking the last 1MB. If this region is checked, corruption could be found. Skipping checking may simply lead to undetected corruption but not the same as no corruption.

          Show
          Tsz Wo Nicholas Sze added a comment - If corruption toleration is undesirable, the feature could be turned off. UNCHECKED_REGION_LENGTH is to skip checking the last 1MB. If this region is checked, corruption could be found. Skipping checking may simply lead to undetected corruption but not the same as no corruption.
          Hide
          Colin Patrick McCabe added a comment -

          Nicholas, nobody wants corruption to be tolerated. Nobody is going to turn this on. The whole point of HDFS is reliability, and this is negated if we start tolerating corruption in the edit log.

          When we came up with recovery mode, we very specifically made it optional, not something that was on by default. You were a part of those discussions too. You should consult the design documents to see the choices we made at that time.

          We have to revert this because tolerating corruption is unacceptable. Also, this patch breaks recovery mode unless toleration is turned off. Broken patches have to get reverted!

          I'm sure we can come up with something that addresses your concerns. But let's take our time and do it right. branch-1 is a mature branch which should only be accepting very well-reviewed and conservative changes at this point.

          Show
          Colin Patrick McCabe added a comment - Nicholas, nobody wants corruption to be tolerated. Nobody is going to turn this on. The whole point of HDFS is reliability, and this is negated if we start tolerating corruption in the edit log. When we came up with recovery mode, we very specifically made it optional, not something that was on by default. You were a part of those discussions too. You should consult the design documents to see the choices we made at that time. We have to revert this because tolerating corruption is unacceptable. Also, this patch breaks recovery mode unless toleration is turned off. Broken patches have to get reverted! I'm sure we can come up with something that addresses your concerns. But let's take our time and do it right. branch-1 is a mature branch which should only be accepting very well-reviewed and conservative changes at this point.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > ..., nobody wants corruption to be tolerated. ...

          This is not true.

          > When we came up with recovery mode, we very specifically made it optional, not something that was on by default. ...

          Mmm ... the edit log toleration default is also disabled. Don't you notice it?

          Show
          Tsz Wo Nicholas Sze added a comment - > ..., nobody wants corruption to be tolerated. ... This is not true. > When we came up with recovery mode, we very specifically made it optional, not something that was on by default. ... Mmm ... the edit log toleration default is also disabled. Don't you notice it?
          Hide
          Aaron T. Myers added a comment -

          Hey Nicholas, I'm still not entirely convinced of the motivation for this change, and I agree with Colin that we should be careful with respect to what impact this change will have on recovery mode. Given that, I do think we should revert this commit until everyone's satisfied with the changes it makes.

          I'm sure we can figure out a mutually-agreeable way to address the concerns you have, but let's talk it out for a bit first?

          Show
          Aaron T. Myers added a comment - Hey Nicholas, I'm still not entirely convinced of the motivation for this change, and I agree with Colin that we should be careful with respect to what impact this change will have on recovery mode. Given that, I do think we should revert this commit until everyone's satisfied with the changes it makes. I'm sure we can figure out a mutually-agreeable way to address the concerns you have, but let's talk it out for a bit first?
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Hi Aaron,

          Recovery mode in branch-1 just let you stop reading or quit without saving. It does not really "recover" anything. IMO, it does not seem useful. How about we revert it?

          Again, the edit log toleration can be disabled. I think there is no reason to revert the patch.

          If you see that other improvements could be made regarding to recovery mode or edit log toleration, we could discuss them in another JIRA.

          Show
          Tsz Wo Nicholas Sze added a comment - Hi Aaron, Recovery mode in branch-1 just let you stop reading or quit without saving. It does not really "recover" anything. IMO, it does not seem useful. How about we revert it? Again, the edit log toleration can be disabled. I think there is no reason to revert the patch. If you see that other improvements could be made regarding to recovery mode or edit log toleration, we could discuss them in another JIRA.
          Hide
          Aaron T. Myers added a comment -

          Recovery mode in branch-1 just let you stop reading or quit without saving. It does not really "recover" anything. IMO, it does not seem useful. How about we revert it?

          I find the option to stop reading exceptionally helpful, since we know that under certain circumstances the NN will end up with a partial final transaction at the end of the edit log, which will prevent the NN from running. I've already used it to recover a customer's corrupted edits file that had a partial final transaction at the end of it because a disk on the NN had filled up.

          Again, the edit log toleration can be disabled. I think there is no reason to revert the patch.

          Clearly not everyone agrees with the approach this patch took, myself included. I'm still not even convinced that the option this patch introduces is an improvement, even if it can be disabled. Let's please try to find a way to address whatever concerns you have that's agreeable to everyone.

          Show
          Aaron T. Myers added a comment - Recovery mode in branch-1 just let you stop reading or quit without saving. It does not really "recover" anything. IMO, it does not seem useful. How about we revert it? I find the option to stop reading exceptionally helpful, since we know that under certain circumstances the NN will end up with a partial final transaction at the end of the edit log, which will prevent the NN from running. I've already used it to recover a customer's corrupted edits file that had a partial final transaction at the end of it because a disk on the NN had filled up. Again, the edit log toleration can be disabled. I think there is no reason to revert the patch. Clearly not everyone agrees with the approach this patch took, myself included. I'm still not even convinced that the option this patch introduces is an improvement, even if it can be disabled. Let's please try to find a way to address whatever concerns you have that's agreeable to everyone.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > I find the option to stop reading exceptionally helpful, ...

          Edit log toleration is even better since it allows you to specify the toleration length. If there is a corruption and the corruption length is within toleration length, the namenode could start up as usual.

          Show
          Tsz Wo Nicholas Sze added a comment - > I find the option to stop reading exceptionally helpful, ... Edit log toleration is even better since it allows you to specify the toleration length. If there is a corruption and the corruption length is within toleration length, the namenode could start up as usual.
          Hide
          Suresh Srinivas added a comment - - edited

          Given that, I do think we should revert this commit until everyone's satisfied with the changes it makes.

          I am not sure if that is necessary. I have closely worked with Nicholas on this and we tested this quite heavily. Hence +1 from my side.

          If there are valid concerns about any bugs in the functionality, lets open another jira; I think this functionality is very useful.

          I find the option to stop reading exceptionally helpful

          The patch adds an option to turn off the functionality for this reason. Does that not work for you?

          Show
          Suresh Srinivas added a comment - - edited Given that, I do think we should revert this commit until everyone's satisfied with the changes it makes. I am not sure if that is necessary. I have closely worked with Nicholas on this and we tested this quite heavily. Hence +1 from my side. If there are valid concerns about any bugs in the functionality, lets open another jira; I think this functionality is very useful. I find the option to stop reading exceptionally helpful The patch adds an option to turn off the functionality for this reason. Does that not work for you?
          Hide
          Colin Patrick McCabe added a comment -

          Can we meet and talk about this in person? I think we all have the same goals here, but we need to make sure that we're all on the same page. It's too bad that the HDFS meetup was yesterday rather than later this week, but perhaps we could meet tomorrow or something?

          Show
          Colin Patrick McCabe added a comment - Can we meet and talk about this in person? I think we all have the same goals here, but we need to make sure that we're all on the same page. It's too bad that the HDFS meetup was yesterday rather than later this week, but perhaps we could meet tomorrow or something?
          Hide
          Suresh Srinivas added a comment -

          Can we meet and talk about this in person?

          Certainly we could do that, but is it really necessary? I must be really missing some thing - I see this as straight forward and important change.

          Show
          Suresh Srinivas added a comment - Can we meet and talk about this in person? Certainly we could do that, but is it really necessary? I must be really missing some thing - I see this as straight forward and important change.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > Can we meet and talk about this in person? ...

          Let's keep our discussion on JIRA so that other people could comment on it. I have filed HDFS-3540 for this purpose.

          Show
          Tsz Wo Nicholas Sze added a comment - > Can we meet and talk about this in person? ... Let's keep our discussion on JIRA so that other people could comment on it. I have filed HDFS-3540 for this purpose.
          Hide
          Colin Patrick McCabe added a comment -

          Let's keep our discussion on JIRA so that other people could comment on it. I have filed HDFS-3540 for this purpose.

          Ok. Since I noticed that there are a bunch of people watching this JIRA, but nobody watching HDFS-3540 (yet?), I hope you don't mind if I repost this comment twice. I do apologize for the potential inbox spam.

          I think rather than tolerating a configurable amount of edit log corruption, we should just never tolerate any corruption.

          Basically, what I am saying is that UNCHECKED_REGION_LENGTH was a mistake. It was my mistake, and I don't blame you for trying to fix it. However, the current fix still assumes that there is someone who would want UNCHECKED_REGION_LENGTH. I do not think any people like that exist!

          Does that make sense? I can post a patch if that would be helpful.

          Show
          Colin Patrick McCabe added a comment - Let's keep our discussion on JIRA so that other people could comment on it. I have filed HDFS-3540 for this purpose. Ok. Since I noticed that there are a bunch of people watching this JIRA, but nobody watching HDFS-3540 (yet?), I hope you don't mind if I repost this comment twice. I do apologize for the potential inbox spam. I think rather than tolerating a configurable amount of edit log corruption, we should just never tolerate any corruption. Basically, what I am saying is that UNCHECKED_REGION_LENGTH was a mistake. It was my mistake, and I don't blame you for trying to fix it. However, the current fix still assumes that there is someone who would want UNCHECKED_REGION_LENGTH. I do not think any people like that exist! Does that make sense? I can post a patch if that would be helpful.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          > Ok. Since I noticed that there are a bunch of people watching this JIRA, but nobody watching HDFS-3540 (yet?), I hope you don't mind if I repost this comment twice. I do apologize for the potential inbox spam.

          HDFS-3540 just has be created. If people are interested in the further discussion, they will watch it. If they don't watch the new JIRA and they are not very interested in it, we cannot force them. I indeed mind if anyone keeps posting the same comment on two different JIRAs.

          > ... I do not think any people like that exist!

          As mentioned by Suresh, edit log toleration is a very useful.

          Show
          Tsz Wo Nicholas Sze added a comment - > Ok. Since I noticed that there are a bunch of people watching this JIRA, but nobody watching HDFS-3540 (yet?), I hope you don't mind if I repost this comment twice. I do apologize for the potential inbox spam. HDFS-3540 just has be created. If people are interested in the further discussion, they will watch it. If they don't watch the new JIRA and they are not very interested in it, we cannot force them. I indeed mind if anyone keeps posting the same comment on two different JIRAs. > ... I do not think any people like that exist! As mentioned by Suresh, edit log toleration is a very useful.
          Hide
          Luke Lu added a comment -

          Is this targeted for 1.1? Looks like it's only in branch-1 and not branch-1.1.

          Show
          Luke Lu added a comment - Is this targeted for 1.1? Looks like it's only in branch-1 and not branch-1.1.
          Hide
          Matt Foley added a comment -

          Closed upon release of Hadoop 1.2.0.

          Show
          Matt Foley added a comment - Closed upon release of Hadoop 1.2.0.

            People

            • Assignee:
              Tsz Wo Nicholas Sze
              Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development