Hadoop HDFS
HDFS-3479

backport HDFS-3335 (check for edit log corruption at the end of the log) to branch-1

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.2.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      backport HDFS-3335 (check for edit log corruption at the end of the log) to branch-1

      Attachments

      1. HDFS-3479-b1.003.patch (10 kB, Colin Patrick McCabe)
      2. HDFS-3479-b1.002.patch (10 kB, Colin Patrick McCabe)
      3. HDFS-3335-b1.005.patch (11 kB, Colin Patrick McCabe)


          Activity

          Colin Patrick McCabe added a comment -
          • fix the way we do pre-allocation (formerly, we were creating sparse files in branch-1, not actually reserving the space we thought we were); see the sketch after this list.
          • an OP_INVALID is now an error unless:
            1. it appears in the last 2 MB of the file
            or
            2. it is followed only by 0x0 and 0xff bytes, until the end of the file.
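
          As an illustration of the first bullet, a minimal sketch of non-sparse preallocation, assuming a hypothetical preallocate helper (only the 0xff fill and the 1 MB preallocation length come from this issue):

          import java.io.IOException;
          import java.io.RandomAccessFile;
          import java.util.Arrays;

          class PreallocationSketch {
            // 1 MB, matching the preallocation length discussed below.
            static final int PREALLOCATION_LENGTH = 1024 * 1024;

            // Reserve space at the end of the edit log by writing real fill
            // bytes. Merely extending the file length (e.g. with setLength)
            // creates a sparse file on most filesystems: no blocks are
            // allocated, so a later write can still fail when the disk is full.
            static void preallocate(RandomAccessFile editLog) throws IOException {
              byte[] fill = new byte[PREALLOCATION_LENGTH];
              Arrays.fill(fill, (byte) 0xff); // 0xff is the OP_INVALID filler
              editLog.seek(editLog.length());
              editLog.write(fill); // write(byte[]) writes every byte or throws
            }
          }
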
          Colin Patrick McCabe added a comment -

          backport

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530264/HDFS-3335-b1.005.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2548//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          Why is there an UNCHECKED_REGION_LENGTH?

          Colin Patrick McCabe added a comment -

          Hi Nicholas,

          In the patch it says "We don't check the last two megabytes of the edit log, in case the NameNode crashed while writing to the edit log."

          Basically, if we crash while writing to the end of the log, the underlying filesystem does not give us the guarantees we would need to get every byte perfect. Consider the following sequence of events:

          1. NN allocates an extra 1 MB at the end of the file and fills it with 0xff bytes
          2. NN writes an opcode to the edit log file. It happens to span two sectors on the hard disk
          3. The kernel writes the second half of the opcode to disk, but not the first half
          4. system crash

          In this case, we're left with a file that looks like this:

          0xff 0xff 0xff 0xff ... [opcode bytes]... 0xff 0xff 0xff
          

          This would clearly fail validation. Hence the NameNode would fail to start, even though no data has been lost (the opcode was never acked to the client). This would be a serious problem. UNCHECKED_REGION_LENGTH fixes this problem.

          We can't control the order in which the kernel flushes sectors out of the buffer cache and on to the hard disk. We can set up barriers (that is what fsync is), but control of the ordering is beyond us.
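
          To make the check concrete, here is a rough sketch; verifyEndOfLog is a method name mentioned later in this thread, but this signature and logic are assumptions, not the actual patch:

          import java.io.DataInputStream;
          import java.io.IOException;

          class EndOfLogCheckSketch {
            // Bytes this close to the end of the log are not checked, since a
            // crash mid-write can leave garbage there. (2 MB here; reduced to
            // the 1 MB preallocation length later in this thread.)
            static final long UNCHECKED_REGION_LENGTH = 2 * 1024 * 1024;

            // After the last valid opcode, an OP_INVALID byte is acceptable
            // only if it falls inside the unchecked region, or if everything
            // after it is 0x00 or 0xff filler.
            static void verifyEndOfLog(DataInputStream in, long opInvalidPos,
                long fileLength) throws IOException {
              if (opInvalidPos >= fileLength - UNCHECKED_REGION_LENGTH) {
                return; // inside the unchecked tail: assume a crash mid-write
              }
              byte[] buf = new byte[4096];
              long remaining = fileLength - opInvalidPos;
              while (remaining > 0) {
                int toRead = (int) Math.min(buf.length, remaining);
                in.readFully(buf, 0, toRead); // readFully loops over short reads
                for (int i = 0; i < toRead; i++) {
                  if (buf[i] != 0x00 && buf[i] != (byte) 0xff) {
                    throw new IOException("corrupt edit log: unexpected byte 0x"
                        + Integer.toHexString(buf[i] & 0xff) + " after OP_INVALID");
                  }
                }
                remaining -= toRead;
              }
            }
          }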

          Tsz Wo Nicholas Sze added a comment -

          Why is UNCHECKED_REGION_LENGTH 2MB rather than the preallocation length (1MB)?

          Colin Patrick McCabe added a comment -

          That's a good point. It should be 1MB, since we're never writing more than that to disk in any single call to FSEditLog#flushAndSync in branch-1.

          Colin Patrick McCabe added a comment -
          • reduce the unchecked region to 1MB.
          • add a PREALLOCATION_LENGTH constant rather than hard-coding the value
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530479/HDFS-3479-b1.002.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2561//console

          This message is automatically generated.

          Todd Lipcon added a comment -
          +      int numRead = in.read(buf, 0, (int)amt);
          +      if (numRead < amt) {
          

          read isn't guaranteed to return all bytes requested, even if there are more bytes in the file. So I don't think this check is necessary; instead, just check numRead >= 0, I think.

          Otherwise looks good to me.
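
          Todd's point in standalone form: InputStream#read may return fewer bytes than requested even before end of file, so a caller that needs the full count has to loop and treat only a negative return as end of stream. A generic sketch (the helper name is illustrative):

          import java.io.IOException;
          import java.io.InputStream;

          class ShortReadSketch {
            // Read up to len bytes, looping over short reads. The result is
            // less than len only if end of stream is reached first; a single
            // short read() is not an error by itself.
            static int readLoop(InputStream in, byte[] buf, int len)
                throws IOException {
              int total = 0;
              while (total < len) {
                int numRead = in.read(buf, total, len - total);
                if (numRead < 0) {
                  break; // end of stream
                }
                total += numRead;
              }
              return total;
            }
          }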

          Colin Patrick McCabe added a comment -

          Eek, looks like you're right. From DataInputStream#read(byte b[], int off, int len) throws IOException:

          * Reads up to <code>len</code> bytes of data from the contained 
          * input stream into an array of bytes.  An attempt is made to read 
          * as many as <code>len</code> bytes, but a smaller number may be read, 
          * possibly zero. The number of bytes actually read is returned as an 
          * integer.
          
          Colin Patrick McCabe added a comment -
          • verifyEndOfLog: handle short reads
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530615/HDFS-3479-b1.003.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2574//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          ran tests TestCheckpoint \
          TestEditLog \
          TestNameNodeRecovery \
          TestEditLogLoading \
          TestNameNodeMXBean \
          TestSaveNamespace \
          TestSecurityTokenEditLog \
          TestStorageDirectoryFailure \
          TestStorageRestore

          Tsz Wo Nicholas Sze added a comment -

          Please run all the tests and also test-patch. That is what Jenkins runs.

          Colin Patrick McCabe added a comment -

          I ran all the tests, and also test-patch. There were no issues. We should be ready to commit, thanks everyone.

          Todd Lipcon added a comment -

          Patch looks good to me. +1. Any further comments, Nicholas? If not, I'll commit later today.

          Tsz Wo Nicholas Sze added a comment -
          +        int written = fc.write(fill, position);
          

          written could be less than fill.capacity(). Is it a problem?

          Colin Patrick McCabe added a comment -

          written could be less than fill.capacity(). Is it a problem?

          I have an upcoming change, HDFS-3510, which fixes this. I think it makes sense to submit this patch as-is and work on the FileChannel#write issues separately; it should make reviews easier and quicker than combining the changes. Does that make sense?
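
          For reference, the general shape of the fix being deferred to HDFS-3510: FileChannel#write returns how many bytes it actually wrote, which can be fewer than the buffer holds, so writing a whole buffer takes a loop. A sketch, not the actual HDFS-3510 change:

          import java.io.IOException;
          import java.nio.ByteBuffer;
          import java.nio.channels.FileChannel;

          class WriteFullySketch {
            // Write the entire buffer at the given position, looping over
            // partial writes until the buffer is drained.
            static void writeFully(FileChannel fc, ByteBuffer fill, long position)
                throws IOException {
              while (fill.hasRemaining()) {
                int written = fc.write(fill, position); // may be a partial write
                position += written;
              }
            }
          }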

          Tsz Wo Nicholas Sze added a comment -

          I am okay if you are going to fix it in HDFS-3510. Thanks.

          Eli Collins added a comment -

          I've committed this to branch-1. Thanks Colin.

          Tsz Wo Nicholas Sze added a comment -

          One more question: why does the trunk patch not have the UNCHECKED_REGION_LENGTH stuff?

          Tsz Wo Nicholas Sze added a comment -

          BTW, I have created HDFS-3521 for further improvement.

          Colin Patrick McCabe added a comment -

          One more question: why does the trunk patch not have the UNCHECKED_REGION_LENGTH stuff?

          trunk handles this differently, using recovery mode to ensure that we don't stop reading the edit log early. This is possible because recovery mode is a lot more advanced in trunk than in branch-1.

          Matt Foley added a comment -

          Closed upon release of Hadoop 1.2.0.


            People

            • Assignee: Colin Patrick McCabe
            • Reporter: Colin Patrick McCabe
            • Votes: 0
            • Watchers: 8

