Hadoop HDFS

HDFS-679: Appending to a partial chunk incorrectly assumes the first packet fills up the partial chunk

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      When I run TestClientProtocolForPipelineRecovery, I always see the block receiver throw an IOException complaining about a mismatched checksum when receiving the last data packet. It turns out the checksum of the last packet was unexpectedly set to zero.

      Attachments

      1. appendToPartialChunk.patch (14 kB, Hairong Kuang)
      2. appendToPartialChunk1.patch (20 kB, Hairong Kuang)

          Activity

          Hairong Kuang added a comment -

          The problem is caused by incorrect handling of the checksum when appending to a partial chunk. Specifically, the following code in DFSClient#DFSOutputStream#writeChunk:

                  // If this was the first write after reopening a file, then the above
                  // write filled up any partial chunk. Tell the summer to generate full 
                  // crc chunks from now on.
                  if (appendChunk) {
                    appendChunk = false;
                    resetChecksumChunk(bytesPerChecksum);
                  }
          

          The code blindly assumes that the first packet after reopening a file for append fills up the partial chunk. But in the case of hflush, a packet may be sent before the partial chunk is filled up, so resetting the checksum calculator causes the next packet to be sent with a wrong checksum.
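
          As a hedged illustration (not the committed patch; the exact condition and the field bytesCurBlock are assumptions here), the reset can be deferred until the partial chunk is actually full by checking the chunk boundary:

                // Sketch only: switch the checksummer back to full-size
                // chunks only once the write position reaches a chunk
                // boundary, i.e. the partial chunk really has been filled.
                if (appendChunk && bytesCurBlock % bytesPerChecksum == 0) {
                  appendChunk = false;
                  resetChecksumChunk(bytesPerChecksum);
                }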

          Hairong Kuang added a comment -

          It turns out there is also a bug on the DataNode side, which handles appending to a partial checksum chunk incorrectly. Let me try to explain.

          Suppose a file has one byte "a" and is opened for append. The client writes one byte "b" and calls hflush. A packet containing "b" gets pushed to a datanode. The datanode reads "a" and its checksum from disk, validates the on-disk data, recalculates the checksum for "ab", writes "b" to disk, and overwrites the checksum for the chunk. So far so good. However, when the client writes another byte "c" and calls hflush again, the client sends a packet containing "bc", not just "c", to the datanode. When the datanode reads "a" and its checksum from disk, checksum validation fails because the checksum on disk is no longer the checksum of "a" but of "ab". The write fails at the following lines in BlockReceiver#computePartialChunkCrc:

              if (partialCrc.getValue() != FSInputChecker.checksum2long(crcbuf)) {
                String msg = "Partial CRC " + partialCrc.getValue() +
                             " does not match value computed the " +
                             " last time file was closed " +
                             FSInputChecker.checksum2long(crcbuf);
                throw new IOException(msg);
              }
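
          To see the mismatch concretely, here is a small standalone Java sketch (not HDFS code) using java.util.zip.CRC32, the same CRC-32 that HDFS's default checksum computes. After the first hflush the on-disk crc covers "ab", so re-validating the old partial chunk "a" against it fails:

              import java.util.zip.CRC32;

              // Standalone illustration of the datanode-side mismatch.
              public class PartialChunkCrcDemo {
                static long crcOf(String data) {
                  CRC32 crc = new CRC32();
                  crc.update(data.getBytes());
                  return crc.getValue();
                }

                public static void main(String[] args) {
                  // After the first hflush, the on-disk crc covers "ab".
                  long onDiskCrc = crcOf("ab");
                  // On the second hflush the datanode recomputes the crc of
                  // the old partial chunk "a" and compares it to the stored
                  // value, mirroring computePartialChunkCrc.
                  long recomputedCrc = crcOf("a");
                  System.out.println("crc(\"ab\") = " + onDiskCrc);
                  System.out.println("crc(\"a\")  = " + recomputedCrc);
                  // Prints false: the mismatch that triggers the IOException.
                  System.out.println(onDiskCrc == recomputedCrc);
                }
              }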
          
          Hairong Kuang added a comment -

          I'd like to propose a solution to the partial-checksum-chunk append problem. The client reads the partial chunk into its buffer before any write, and those bytes become part of the packets sent to the datanodes until the chunk gets filled up. This solution maintains a nice property: every packet sent to a datanode starts at a checksum chunk boundary. It would not increase complexity on the client side, and it would completely eliminate the handling of partial-chunk appends on the datanode side.

          Another optimization for the block receiver at the datanode: do not overwrite the block file if a packet starts with data that the replica file already has. For the crc file, only the last 4 bytes are allowed to be overwritten.
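
          A minimal sketch of that client-side setup, with hypothetical helper names (this proposal was later dropped; see the next comment for why):

              // Sketch only: on reopening for append, pre-load the trailing
              // partial chunk so every packet starts on a chunk boundary.
              long fileLen = getFileLength();                   // hypothetical
              int partial = (int) (fileLen % bytesPerChecksum);
              if (partial > 0) {
                byte[] buf = new byte[partial];
                // Reading the old bytes back is the step that needs READ
                // permission -- the flaw pointed out in the next comment.
                readFully(fileLen - partial, buf, 0, partial);  // hypothetical
                checksumBuf.write(buf, 0, partial);             // hypothetical
              }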

          Hairong Kuang added a comment -

          The proposal in the above comment is simple, but it has an unresolvable flaw: the client needs READ permission to read the partial chunk, and a writer is not required to have it.

          So instead of switching to a different solution, I will focus on making the existing one work. Here is the plan:
          1. The client does not set appendChunk to false until the first chunk has been sent;
          2. If the datanode receives a request to append "bcd" to a partial chunk "a", but "bc" has already been written to disk by a previous hflush, the datanode will read "abc" from disk, compute the checksum of "abcd", and then write "d" and the new checksum to disk. In the current trunk and 0.21, the datanode mistakenly computes the crc of "abcbcd".

          I will also implement the proposed optimization: the block receiver at the datanode does not overwrite the block file if a packet starts with data that the replica file already has; for the crc file, only the last 4 bytes are allowed to be overwritten. This optimization makes the datanode-side fix easy to implement, as the sketch below shows.
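
          A hedged sketch of that datanode-side handling (the real change is in BlockReceiver.receivePacket(); every name below is illustrative, not taken from the patch):

              // Sketch only. Bytes [firstByteInPacket, onDiskLen) of this
              // packet were already written by an earlier hflush; only the
              // tail of the packet is new.
              long onDiskLen = replica.getBytesOnDisk();         // 3 for "abc"
              long firstByteInPacket = offsetInBlock;            // 1 for "bcd"
              int dup = (int) (onDiskLen - firstByteInPacket);   // 2 for "bc"

              // Recompute the partial-chunk crc over what is really on disk
              // ("abc"), not over the disk data plus the whole packet
              // ("abcbcd" -- the old bug).
              byte[] onDiskChunk = readPartialChunk(onDiskLen);  // hypothetical
              partialCrc.update(onDiskChunk, 0, onDiskChunk.length);

              // Append only the new bytes ("d") to the block file, fold them
              // into the crc, and overwrite just the last 4 bytes of the crc
              // file with the new checksum.
              partialCrc.update(pktBuf, dup, pktLen - dup);
              blockOut.write(pktBuf, dup, pktLen - dup);
              overwriteLastChecksum(partialCrc.getValue());      // hypothetical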

          Hairong Kuang added a comment -

          A patch for review. In addition to the fix proposed above, it includes a unit test that exercises partial-chunk append.

          Suresh Srinivas added a comment -

          1. Update the comment in DFSClient.java to say "If this was the first write after reopening a file, if the above write filled up any partial chunk...".
          2. BlockReceiver.receivePacket() - in the calculation of offsetInChecksum, you may have intended to put parentheses around an expression, not just around onDiskLen.
          3. BlockReceiver.receivePacket() - is reading and checking the pre-existing checksum necessary? What is the impact of this on hflush performance?
          4. TestFileAppend2.testAppendToPartialChunk - add a brief description to the test.
          5. TestFileAppend2.testAppendToPartialChunk - it would be good if the test wrote not only a partial chunk, as it does now, but also did a write that completes a previous partial chunk and then writes another partial chunk after that.
          Hairong Kuang added a comment -

          This patch incorporates comments 1, 2, 4, and 5. Comment 5 was especially helpful: it led to finding another bug in the code. Thanks, Suresh!

          As for comment 3, this does not affect every hflush operation. It takes effect only for the first partial chunk after an application opens a file for append, and it is good for catching on-disk data corruption.

          Suresh Srinivas added a comment -

          +1 for the new patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12422640/appendToPartialChunk1.patch
          against trunk revision 826905.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/42/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/42/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/42/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/42/console

          This message is automatically generated.

          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #47 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/47/)
          HDFS-679. Appending to a partial chunk incorrectly assumes the first packet fills up the partial chunk. Contributed by Hairong Kuang.

          Hairong Kuang added a comment -

          The failed core test was caused by HDFS-690.

          I've committed this.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #79 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/79/)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #120 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/120/)

          Hudson added a comment -

          Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #78 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/78/)

          Gokul added a comment -

          This issue is present in the 0.20 append branch as well. Is there a patch for 0.20 append?

          luoli added a comment -

          Hairong, I noticed that this problem also exists in 0.20 and 0.20.205. Is there a patch for those versions? The patch in this jira is based on 0.21, which is quite different from our version.


            People

            • Assignee: Hairong Kuang
            • Reporter: Hairong Kuang
            • Votes: 0
            • Watchers: 3
