Hadoop HDFS / HDFS-3721

hsync support broke wire compatibility

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: 2.0.2-alpha
    • Component/s: datanode, hdfs-client
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      HDFS-744 added support for hsync to the data transfer wire protocol. However, it actually broke wire compatibility: if the client has hsync support but the server does not, the client cannot read or write data on the old cluster.
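
      For context, applications reach hsync through FSDataOutputStream (the Syncable interface). Below is a minimal client-side sketch of the call that triggers the new flag on the wire; it assumes a reachable HDFS cluster, default client configuration, and a hypothetical demo path:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class HsyncDemo {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              try (FileSystem fs = FileSystem.get(conf);
                   FSDataOutputStream out = fs.create(new Path("/tmp/hsync-demo"))) {
                out.writeBytes("some data");
                // hsync() asks the datanode pipeline to sync the data to disk; after
                // HDFS-744 this is what sets the syncBlock flag in the packet header.
                out.hsync();
              }
            }
          }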

      Attachments

      1. hdfs-3721.txt
        53 kB
        Todd Lipcon
      2. hdfs-3721.txt
        58 kB
        Aaron T. Myers
      3. hdfs-3721.txt
        58 kB
        Aaron T. Myers

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #2569 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2569/)
          HDFS-3721. hsync support broke wire compatibility. Contributed by Todd Lipcon and Aaron T. Myers. (Revision 1371495)

          Result = SUCCESS
          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1371495
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader2.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PacketHeader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PacketReceiver.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2634 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2634/)
          HDFS-3721. hsync support broke wire compatibility. Contributed by Todd Lipcon and Aaron T. Myers. (Revision 1371495)

          Result = SUCCESS
          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1371495
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader2.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PacketHeader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PacketReceiver.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #2589 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2589/)
          HDFS-3721. hsync support broke wire compatibility. Contributed by Todd Lipcon and Aaron T. Myers. (Revision 1371495)

          Result = FAILURE
          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1371495
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader2.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PacketHeader.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PacketReceiver.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java
          Aaron T. Myers added a comment -

          I've just committed this to trunk and branch-2.

          Thanks a lot for the contribution, Todd, and thanks to everyone else for the discussion of this issue.

          Aaron T. Myers added a comment -

          Here's a patch that's rebased on trunk.

          Since I haven't heard anything for a few days, I'm going to go ahead and commit this.

          Aaron T. Myers added a comment -

          Thanks a lot for verifying that this patch introduces no performance regressions, Todd.

          Since I've reviewed Todd's original patch and addressed Suresh's feedback, and since Todd has reviewed the functional difference between his original patch and the one I posted, I'd like to go ahead and commit this patch. Unless I hear something in the next day or so, I'm going to go ahead and commit this, so please post your feedback promptly if you intend to review it.

          Todd Lipcon added a comment -

          +1 to your fixups of my patch. Thanks Aaron for taking this over.

          I also wanted to double-check the performance, so I made some modifications to TestMultithreadedHFlush to calculate latency percentiles. Here are the results I got:

          Before patch (trunk):
          
          Latency quantiles (in microseconds):
          50.00 %ile +/- 5.00%: 531
          75.00 %ile +/- 2.50%: 669
          90.00 %ile +/- 1.00%: 960
          95.00 %ile +/- 0.50%: 1273
          99.00 %ile +/- 0.10%: 2822
          
          My earlier broken patch:
          Latency quantiles (in microseconds):
          50.00 %ile +/- 5.00%: 40090
          75.00 %ile +/- 2.50%: 40921
          90.00 %ile +/- 1.00%: 42837
          95.00 %ile +/- 0.50%: 44027
          99.00 %ile +/- 0.10%: 49937
          
          Aaron's patch:
          Latency quantiles (in microseconds):
          50.00 %ile +/- 5.00%: 556
          75.00 %ile +/- 2.50%: 697
          90.00 %ile +/- 1.00%: 982
          95.00 %ile +/- 0.50%: 1303
          99.00 %ile +/- 0.10%: 2789
          

          In other words, the performance after this patch is comparable to the performance before, and indeed the performance of my earlier patch was a mess due to the 40ms nagling delay. This patch is also going to help performance in the future since it gets us a lot closer to the direct-buf based write path by simplifying the code.

          I'll upload the modifications to TestMultithreadedHFlush as a new JIRA, since it was useful in verifying this performance.
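
          (As a rough illustration only: percentiles like the above can be computed from recorded per-call hflush latencies roughly as below. This is a hypothetical helper, not the actual TestMultithreadedHFlush change.)

          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.List;

          class LatencyQuantiles {
            // Nearest-rank percentile over a sorted list of latencies in microseconds.
            static long percentile(List<Long> sortedMicros, double pct) {
              int idx = (int) Math.ceil(pct / 100.0 * sortedMicros.size()) - 1;
              return sortedMicros.get(Math.max(0, idx));
            }

            static void report(List<Long> micros) {
              List<Long> sorted = new ArrayList<>(micros);
              Collections.sort(sorted);
              for (double pct : new double[] {50, 75, 90, 95, 99}) {
                System.out.printf("%.2f %%ile: %d%n", pct, percentile(sorted, pct));
              }
            }
          }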

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12539159/hdfs-3721.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2956//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2956//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          Todd's been busy with some other stuff so I'm going to take over working on this issue.

          Attached is an updated patch which addresses Suresh's and my feedback.

          During testing, I also discovered that Todd's original patch introduced some unfortunate interaction with Nagle's algorithm in TCP for small packets during writing. This was the cause of the TestFileConcurrentReader failure. So, in addition to addressing the feedback, this patch also changes DFSOutputStream.Packet#writeTo back to only ever call OutputStream#write once, by ensuring that the packet header, checksum data, and actual data are all in a single contiguous buffer, similarly to how it was done before this patch.

          The functional difference between this patch and the last is the following, to make review easier:

          commit 61fe911e60fe4db710813fc97a209456722ef1f7
          Author: Aaron T. Myers <atm@cloudera.com>
          Date:   Sat Aug 4 12:32:11 2012 -0700
          
              Send one buffer when writing to avoid nagling.
          
          diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
          index e460a3c..4582712 100644
          --- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
          +++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
          @@ -147,10 +147,17 @@ public class DFSOutputStream extends FSOutputSummer implements Syncable {
                * buf is pointed into like follows:
                *  (C is checksum data, D is payload data)
                *
          -     * [CCCCCCCCC________________DDDDDDDDDDDDDDDD___]
          -     *           ^               ^               ^
          -     *           checksumPos     dataStart       dataPos
          +     * [_________CCCCCCCCC________________DDDDDDDDDDDDDDDD___]
          +     *           ^        ^               ^               ^
          +     *           |        checksumPos     dataStart       dataPos
          +     *           checksumStart
          +     * 
          +     * Right before sending, we move the checksum data to immediately precede
          +     * the actual data, and then insert the header into the buffer immediately
          +     * preceding the checksum data, so we make sure to keep enough space in
          +     * front of the checksum data to support the largest conceivable header. 
                */
          +    int checksumStart;
               int checksumPos;
               int dataStart;
               int dataPos;
          @@ -166,9 +173,9 @@ public class DFSOutputStream extends FSOutputSummer implements Syncable {
                 this.offsetInBlock = 0;
                 this.seqno = HEART_BEAT_SEQNO;
                 
          -      buf = new byte[0];
          +      buf = new byte[PacketHeader.PKT_MAX_HEADER_LEN];
                 
          -      checksumPos = dataPos = dataStart = 0;
          +      checksumStart = checksumPos = dataPos = dataStart = PacketHeader.PKT_MAX_HEADER_LEN;
                 maxChunks = 0;
               }
               
          @@ -180,10 +187,11 @@ public class DFSOutputStream extends FSOutputSummer implements Syncable {
                 this.seqno = currentSeqno;
                 currentSeqno++;
                 
          -      buf = new byte[pktSize];
          +      buf = new byte[PacketHeader.PKT_MAX_HEADER_LEN + pktSize];
                 
          -      checksumPos = 0;
          -      dataStart = chunksPerPkt * checksum.getChecksumSize();
          +      checksumStart = PacketHeader.PKT_MAX_HEADER_LEN;
          +      checksumPos = checksumStart;
          +      dataStart = checksumStart + (chunksPerPkt * checksum.getChecksumSize());
                 dataPos = dataStart;
                 maxChunks = chunksPerPkt;
               }
          @@ -205,19 +213,38 @@ public class DFSOutputStream extends FSOutputSummer implements Syncable {
               }
               
               /**
          -     * Returns ByteBuffer that contains one full packet, including header.
          +     * Write the full packet, including the header, to the given output stream.
                */
               void writeTo(DataOutputStream stm) throws IOException {
          -      int dataLen = dataPos - dataStart;
          -      int checksumLen = checksumPos;
          -      int pktLen = HdfsConstants.BYTES_IN_INTEGER + dataLen + checksumLen;
          +      final int dataLen = dataPos - dataStart;
          +      final int checksumLen = checksumPos - checksumStart;
          +      final int pktLen = HdfsConstants.BYTES_IN_INTEGER + dataLen + checksumLen;
           
                 PacketHeader header = new PacketHeader(
                   pktLen, offsetInBlock, seqno, lastPacketInBlock, dataLen, syncBlock);
                 
          -      header.write(stm);
          -      stm.write(buf, 0, checksumLen);
          -      stm.write(buf, dataStart, dataLen);
          +      if (checksumPos != dataStart) {
          +        // Move the checksum to cover the gap. This can happen for the last
          +        // packet or during an hflush/hsync call.
          +        System.arraycopy(buf, checksumStart, buf, 
          +                         dataStart - checksumLen , checksumLen); 
          +        checksumPos = dataStart;
          +        checksumStart = checksumPos - checksumLen;
          +      }
          +      
          +      final int headerStart = checksumStart - header.getSerializedSize();
          +      assert checksumStart + 1 >= header.getSerializedSize();
          +      assert checksumPos == dataStart;
          +      assert headerStart >= 0;
          +      assert headerStart + header.getSerializedSize() == checksumStart;
          +      
          +      // Copy the header data into the buffer immediately preceding the checksum
          +      // data.
          +      System.arraycopy(header.getBytes(), 0, buf, headerStart,
          +          header.getSerializedSize());
          +      
          +      // Write the now contiguous full packet to the output stream.
          +      stm.write(buf, headerStart, header.getSerializedSize() + checksumLen + dataLen);
               }
               
               // get the packet's last byte's offset in the block
          diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PacketHeader.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PacketHeader.java
          index 1709101..dd56962 100644
          --- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PacketHeader.java
          +++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PacketHeader.java
          @@ -166,6 +166,12 @@ public class PacketHeader {
               out.writeShort(proto.getSerializedSize());
               proto.writeTo(out);
             }
          +  
          +  public byte[] getBytes() {
          +    ByteBuffer buf = ByteBuffer.allocate(getSerializedSize());
          +    putInBuffer(buf);
          +    return buf.array();
          +  }
           
             /**
              * Perform a sanity check on the packet, returning true if it is sane.
          
          Suresh Srinivas added a comment -

          Also there are bunch of empty line additions in the patch that could be removed (in DFSOutputStream.java, RemoteBlockReader2.java etc.).

          Suresh Srinivas added a comment - edited

          Todd, I meant to review this. But the code refactoring, even though it is a good idea, has made the review difficult. Given that I may not be able to get through my code review in a short period of time, here are the comments I had accumulated based on my review so far:

          1. DFSOutputStream.java
            • Packet constructor: can you please add javadoc (especially to describe pktSize)?
            • The math in computePacketChunkSize seems correct, but it results in a different value from the previous code.
          2. PacketReceiver.java
            • Make #bufferPool final
          3. PacketHeader.java
            • Builder import not used
            • Please add javadoc on PacketHeader structure
            • BlockSender.java javadoc could just point to the javadoc of PacketHeader for header information. We have this in multiple places.
          Aaron T. Myers added a comment -

          The patch looks pretty good to me, and I agree it's a good refactor to make RemoteBlockReader and RemoteBlockReader2 share some code.

          One tiny nit on the code, it looks like you have some extra whitespace here:

          if ( checksumBuf.capacity() != checksumLen) {
          

          It looks to me like the TestFileConcurrentReader failure is due to this patch. I can't recall that test being flaky, and at least on my box the test passes without this patch, but fails with it applied.

          Lars Hofhansl added a comment -

          Saw Todd's comment on HDFS-744 (moving that change to 2.0.x is not an option).
          I had assumed that HDFS-744 would be in 2.0.x (in which case there would have been no compatibility issues).

          Had a quick look through the patch here. Looks good as far as I can tell, I'll take more detailed look later today.

          Lars Hofhansl added a comment -

          Oh this is a 2.1.x vs 2.0.x issue...?
          The HDFS-744 patch is smaller than this patch, could we port that to 2.0.x?

          Lars Hofhansl added a comment -

          What sort of wire-compatibility are we talking about? Both trunk and 2.0 have the hsync code. What sort of old cluster would not have this? Does the 2.x.x client support communicating with a 1.x.x cluster?

          Apologies for introducing this with my patch in HDFS-744; I had assumed protobuf would take care of it.

          Todd Lipcon added a comment -

          > Todd, I will review this in a couple of days. Looking at the issue briefly, this is only an issue with DataTransferProtocol, due to packet header length. It should not happen in RPC, right? It would be good to review the RPC code once more to confirm it.

          I think RPC is basically sound. I've done diffs of the protobufs between trunk and 2.0 and all the changes look compatible. This was just a special case because, when I did the original PB-ification, I took the simpler route of maintaining fixed-length headers. Then I overlooked the bit of HDFS-744 which made the header non-fixed-width (oops).

          > In a related note, we should think about what backward/forward compatibility for an alpha release means and how to handle it when we find issues.

          In that case, I'd also like to consider whether we can promote the HDFS side of 2.0.x to "beta". We've seen very few issues after extensive testing. I don't want the alpha state of MR2 to hold back stability classification of HDFS, which is important for those users who aren't using MR2. But that's probably better suited for a mailing list discussion.

          Suresh Srinivas added a comment -

          Todd, I will review this in a couple of days. Looking at the issue briefly, this is only an issue with DataTransferProtocol, due to packet header length. It should not happen in RPC, right? It would be good to review the RPC code once more to confirm it.

          In a related note, we should think about what backward/forward compatibility for an alpha release means and how to handle it when we find issues.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12537798/hdfs-3721.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestFileConcurrentReader

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2902//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2902//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          (I should also note that, even with this patch, a 2.0 client can still talk to a trunk cluster, since the server now properly handles variable-length headers)
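
          (For readers following along: "variable-length headers" here refers to the data transfer packet layout of a 4-byte packet length, a 2-byte header length, and then the serialized PacketHeaderProto. A hypothetical receiving-side sketch, not the actual PacketReceiver code, might look like:)

          import java.io.DataInputStream;
          import java.io.IOException;

          class VariableLengthHeaderSketch {
            static void readPacket(DataInputStream in) throws IOException {
              int packetLen = in.readInt();      // 4 + checksum bytes + data bytes
              short headerLen = in.readShort();  // serialized proto size, no longer assumed fixed
              byte[] headerBytes = new byte[headerLen];
              in.readFully(headerBytes);
              // PacketHeaderProto.parseFrom(headerBytes) would be decoded here; the
              // remaining (packetLen - 4) bytes are the checksums followed by the data.
            }
          }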

          Todd Lipcon added a comment -

          This patch fixes the issue as follows:

          • Refactors out the packet-reading code from BlockReceiver and RemoteBlockReader2 into a new PacketReceiver class. This really simplified BlockReceiver in particular, and has the nice side effect of getting us significantly closer to HDFS-3529.
          • All places where we used to assume a fixed-length packet header now support variable length. In some cases this is achieved by allocating a larger-than-necessary buffer and then, once we know the size for the header, putting it at the right spot to make the header contiguous with the data. In other cases, this is achieved by simply separating the buffer containing the header from the buffer containing the data.
          • Regarding the issue above with 2.1 clients writing to 2.0 servers, I fixed the issue by having the PacketHeader class not set any value for the syncBlock flag when it is false. That means that, so long as the new hsync functionality isn't used, new clients can still talk to 2.0 servers. If hsync is used, an error will occur. It's slightly unfortunate, but given that the 2.0 branch is pretty new and this is a new feature, I think this is acceptable (see the sketch below).

          I manually tested a trunk client both reading and writing from a 2.0.0-alpha cluster with this patch applied.
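
          (A minimal sketch of the last point above, assuming the generated PacketHeaderProto builder quoted elsewhere in this issue is imported; this mirrors the idea rather than the committed code:)

          static PacketHeaderProto buildHeader(long offsetInBlock, long seqno,
              boolean lastPacketInBlock, int dataLen, boolean syncBlock) {
            PacketHeaderProto.Builder builder = PacketHeaderProto.newBuilder()
                .setOffsetInBlock(offsetInBlock)
                .setSeqno(seqno)
                .setLastPacketInBlock(lastPacketInBlock)
                .setDataLen(dataLen);
            if (syncBlock) {
              // Only set the optional flag when it is actually needed, so packets written
              // without hsync keep the pre-HDFS-744 header length and remain parseable
              // by 2.0.0-alpha datanodes.
              builder.setSyncBlock(true);
            }
            return builder.build();
          }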

          Todd Lipcon added a comment -

          I have a fix for this that rewrites BlockReceiver and RemoteBlockReader2 to handle variable-length packet headers. Unfortunately, since it modifies the server side, it still doesn't allow a new client to write to an old server (just read).

          I'm trying to think of a creative way to fix this issue – perhaps adding a flag in the response to "writeBlock" which indicates supportsVariableLengthPacketHeader. If the flag isn't set, then hsync wouldn't be supported on this stream, and it would fall back to the old fixed format header.

          Todd Lipcon added a comment -

          This also affects the read path, since it uses the same packet header and likewise assumes it is fixed-size.

          Todd Lipcon added a comment -

          The issue here is the following code:

            /** Header size for a packet */
            private static final int PROTO_SIZE = 
              PacketHeaderProto.newBuilder()
                .setOffsetInBlock(0)
                .setSeqno(0)
                .setLastPacketInBlock(false)
                .setDataLen(0)
                .setSyncBlock(false)
                .build().getSerializedSize();
            public static final int PKT_HEADER_LEN =
              6 + PROTO_SIZE;
          

          Since the new syncBlock flag is optional, this caused the packet header to become variable-length depending on whether the client is post-hsync or not. This screws up the datanode, resulting in an exception:

          12/07/24 13:55:45 INFO datanode.DataNode: Exception in receiveBlock for BP-2093170007-127.0.0.1-1342943513882:blk_3332306339985613438_1008
          java.io.IOException: Data remaining in packet does not matchsum of checksumLen and dataLen  size remaining: 22 data len: 20 checksum Len: 4
                  at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:595)
                  at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:532)
                  at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:748)
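
          (Why the optional field changes the length: in proto2, explicitly setting an optional field, even to its default value, causes it to be serialized. A small hypothetical fragment using the same builder as the snippet above:)

          PacketHeaderProto withoutSync = PacketHeaderProto.newBuilder()
              .setOffsetInBlock(0).setSeqno(0)
              .setLastPacketInBlock(false).setDataLen(0)
              .build();
          PacketHeaderProto withSync = withoutSync.toBuilder()
              .setSyncBlock(false)  // explicitly set, so the field is encoded anyway
              .build();
          // withSync.getSerializedSize() is larger than withoutSync.getSerializedSize(),
          // so a PKT_HEADER_LEN computed under one assumption does not match packets
          // built under the other.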
          

            People

            • Assignee:
              Aaron T. Myers
              Reporter:
              Todd Lipcon
            • Votes:
              0
              Watchers:
              14

              Dates

              • Created:
                Updated:
                Resolved:

                Development