Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-1001

DataXceiver and BlockReader disagree on when to send/recv CHECKSUM_OK

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0, 0.23.0
    • Component/s: datanode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Running the TestPread with additional debug statements reveals that the BlockReader sends CHECKSUM_OK when the DataXceiver doesn't expect it. Currently it doesn't matter since DataXceiver closes the connection after each op, and CHECKSUM_OK is the last thing on the wire. But if we want to cache connections, they need to agree on the exchange of CHECKSUM_OK.

      1. HDFS-1001-6.patch
        17 kB
        Eli Collins
      2. HDFS-1001-5.patch
        17 kB
        Eli Collins
      3. HDFS-1001-4.patch
        17 kB
        bc Wong
      4. HDFS-1001-3.patch
        17 kB
        bc Wong
      5. HDFS-1001-3.patch
        17 kB
        bc Wong
      6. HDFS-1001-2.patch
        17 kB
        bc Wong
      7. HDFS-1001-rebased.patch
        9 kB
        bc Wong
      8. HDFS-1001.patch.1
        9 kB
        bc Wong
      9. HDFS-1001.patch
        9 kB
        bc Wong

        Issue Links

          Activity

          Hide
          Todd Lipcon added a comment -

          This is a bug on the datanode side, right? The tests from HDFS-877 at least ensure that CHECKSUM_OK works properly where it's supposed to.

          Show
          Todd Lipcon added a comment - This is a bug on the datanode side, right? The tests from HDFS-877 at least ensure that CHECKSUM_OK works properly where it's supposed to.
          Hide
          bc Wong added a comment -

          Yes, Todd. The client sends CHECKSUM_OK when it reads to the end of the data. The DN expects CHECKSUM_OK only when it sends the entire block.

          It's my goal that either side can independently decide whether a CHECKSUM_OK is supposed to be present. (A good protocol should do that.) So I propose:
          1. The client keeps its current behaviour, sending CHECKSUM_OK when it reads to the end of the data.
          2. The DN, after sending all the data, always expects a CHECKSUM_OK. If it's not there, it will close the client connection.
          3. Point #2 does not matter currently, since the DN always closes the client connection. But HDFS-941 may change that.

          I can't think of a good way to write a test for this. Any suggestions please?

          Show
          bc Wong added a comment - Yes, Todd. The client sends CHECKSUM_OK when it reads to the end of the data. The DN expects CHECKSUM_OK only when it sends the entire block. It's my goal that either side can independently decide whether a CHECKSUM_OK is supposed to be present. (A good protocol should do that.) So I propose: 1. The client keeps its current behaviour, sending CHECKSUM_OK when it reads to the end of the data. 2. The DN, after sending all the data, always expects a CHECKSUM_OK. If it's not there, it will close the client connection. 3. Point #2 does not matter currently, since the DN always closes the client connection. But HDFS-941 may change that. I can't think of a good way to write a test for this. Any suggestions please?
          Hide
          Todd Lipcon added a comment -

          Testing this is a bit difficult. I think the one very important test is the negative test - ie that we don't end up calling datanode.blockScanner.verifiedByClient() for a partial read.

          I think you could add this as another test case in TestClientBlockVerification - just like testCompletePartialRead, except also replace the datanode.blockScanner with a spy and verify that verifiedByClient is not called for the partial read.

          Show
          Todd Lipcon added a comment - Testing this is a bit difficult. I think the one very important test is the negative test - ie that we don't end up calling datanode.blockScanner.verifiedByClient() for a partial read. I think you could add this as another test case in TestClientBlockVerification - just like testCompletePartialRead, except also replace the datanode.blockScanner with a spy and verify that verifiedByClient is not called for the partial read.
          Hide
          bc Wong added a comment -

          I have a patch. But I can't assign this issue to myself. Could someone please fix Jira to let me work on it? Thanks.

          Show
          bc Wong added a comment - I have a patch. But I can't assign this issue to myself. Could someone please fix Jira to let me work on it? Thanks.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org
          against trunk revision 916902.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/257/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org against trunk revision 916902. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/257/console This message is automatically generated.
          Hide
          bc Wong added a comment -

          Updated documentation on the semantics of CHECKSUM_OK. Code fixes and added a new test for negative block verification, as Todd suggested.

          Show
          bc Wong added a comment - Updated documentation on the semantics of CHECKSUM_OK. Code fixes and added a new test for negative block verification, as Todd suggested.
          Hide
          Todd Lipcon added a comment -

          Hey bc,
          I think there may be a small error in the test. In testCompletePartialRead, you spy on the blockScanner, but don't assign the spy back into this.getDataNode() - thus the verifiedByClient call won't go through the spy object and the test will always pass (at least by my understanding of how mockito works). I think you need to do:

          DataBlockScanner scanner = spy(this.getDataNode().blockScanner);
          this.getDatanode().blockScanner = scanner;
          ...
          

          Can you verify that the test fails properly if the change to the code is reverted?

          Show
          Todd Lipcon added a comment - Hey bc, I think there may be a small error in the test. In testCompletePartialRead, you spy on the blockScanner, but don't assign the spy back into this.getDataNode() - thus the verifiedByClient call won't go through the spy object and the test will always pass (at least by my understanding of how mockito works). I think you need to do: DataBlockScanner scanner = spy( this .getDataNode().blockScanner); this .getDatanode().blockScanner = scanner; ... Can you verify that the test fails properly if the change to the code is reverted?
          Hide
          bc Wong added a comment -

          > Can you verify that the test fails properly if the change to the code is reverted?

          The test fails if I verify the opposite.

          Testcase: testCompletePartialRead took 0.609 sec
                  Caused an ERROR
          
          Wanted but not invoked:
          dataBlockScanner.verifiedByClient(
              isA(org.apache.hadoop.hdfs.protocol.Block)
          );
          -> at org.apache.hadoop.hdfs.server.datanode.TestDataXceiver.testCompletePartialRead(TestDataXceiver.java:146)
          Actually, there were zero interactions with this mock.
          
          Wanted but not invoked:
          dataBlockScanner.verifiedByClient(
              isA(org.apache.hadoop.hdfs.protocol.Block)
          );
          -> at org.apache.hadoop.hdfs.server.datanode.TestDataXceiver.testCompletePartialRead(TestDataXceiver.java:146)
          Actually, there were zero interactions with this mock.
          
                  at org.apache.hadoop.hdfs.server.datanode.TestDataXceiver.testCompletePartialRead(TestDataXceiver.java:146)
          

          According to Mockito documentation, they don't recommend operating on the spy object directly. Also see Q1 here:
          <http://mockito.googlecode.com/svn/branches/1.6/javadoc/org/mockito/Mockito.html>

          Show
          bc Wong added a comment - > Can you verify that the test fails properly if the change to the code is reverted? The test fails if I verify the opposite. Testcase: testCompletePartialRead took 0.609 sec Caused an ERROR Wanted but not invoked: dataBlockScanner.verifiedByClient( isA(org.apache.hadoop.hdfs.protocol.Block) ); -> at org.apache.hadoop.hdfs.server.datanode.TestDataXceiver.testCompletePartialRead(TestDataXceiver.java:146) Actually, there were zero interactions with this mock. Wanted but not invoked: dataBlockScanner.verifiedByClient( isA(org.apache.hadoop.hdfs.protocol.Block) ); -> at org.apache.hadoop.hdfs.server.datanode.TestDataXceiver.testCompletePartialRead(TestDataXceiver.java:146) Actually, there were zero interactions with this mock. at org.apache.hadoop.hdfs.server.datanode.TestDataXceiver.testCompletePartialRead(TestDataXceiver.java:146) According to Mockito documentation, they don't recommend operating on the spy object directly. Also see Q1 here: < http://mockito.googlecode.com/svn/branches/1.6/javadoc/org/mockito/Mockito.html >
          Hide
          Todd Lipcon added a comment -

          The test fails if I verify the opposite.

          But what if you verify what you're trying to test, and then revert the code so it breaks? I think the way you're using the spy right now will always think there are no interactions.

          Show
          Todd Lipcon added a comment - The test fails if I verify the opposite. But what if you verify what you're trying to test, and then revert the code so it breaks? I think the way you're using the spy right now will always think there are no interactions.
          Hide
          bc Wong added a comment -

          Ah, I misunderstood the documentation. You're right. Good catch.

          Show
          bc Wong added a comment - Ah, I misunderstood the documentation. You're right. Good catch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12437622/HDFS-1001.patch
          against trunk revision 916902.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/258/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/258/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/258/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/258/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12437622/HDFS-1001.patch against trunk revision 916902. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/258/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/258/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/258/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/258/console This message is automatically generated.
          Hide
          bc Wong added a comment -

          Test failure (in TestSecurityTokenEditLog) seems to be HDFS-1015.

          Show
          bc Wong added a comment - Test failure (in TestSecurityTokenEditLog) seems to be HDFS-1015 .
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12438935/HDFS-1001-rebased.patch
          against trunk revision 923467.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/271/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/271/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/271/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/271/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12438935/HDFS-1001-rebased.patch against trunk revision 923467. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/271/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/271/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/271/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/271/console This message is automatically generated.
          Hide
          Todd Lipcon added a comment -

          The body of patch looks good to me. But could we merge TestClientBlockVerification and the new TestDataXceiver? I recall you made the new test so you could be in the server.datanode package, but could the cases of TestClientBlockVerification move in here too? If not, maybe we can at least share a bit of the test code (most of the code except for the test case itself is duplicated)

          Show
          Todd Lipcon added a comment - The body of patch looks good to me. But could we merge TestClientBlockVerification and the new TestDataXceiver? I recall you made the new test so you could be in the server.datanode package, but could the cases of TestClientBlockVerification move in here too? If not, maybe we can at least share a bit of the test code (most of the code except for the test case itself is duplicated)
          Hide
          bc Wong added a comment -

          Factored out common test code.

          Show
          bc Wong added a comment - Factored out common test code.
          Hide
          bc Wong added a comment -

          Rebased patch HDFS-1001-3.patch

          Show
          bc Wong added a comment - Rebased patch HDFS-1001 -3.patch
          Hide
          bc Wong added a comment -

          Ousp. Previous patch was in reverse direction.

          Show
          bc Wong added a comment - Ousp. Previous patch was in reverse direction.
          Hide
          Eli Collins added a comment -

          +1 Nice change.

          Nits:

          • in DataNode "DataNode always expects" should read "always checks" since the response is optional.
          • Would rename readCasually to something like readBytesCheckEOS
          • In DataXceiver "from client" should read "from the client"
          Show
          Eli Collins added a comment - +1 Nice change. Nits: in DataNode "DataNode always expects" should read "always checks" since the response is optional. Would rename readCasually to something like readBytesCheckEOS In DataXceiver "from client" should read "from the client"
          Hide
          bc Wong added a comment -

          New -4 patch per Eli's comments.

          Show
          bc Wong added a comment - New -4 patch per Eli's comments.
          Hide
          bc Wong added a comment -

          Note that this change does not increase the client read latency. The existing code is already sending CHECKSUM_OK even for partial block read. The fix codifies this behaviour and makes sure that the DN side expects this CHECKSUM_OK.

          Another way to fix this bug is to make the client not send CHECKSUM_OK for partial block reads. However, that complicates the protocol and makes it unclean. The cost of sending the CHECKSUM_OK is small enough. It is one extra syscall. But it's not network latency since the client doesn't need a response.

          Show
          bc Wong added a comment - Note that this change does not increase the client read latency. The existing code is already sending CHECKSUM_OK even for partial block read. The fix codifies this behaviour and makes sure that the DN side expects this CHECKSUM_OK. Another way to fix this bug is to make the client not send CHECKSUM_OK for partial block reads. However, that complicates the protocol and makes it unclean. The cost of sending the CHECKSUM_OK is small enough. It is one extra syscall. But it's not network latency since the client doesn't need a response.
          Hide
          Eli Collins added a comment -

          Another way to fix this bug is to make the client not send CHECKSUM_OK for partial block reads. However, that complicates the protocol and makes it unclean.

          Agree the way you currently have it is better. Latest patch looks good to me.

          Show
          Eli Collins added a comment - Another way to fix this bug is to make the client not send CHECKSUM_OK for partial block reads. However, that complicates the protocol and makes it unclean. Agree the way you currently have it is better. Latest patch looks good to me.
          Hide
          Eli Collins added a comment -

          Patch attached rebased against trunk.

          Show
          Eli Collins added a comment - Patch attached rebased against trunk.
          Hide
          Eli Collins added a comment -

          Attached a patch fixes a warning in the last patch.

          Show
          Eli Collins added a comment - Attached a patch fixes a warning in the last patch.
          Hide
          Eli Collins added a comment -

          Unit test results look the same as trunk before the latest patch.

               [exec] 
               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 9 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec] 
               [exec]     +1 system test framework.  The patch passed system test framework compile.
               [exec] 
          
          Show
          Eli Collins added a comment - Unit test results look the same as trunk before the latest patch. [exec] [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 9 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] +1 system test framework. The patch passed system test framework compile. [exec]
          Hide
          Eli Collins added a comment -

          I've committed this to trunk and branch 22. Thanks bc!

          Show
          Eli Collins added a comment - I've committed this to trunk and branch 22. Thanks bc!

            People

            • Assignee:
              bc Wong
              Reporter:
              bc Wong
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development