Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-2021

TestWriteRead failed with inconsistent visible length of a file

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: datanode
    • Labels:
      None
    • Environment:

      Linux RHEL5

    • Hadoop Flags:
      Reviewed

      Description

      The junit test failed when iterates a number of times with larger chunk size on Linux. Once a while, the visible number of bytes seen by a reader is slightly less than what was supposed to be.

      When run with the following parameter, it failed more often on Linux ( as reported by John George) than my Mac:
      private static final int WR_NTIMES = 300;
      private static final int WR_CHUNK_SIZE = 10000;

      Adding more debugging output to the source, this is a sample of the output:
      Caused by: java.io.IOException: readData mismatch in byte read: expected=2770000 ; got 2765312
      at org.apache.hadoop.hdfs.TestWriteRead.readData(TestWriteRead.java:141)

      1. HDFS-2021-2.patch
        2 kB
        John George
      2. HDFS-2021.patch
        1 kB
        John George

        Activity

        Hide
        John George added a comment -

        The responder thread in block receiver sends an ack back to its upstream and then sets numBytesAcked. When running DN,NN and client on a single node machine, this causes a race wherein the client gets an ack back and sends a request to DN to get the length of visiblebytes, but since "numBytesAcked" is not yet set on DN, getVisibleBytes() returns the previous value, which is incorrect.

        The patch sets the value of numBytesAcked before sending out the ack, instead of waiting for the ack to succeed. It does not seem like it matters whether the "ack" was sent successfully or not, to set numBytesAcked.

        Show
        John George added a comment - The responder thread in block receiver sends an ack back to its upstream and then sets numBytesAcked. When running DN,NN and client on a single node machine, this causes a race wherein the client gets an ack back and sends a request to DN to get the length of visiblebytes, but since "numBytesAcked" is not yet set on DN, getVisibleBytes() returns the previous value, which is incorrect. The patch sets the value of numBytesAcked before sending out the ack, instead of waiting for the ack to succeed. It does not seem like it matters whether the "ack" was sent successfully or not, to set numBytesAcked.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12481034/HDFS-2021.patch
        against trunk revision 1129942.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.server.namenode.TestBlockTokenWithDFS
        org.apache.hadoop.hdfs.TestDatanodeDeath
        org.apache.hadoop.hdfs.TestDFSUpgradeFromImage
        org.apache.hadoop.hdfs.TestReplaceDatanodeOnFailure

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/670//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/670//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/670//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12481034/HDFS-2021.patch against trunk revision 1129942. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.server.namenode.TestBlockTokenWithDFS org.apache.hadoop.hdfs.TestDatanodeDeath org.apache.hadoop.hdfs.TestDFSUpgradeFromImage org.apache.hadoop.hdfs.TestReplaceDatanodeOnFailure +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/670//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/670//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/670//console This message is automatically generated.
        Hide
        Daryn Sharp added a comment -

        I noticed that you omitted the conditional replyAck.isSuccess() when you moved the code block that updates the bytesAcked. The isSuccess() isn't tied to whether the ack was successfully sent upstream, but rather whether the downstreams were all successful, thus is seems like the conditional should be reinserted to preserve the current behavior. Changing the overall logic seems fraught with peril...

        That said, I'm a bit confused about why a datanode updates its bytesAcked iff all downstreams are successful. The datanode received and wrote those bytes so it seems like the conditional isn't needed in either case. Unless... bytesAcked is intended to track exactly how many bytes were written throughout the entire pipeline. I'd think that a pipeline should write as much as it can even if downstreams are lost, then backfill the under-replicated blocks. To satisfy curiosity, perhaps someone with more knowledge of the code will comment.

        Show
        Daryn Sharp added a comment - I noticed that you omitted the conditional replyAck.isSuccess() when you moved the code block that updates the bytesAcked . The isSuccess() isn't tied to whether the ack was successfully sent upstream, but rather whether the downstreams were all successful, thus is seems like the conditional should be reinserted to preserve the current behavior. Changing the overall logic seems fraught with peril... That said, I'm a bit confused about why a datanode updates its bytesAcked iff all downstreams are successful. The datanode received and wrote those bytes so it seems like the conditional isn't needed in either case. Unless... bytesAcked is intended to track exactly how many bytes were written throughout the entire pipeline. I'd think that a pipeline should write as much as it can even if downstreams are lost, then backfill the under-replicated blocks. To satisfy curiosity, perhaps someone with more knowledge of the code will comment.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        Daryn, I agree that we need replyAck.isSuccess().

        That said, I'm a bit confused about why a datanode updates its bytesAcked iff all downstreams are successful. ... bytesAcked is intended to track exactly how many bytes were written throughout the entire pipeline ...

        You are totally correct that it is the intention; see Section 3.3 in the Append Design Doc in HDFS-265.

        Show
        Tsz Wo Nicholas Sze added a comment - Daryn, I agree that we need replyAck.isSuccess() . That said, I'm a bit confused about why a datanode updates its bytesAcked iff all downstreams are successful. ... bytesAcked is intended to track exactly how many bytes were written throughout the entire pipeline ... You are totally correct that it is the intention; see Section 3.3 in the Append Design Doc in HDFS-265 .
        Hide
        John George added a comment -

        attached a newer patch with the comment from Daryn and also modified TestWriteRead.java to add the unit test for this.

        Show
        John George added a comment - attached a newer patch with the comment from Daryn and also modified TestWriteRead.java to add the unit test for this.
        Hide
        Daryn Sharp added a comment -

        +1
        Looks good. Presumably increasing the number of writes and the chunk size is to more easily induce the problem. I hope it doesn't add much runtime to the test suite...

        Show
        Daryn Sharp added a comment - +1 Looks good. Presumably increasing the number of writes and the chunk size is to more easily induce the problem. I hope it doesn't add much runtime to the test suite...
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12481127/HDFS-2021-2.patch
        against trunk revision 1130262.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestDFSUpgradeFromImage

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/675//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/675//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/675//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12481127/HDFS-2021-2.patch against trunk revision 1130262. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.TestDFSUpgradeFromImage +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/675//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/675//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/675//console This message is automatically generated.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        The failure of TestDFSUpgradeFromImage is not related.

        Thanks Daryn for reviewing the patches.

        I have committed this. Thanks, John!

        Show
        Tsz Wo Nicholas Sze added a comment - The failure of TestDFSUpgradeFromImage is not related. Thanks Daryn for reviewing the patches. I have committed this. Thanks, John!
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #685 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/685/)
        HDFS-2021. Update numBytesAcked before sending the ack in PacketResponder. Contributed by John George

        szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1130339
        Files :

        • /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
        • /hadoop/hdfs/trunk/src/test/hdfs/org/apache/hadoop/hdfs/TestWriteRead.java
        • /hadoop/hdfs/trunk/CHANGES.txt
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #685 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/685/ ) HDFS-2021 . Update numBytesAcked before sending the ack in PacketResponder. Contributed by John George szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1130339 Files : /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java /hadoop/hdfs/trunk/src/test/hdfs/org/apache/hadoop/hdfs/TestWriteRead.java /hadoop/hdfs/trunk/CHANGES.txt
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #704 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/704/)

        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #704 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/704/ )

          People

          • Assignee:
            John George
            Reporter:
            CW Chung
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development