Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: test
    • Labels: None
    • Hadoop Flags: Reviewed
    Attachments

    1. h1923_20110527.patch (4 kB, Tsz Wo Nicholas Sze)

        Issue Links

        This issue relates to HADOOP-7270

        Activity

        Matt Foley created issue -
        Matt Foley made changes -
        Field Original Value New Value
        Component/s test [ 12312916 ]
        Matt Foley added a comment -

        Test case TestFiDataTransferProtocol2.pipeline_Fi_29() has failed in builds
        409, 425, 455, 460, 464, 469, 481, 483, 484.

        Failure mode varies.

        Tsz Wo Nicholas Sze made changes -
        Link This issue relates to HADOOP-7270 [ HADOOP-7270 ]
        Tsz Wo Nicholas Sze added a comment -

        In some cases, the failures were caused by

        2011-05-11 20:14:27,257 ERROR datanode.DataNode (DataXceiver.java:run(133)) - 127.0.0.1:37620:DataXceiver
        java.lang.NullPointerException
        	at org.apache.hadoop.ipc.Server$Listener.getAddress(Server.java:518)
        

        Filed HADOOP-7270 earlier.
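        For reference, a minimal sketch of the failure pattern behind that trace, assuming (per HADOOP-7270) that the listener's accept channel is nulled once the RPC server stops; the field and method below only approximate the ipc.Server internals and are illustrative:

        import java.net.InetSocketAddress;
        import java.nio.channels.ServerSocketChannel;

        // Illustrative sketch, not the actual ipc.Server code: getAddress()
        // dereferences the accept channel, which shutdown sets to null, so a
        // concurrent caller can hit a NullPointerException.
        class ListenerSketch {
            private volatile ServerSocketChannel acceptChannel; // nulled on stop

            InetSocketAddress getAddress() {
                // NPE here if the server already stopped and acceptChannel == null:
                return (InetSocketAddress) acceptChannel.socket().getLocalSocketAddress();
            }
        }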

        Tsz Wo Nicholas Sze added a comment -

        In some other cases, it failed with

        Failed to add a datanode: nodes.length != original.length + 1, nodes=[127.0.0.1:51603], original=[127.0.0.1:51603]
        java.io.IOException: Failed to add a datanode: nodes.length != original.length + 1, nodes=[127.0.0.1:51603], original=[127.0.0.1:51603]
        	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:768)
        

        It seems that there are not enough datanodes. I will check this.
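        A self-contained model of the check behind this message (the real logic lives in DFSOutputStream.DataStreamer.findNewDatanode; the class below is a hypothetical reconstruction for illustration):

        import java.io.IOException;
        import java.util.Arrays;

        // Simplified model of the invariant enforced when the client asks the
        // namenode for a replacement pipeline node: the returned pipeline must
        // be exactly one node longer than the original. On a cluster with no
        // spare datanode, the lengths match and the write fails.
        public class FindNewDatanodeCheck {
            static void check(String[] nodes, String[] original) throws IOException {
                if (nodes.length != original.length + 1) {
                    throw new IOException("Failed to add a datanode:"
                        + " nodes.length != original.length + 1, nodes=" + Arrays.asList(nodes)
                        + ", original=" + Arrays.asList(original));
                }
            }

            public static void main(String[] args) throws IOException {
                // Reproduces the failure above: the namenode returned the same
                // single node, so no new datanode was actually added.
                check(new String[] {"127.0.0.1:51603"}, new String[] {"127.0.0.1:51603"});
            }
        }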

        Tsz Wo Nicholas Sze made changes -
        Assignee Tsz Wo (Nicholas), SZE [ szetszwo ]
        Tsz Wo Nicholas Sze added a comment -

        There are also some cases that it failed with

        java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:54033 remote=/127.0.0.1:54027]
        	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        

        It seems that the test sleeps too long when randomizing datanode speed.
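        To illustrate the hazard: if the injected delay can approach the reader's socket timeout, the downstream datanode gives up waiting and the pipeline collapses. A hedged sketch (MAX_SLEEP and the fault hook are illustrative names, not the test's actual ones):

        import java.util.Random;

        // Illustrative randomized-sleep fault action: with MAX_SLEEP near the
        // 10,000 ms socket read timeout seen in the trace above, long draws
        // (or several in a row) starve the downstream peer and trigger
        // SocketTimeoutException.
        class RandomSleepFault {
            static final int MAX_SLEEP = 10000; // too close to the read timeout
            private final Random random = new Random();

            void inject() throws InterruptedException {
                Thread.sleep(random.nextInt(MAX_SLEEP)); // may exceed what the peer tolerates
            }
        }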

        Tsz Wo Nicholas Sze added a comment -

        h1923_20110527.patch: added one more datanode in the cluster and used a shorter sleep time period.
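        The shape of the change, sketched from the patch description (constants and cluster setup are approximations, not the patch verbatim):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hdfs.MiniDFSCluster;

        // Illustrative sketch of the two adjustments described above.
        public class PipelineTestSetupSketch {
            static final int REPLICATION = 3;
            // 1. Keep random sleeps well under the 10 s socket read timeout:
            static final int MAX_SLEEP = 1000;

            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // 2. One extra datanode beyond what a single recovery needs, so a
                //    second pipeline failure still finds a spare node to recruit:
                MiniDFSCluster cluster = new MiniDFSCluster(conf, REPLICATION + 2, true, null);
                try {
                    // ... run the write-pipeline fault-injection scenario ...
                } finally {
                    cluster.shutdown();
                }
            }
        }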

        Tsz Wo Nicholas Sze made changes -
        Attachment h1923_20110527.patch [ 12480726 ]
        Tsz Wo Nicholas Sze made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12480726/h1923_20110527.patch
        against trunk revision 1128542.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.server.namenode.TestBackupNode

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/654//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/654//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/654//console

        This message is automatically generated.

        Tsz Wo Nicholas Sze added a comment -

        TestFiDataTransferProtocol2 passed but then TestBackupNode failed.

        Konstantin Boudnik added a comment -

        Nicholas, TestBackupNode shouldn't be affected by your patch - it has to be something unrelated, no?

        Tsz Wo Nicholas Sze added a comment -

        Of course, TestBackupNode has nothing to do with my patch. I thought that my patch would get an overall +1 from Hudson.

        Todd Lipcon added a comment -

        Hey Nicholas. I understand the reasoning to make the MAX_SLEEP lower, but can you explain why we need one more datanode?

        Tsz Wo Nicholas Sze added a comment -

        There may be multiple failures in some cases. That requires more datanodes; otherwise, adding a datanode to the write pipeline will fail.

        The SocketTimeoutException will definitely cause multiple failures. I am not sure if there are other cases.
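        (To make the arithmetic concrete: with replication 3 on a 4-node cluster, the first failure consumes the only spare node, so a second failure leaves findNewDatanode with no candidate; a fifth datanode covers two failures.)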

        Todd Lipcon added a comment -

        hmm, so what's happening is:

        • we have a pipeline of 3
        • one fails, and it recruits DN #4
        • another fails, and it can't recruit any other DN, so test fails?

        If so, that seems like a problem with the new "recruit a new datanode" feature – if it can't find any new datanode, but it still has at least dfs.replication.min replicas in the pipeline, shouldn't it continue with the reduced pipeline?

        Tsz Wo Nicholas Sze added a comment -

        The "recruit a new datanode" feature is triggered according to the configured policy and it can be disabled. If the client enables it and configures a policy, we should honor that.
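        For context, a hedged sketch of how a client controls this, assuming the dfs.client.block.write.replace-datanode-on-failure.* keys from hdfs-default.xml of this era (verify the names against your release):

        import org.apache.hadoop.conf.Configuration;

        // Hedged sketch: per-client control of the "recruit a new datanode"
        // feature. Key names are assumptions; check hdfs-default.xml.
        public class ReplaceDatanodePolicySketch {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Feature switch:
                conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
                // NEVER: this client never asks for a replacement datanode, even
                // when the feature is available cluster-wide.
                conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
            }
        }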

        Tsz Wo Nicholas Sze added a comment -

        Todd, so do you think the patch is good?

        Todd Lipcon added a comment -

        Hey Nicholas. Those semantics are kind of surprising to me. As I remember the motivation for the feature, the idea is to reduce the likelihood that a long-running pipeline completely fails. Given that, it seems preferable to continue with 2 or 1 replicas in the pipeline if there are no more nodes left in the cluster, rather than fail if it can't recruit a new one.

        That is to say, a 4-node cluster is actually going to have pipeline failures more often with the feature enabled than with it disabled, statistically speaking.

        That said, that's a separate question than this patch. So, if you believe the feature is working as designed, then +1.

        Tsz Wo Nicholas Sze added a comment -

        > That is to say, if you had a 4-node cluster with the feature enabled, it's actually going to have pipeline failures more often with the feature enabled than with the feature disabled, statistically speaking.

        Even if the feature is enabled, individual users can set the policy to NEVER so that it won't be triggered. The feature promises to add datanodes; if a user configures it but datanodes cannot be added for some reason, I think we should fail the operation.

        Thanks for the review.

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #715 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/715/)
        HDFS-1923. In TestFiDataTransferProtocol2, reduce random sleep time period and increase the number of datanodes.

        szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1132698
        Files :

        • /hadoop/hdfs/trunk/src/test/aop/org/apache/hadoop/hdfs/server/datanode/TestFiDataTransferProtocol2.java
        • /hadoop/hdfs/trunk/CHANGES.txt
        Tsz Wo Nicholas Sze added a comment -

        I have committed this.

        Tsz Wo Nicholas Sze made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Fix Version/s 0.23.0 [ 12315571 ]
        Resolution Fixed [ 1 ]
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #699 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/699/)

        Arun C Murthy made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition                    Time In Source Status   Execution Times   Last Executer         Last Execution Date
        Open -> Patch Available       16d 38m                 1                 Tsz Wo Nicholas Sze   28/May/11 02:19
        Patch Available -> Resolved   9d 16h 2m               1                 Tsz Wo Nicholas Sze   06/Jun/11 18:22
        Resolved -> Closed            161d 7h 30m             1                 Arun C Murthy         15/Nov/11 00:53

          People

          • Assignee: Tsz Wo Nicholas Sze
          • Reporter: Matt Foley
          • Votes: 0
          • Watchers: 1
