Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.20.205.0, 0.22.0, 0.23.0
    • Component/s: ipc
    • Labels: None

      Description

       Currently Hadoop RPC does not time out while the RPC server is alive. Instead, an RPC client sends a ping to the server whenever a socket timeout happens; if the server is still alive, the client continues to wait rather than throwing a SocketTimeoutException. This avoids clients retrying against a busy server and making it even busier, and it works well when the RPC server is the NameNode.

       But Hadoop RPC is also used for some client-to-DataNode communications, for example, getting a replica's length. When a client comes across a problematic DataNode, it gets stuck and cannot switch to a different DataNode. In such cases, it would be better for the client to receive a timeout exception.

       I plan to add a new configuration property, ipc.client.max.pings, that specifies the maximum number of pings a client may try. If no response is received after the specified maximum number of pings, a SocketTimeoutException is thrown. If this property is not set, the client keeps the current semantics and waits forever.
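
       As a rough illustration of the ping-on-timeout behavior and the proposed bounded-ping change, here is a minimal, self-contained Java sketch. It is not Hadoop's actual org.apache.hadoop.ipc.Client code; the socket setup, the ping marker value, the ping interval, and the maxPings parameter are assumptions made for the example.

         import java.io.DataInputStream;
         import java.io.DataOutputStream;
         import java.io.IOException;
         import java.net.Socket;
         import java.net.SocketTimeoutException;

         public class BoundedPingSketch {
           private static final int PING_INTERVAL_MS = 60_000; // assumed ping interval
           private static final int PING_CALL_ID = -1;         // assumed marker value for a ping

           /**
            * Waits for a single int response. On each socket timeout a ping is sent and
            * the wait continues (current semantics); once maxPings timeouts have elapsed,
            * the SocketTimeoutException is rethrown (proposed semantics). maxPings <= 0
            * keeps the current behavior of waiting forever.
            */
           static int readResponse(Socket socket, int maxPings) throws IOException {
             socket.setSoTimeout(PING_INTERVAL_MS);
             DataInputStream in = new DataInputStream(socket.getInputStream());
             DataOutputStream out = new DataOutputStream(socket.getOutputStream());
             int pings = 0;
             while (true) {
               try {
                 return in.readInt();            // the RPC response we are waiting for
               } catch (SocketTimeoutException e) {
                 if (maxPings > 0 && ++pings >= maxPings) {
                   throw e;                      // give up and surface the timeout
                 }
                 out.writeInt(PING_CALL_ID);     // server assumed alive: ping and keep waiting
                 out.flush();
               }
             }
           }
         }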

      1. HADOOP-6889-fortrunk-2.patch
        11 kB
        John George
      2. HADOOP-6889-fortrunk.patch
        11 kB
        John George
      3. HADOOP-6889-for-20security.patch
        12 kB
        John George
      4. HADOOP-6889-for20.3.patch
        28 kB
        Matt Foley
      5. HADOOP-6889-for20.2.patch
        28 kB
        Ravi Prakash
      6. HADOOP-6889-for20.patch
        24 kB
        John George
      7. HADOOP-6889.patch
        24 kB
        John George
      8. ipcTimeout2.patch
        23 kB
        Hairong Kuang
      9. ipcTimeout1.patch
        23 kB
        Hairong Kuang
      10. ipcTimeout.patch
        22 kB
        Hairong Kuang

        Issue Links

          Activity

          Matt Foley added a comment -

          Closed upon release of 0.20.205.0

          John George added a comment -

          Resolving as fixed since this was fixed in all the expected target versions.

          John George added a comment -

          Removing 20append from Target Versions since we have not heard back from Hairong. If needed in 20-append, please feel free to re-open this JIRA.

          Matt Foley added a comment -

          If no one is going to port this to 0.20-append, we should remove that version from the "Target Versions" list and close this jira.

          Tsz Wo Nicholas Sze added a comment -

          I think we should port this to 0.20-append. Otherwise the branch is unusable.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #802 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/802/)
          HDFS-1330 and HADOOP-6889. Added additional unit tests. Contributed by John George.

          mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163463
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestInterDatanodeProtocol.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #777 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/777/)
          HDFS-1330 and HADOOP-6889. Added additional unit tests. Contributed by John George.

          mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163463
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestInterDatanodeProtocol.java
          Matt Foley added a comment -

          Hairong, can we close this, or do you want it held open for a corresponding patch to 0.20-append?
          Thanks.

          Matt Foley added a comment -

          +1 for code review. Thanks, John!
          Committed to trunk.

          Also asked Arun whether he wanted this in branch-0.23; he said yes.
          Committed to v23.

          Matt Foley added a comment -

          Submitted patch to HDFS-1330 to get correct run of test-patch. Passed; please see HDFS-1330 for details.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #821 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/821/)
          HDFS-1330 and HADOOP-6889. Added additional unit tests. Contributed by John George.

          mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163463
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestInterDatanodeProtocol.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #888 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/888/)
          HDFS-1330 and HADOOP-6889. Added additional unit tests. Contributed by John George.

          mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163463
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestInterDatanodeProtocol.java
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #811 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/811/)
          HDFS-1330 and HADOOP-6889. Added additional unit tests. Contributed by John George.

          mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163463
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestInterDatanodeProtocol.java
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12491848/HADOOP-6889-fortrunk-2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/91//console

          This message is automatically generated.

          John George added a comment -

          Changing timeout in TestDFSClientRetries to 500 rather than 1000

          Matt Foley added a comment -

          That was a +1 on the patch for branch-0.20-security.

          Matt Foley added a comment -

          Unit test update committed to branch-0.20-security. Thanks, John!

          Matt Foley added a comment -

          John, please re-submit the trunk patch to trigger CI build.
          Also, for consistency, could you please set the timeout in TestDFSClientRetries to 500 rather than 1000, to match the other test cases, unless there's a reason it needs to be different?
          Thanks.

          Matt Foley added a comment -

          Regarding patchfile names: please note that all of these attachments have been targeted for branch-0.20-security, despite their unfortunate names (I also made a naming error with "20.3"): HADOOP-6889.patch, HADOOP-6889-for20.patch, HADOOP-6889-for20.2.patch, HADOOP-6889-for20.3.patch, and HADOOP-6889-for-20security.patch. The fact that all of these are branch-0.20-security patches is clear from the comments accompanying each submission.

          John George added a comment -

          The tests included with the previous patch were not testing the exact corresponding change. Hence, a new patch for the tests is included (both for trunk and branch-20-security):

          -----------------------------------------
          branch-20-security patch-test results as follows:

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 7 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
          [exec]

          ---------------------------------------
          trunk patch-test results:
          It was failing on empty patch as well for release audit and system test framework.

           [exec] BUILD FAILED
           [exec] /home/johngeo/hadoop/trunk/hdfs/src/test/aop/build/aop.xml:222: The following error occurred while executing this line:
           [exec] /home/johngeo/hadoop/trunk/hdfs/src/test/aop/build/aop.xml:203: The following error occurred while executing this line:
           [exec] /home/johngeo/hadoop/trunk/hdfs/src/test/aop/build/aop.xml:90: compile errors: 18
           [exec]
           [exec] Total time: 35 seconds

           [exec] +1 tests included. The patch appears to include 8 new or modified tests.
           [exec]
           [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
           [exec]
           [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
           [exec]
           [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
           [exec]
           [exec] -1 release audit. The applied patch generated 2 release audit warnings (more than the trunk's current 0 warnings).
           [exec]
           [exec] -1 system test framework. The patch failed system test framework compile.

          Matt Foley added a comment -

          +1 for code review. Committed to 0.20-security.

          Ravi / John, thanks for adding those two new test cases, they are just what the doctor ordered.

          Can you please port them up to v23 trunk, before we close this jira? Thanks.

          Matt Foley added a comment -

          Patch is for 0.20-security branch.

          Ravi Prakash added a comment -

          Hi Matt, Thanks for the changes! Yes I ran the tests again and they all passed. Please commit the patch.

          +1

          Matt Foley added a comment -

          Thanks Ravi. The following things needed cleanup:

          • Comments for the two new methods needed fixing after copy from earlier source.
          • Wrong LOG stream in TestDFSClientRetries.testDFSClientTimeout().
          • Spurious imports (probably snuck in due to Eclipse).

          Please review attached, and re-run the test. If okay with you we can commit.

          Ravi Prakash added a comment -

          Results from test-patch

               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 12 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
               [exec] 
               [exec] 
               [exec] 
               [exec] 
               [exec] ======================================================================
               [exec] ======================================================================
               [exec]     Finished build.
               [exec] ======================================================================
               [exec] ======================================================================
          
          
          Ravi Prakash added a comment -

          Thanks for your review and comments Matt.

          1. I see a single new test case, TestIPC.testIpcTimeout(), that tests the lowest-level timeout functionality, between a client and a TestServer server. However, I do not see any test cases that check whether the integration of that timeout functionality with, eg, the InterDatanodeProtocol works as expected. (The mod to TestInterDatanodeProtocol merely adapts to the change, it does not test the change.) Similarly, no test of timeout in the context of DFSClient with a MiniDFSCluster. Granted the original patch to trunk doesn't test these either. But do you feel confident in the patch without such additional tests, and why?


          I'm uploading a new patch with the added tests on behalf of John George.

          2. Are the variances between the trunk and v20 patches due only to code tree divergence, or are there changes added to the v20 patch that are not in v23 and perhaps should be? Thanks.

          John told me the variances are indeed only because of the tree divergence.

          Matt Foley added a comment -

          Unfortunately this patch diverges a lot from the trunk patch (presumably because of 0.20/0.23 code tree divergence, of course), so I could not usefully diff the patches and had to review this like a new patch.

          In terms of code review, I found no problems. But it's a large enough patch that we are dependent on thorough unit testing to be confident in the patch. So I have two questions:

          1. I see a single new test case, TestIPC.testIpcTimeout(), that tests the lowest-level timeout functionality, between a client and a TestServer server. However, I do not see any test cases that check whether the integration of that timeout functionality with, eg, the InterDatanodeProtocol works as expected. (The mod to TestInterDatanodeProtocol merely adapts to the change, it does not test the change.) Similarly, no test of timeout in the context of DFSClient with a MiniDFSCluster. Granted the original patch to trunk doesn't test these either. But do you feel confident in the patch without such additional tests, and why?

          2. Are the variances between the trunk and v20 patches due only to code tree divergence, or are there changes added to the v20 patch that are not in v23 and perhaps should be? Thanks.

          Matt Foley added a comment -

          @Uma, I agree that the new param you suggest should be dealt with in a separate ticket, not in this long-closed ticket, which has been re-opened just for a back-port to the sustaining branch. Thanks.

          Matt Foley added a comment -

          I just checked and confirmed that HADOOP-6907 was submitted to 0.20-security sustaining "trunk" just before the 20.203 branch was created. Updated the "Fix Versions" of HADOOP-6907 to reflect that correctly.

          John George added a comment -

          Thanks for the pointer Suresh.
          Seems like HADOOP-6907 is already in branch-0.20-security.

          Suresh Srinivas added a comment -

          John, there was a related change - HADOOP-6907, are you planning to port that to 0.20.s as well?

          John George added a comment -

          hey Uma, I have responded to your questions in HDFS-1880

          Uma Maheswara Rao G added a comment -

          Hi John,

          I have seen that waitForProxy passes 0 as rpcTimeout; it is a hardcoded value.

          return waitForProtocolProxy(protocol, clientVersion, addr, conf, 0, connTimeout);
          

          If a user wants to control this value, how can it be configured?

          Here we have a situation where clients are waiting for a long time: HDFS-1880.

          I thought HADOOP-6889 could solve that problem, but how can this be controlled by the user in Hadoop (it looks like no configuration parameters are available)?

          I plan to add a new configuration ipc.client.max.pings that specifies the max number of pings that a client could try. If a response can not be received after the specified max number of pings, a SocketTimeoutException is thrown. If this configuration property is not set, a client maintains the current semantics, waiting forever.

          We have chosen this implementation for our cluster.

          I am just checking whether I can use rpcTimeout itself for control (since this change is already committed).

          Can you please clarify more?

          Can you just check HDFS-1880?

          @Hairong

          I thought about introducing a configuration parameter. But clients or DataNodes want to have timeout for RPCs to DataNodes but no timeout for RPCs to NameNodes. Adding a rpcTimeout parameter makes this easy.

          I think that, considering HA, clients talking to the NameNode also require some timeout. If the active NameNode goes down, clients should not be stuck waiting, right?

          John George added a comment -

          Output of test-patch on my desktop:

           [exec] [findbugs] Classes needed for analysis were missing
           [exec] [findbugs] Output saved to /home/johngeo/hadoop/branch-0.20-security/build/test/findbugs/hadoop-findbugs-report.xml
           [exec] [xslt] Processing /home/johngeo/hadoop/branch-0.20-security/build/test/findbugs/hadoop-findbugs-report.xml to /home/johngeo/hadoop/branch-0.20-security/build/test/findbugs/hadoop-findbugs-report.html
           [exec] [xslt] Loading stylesheet /hadoop/config/findbugs-1.3.9/src/xsl/default.xsl
           [exec]
           [exec] BUILD SUCCESSFUL
           [exec] Total time: 6 minutes 33 seconds
           [exec]
           [exec] +1 overall.
           [exec]
           [exec] +1 @author. The patch does not contain any @author tags.
           [exec]
           [exec] +1 tests included. The patch appears to include 9 new or modified tests.
           [exec]
           [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
           [exec]
           [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
           [exec]
           [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
           [exec]
           [exec] ======================================================================
           [exec] ======================================================================
           [exec] Finished build.
           [exec] ======================================================================
           [exec] ======================================================================

          John George added a comment -

          It was not my intention to remove "0.20-append" and "0.22" from the Fix Versions, so adding them back...

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12487319/HADOOP-6889-for20.patch
          against trunk revision 1148933.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/756//console

          This message is automatically generated.

          John George added a comment -

          Attaching the patch with a new name so that it is obvious it is for .20.

          John George added a comment -

          Attaching a patch for .20-branch-security. Could someone please review this?

          John George added a comment -

          I would like to port this to branch-20-security.

          Uma Maheswara Rao G added a comment -

          Can you please clarify more?

          Uma Maheswara Rao G added a comment -

          Hey Hairong,

          I have seen that waitForProxy passes 0 as rpcTimeout; it is a hardcoded value.

          
          return waitForProtocolProxy(protocol, clientVersion, addr, conf, 0, connTimeout);
          
          

          If a user wants to control this value, how can it be configured?

          Here we have a situation where clients are waiting for a long time: HDFS-1880.

          I thought this issue could solve that problem, but how can this be controlled by the user in Hadoop?

          I plan to add a new configuration ipc.client.max.pings that specifies the max number of pings that a client could try. If a response can not be received after the specified max number of pings, a SocketTimeoutException is thrown. If this configuration property is not set, a client maintains the current semantics, waiting forever.

          We have chosen this implementation for our cluster.

          I am just checking whether I can use rpcTimeout itself for control (since this change is already committed).

          Can you please clarify more?

          Kan Zhang added a comment -

          This patch's hashCode() implementation also changed the UGI semantics. It used to be that two UGIs having the same Subject object are equal; now they are not. I hope to fix both issues in HADOOP-6907.

          Hairong Kuang added a comment -

          It looks like Java returns the same value for every call of System.identityHashCode as long as the int x remains the same. But I feel more comfortable using x as the hashcode directly.

          Either ^ or + should be fine, but I think we should consistently use one or the other.

          Kan Zhang added a comment -

          If we can confirm Java always uses the same object for the same integer, your code is correct. Otherwise, I'm fine with using rpcTimeout as hashcode. I've seen recommendations on computing hashcode iteratively as follows.

          hash = PRIME * hash + component_value
          

          Seems we are doing ^ instead of +. Does it matter?

          Hairong Kuang added a comment -

          Kan, thanks for your insight. It seems that you are right: identityHashCode is based on references.

          Do you mind if we simply use rpcTimeout as the hashcode?

          Kan Zhang added a comment -

          PS. Maybe Java optimizes by reusing the same object for the same integer. I don't know if this is something we can depend on.

          Kan Zhang added a comment -

          > I think you can simply use rpcTimeout itself as the hashcode.
          If that's what you intended. I haven't considered its randomness.

          Kan Zhang added a comment -
          -    @Override
          +    @Override  // simply use the default Object#hashcode() ?
               public int hashCode() {
          -      return (address.hashCode() + PRIME * System.identityHashCode(protocol)) ^ 
          -             (ticket == null ? 0 : ticket.hashCode());
          +      return (address.hashCode() + PRIME * (
          +                PRIME * (
          +                  PRIME * System.identityHashCode(protocol) ^
          +                  System.identityHashCode(ticket)
          +                ) ^ System.identityHashCode(rpcTimeout)
          +              ));
          

          I wonder if System.identityHashCode(rpcTimeout) is correct in the above code. I think Java does autoboxing to get an Integer object, and what identityHashCode() returns is based on the identity of that object (it doesn't call the object's hashCode() method). So even if two ConnectionId objects have the same rpcTimeout, you will get two different hash codes. I think you can simply use rpcTimeout itself as the hashcode.
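
          A small, self-contained check of this point (written for illustration, not taken from the patch): for int values outside the default Integer cache, each call to System.identityHashCode autoboxes a fresh Integer, so the result typically differs between calls even for the same value, whereas the int value itself is stable.

            public class IdentityHashCodeDemo {
              public static void main(String[] args) {
                int rpcTimeout = 60000; // outside the default Integer cache (-128..127)

                // Each call autoboxes a new Integer object, so the two identity hash
                // codes will typically differ even though the int value is the same.
                System.out.println(System.identityHashCode(rpcTimeout));
                System.out.println(System.identityHashCode(rpcTimeout));

                // Using the int itself is stable; Integer#hashCode() is just the value.
                System.out.println(Integer.valueOf(rpcTimeout).hashCode()); // 60000
                System.out.println(rpcTimeout);                             // 60000, usable directly in hashCode()
              }
            }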

          Kan Zhang added a comment -

          I see.

          Hairong Kuang added a comment -

          Kan, thanks for taking a look at the patch!

          I thought about introducing a configuration parameter. But clients and DataNodes want to have a timeout for RPCs to DataNodes but no timeout for RPCs to the NameNode. Adding an rpcTimeout parameter makes this easy.

          Kan Zhang added a comment -

          I know I'm a little late to the party, but I'm just wondering if it's better to pass rpcTimeout using the existing conf param rather than adding a new parameter to getProxy()?

          Hairong Kuang added a comment -

          I've just committed this!

          Hairong Kuang added a comment -

          Hudson is down. I ran ant clean test on my linux box twice. They all passed:

          BUILD SUCCESSFUL
          Total time: 11 minutes 8 seconds

          BUILD SUCCESSFUL
          Total time: 9 minutes 55 seconds

          Hairong Kuang added a comment -

          Resubmitting since the previous submission seemed to be stuck.

          Hairong Kuang added a comment -

          This fixed the failed tests.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12451180/ipcTimeout1.patch
          against trunk revision 981714.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/662/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/662/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/662/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/662/console

          This message is automatically generated.

          Doug Cutting added a comment -

          +1 this patch looks reasonable to me, pending Hudson's scrutiny.

          Hairong Kuang added a comment -

          ipcTimeout1.patch additionally gives waitForProxy an rpcTimeout parameter.

          Hairong Kuang added a comment -

          This patch passes an rpcTimeout parameter to the RPC#getProxy method. A non-positive rpcTimeout means that RPC does not time out, which is the default behavior. If rpcTimeout is positive, an RPC client throws a SocketTimeoutException if it has not received a response within the rpcTimeout period.
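
          As a hedged illustration of why that matters to callers (plain Java, not the DFSClient API; the fetcher abstraction and the address list are assumptions for the sketch): with a positive rpcTimeout the call eventually throws, so a client can fail over to another DataNode instead of waiting forever.

            import java.net.InetSocketAddress;
            import java.net.SocketTimeoutException;
            import java.util.List;

            class FailoverSketch {
              interface ReplicaLengthFetcher {
                // Assumed abstraction for "ask this DataNode for a replica's length",
                // where a positive rpcTimeoutMs leads to SocketTimeoutException.
                long fetchLength(InetSocketAddress datanode, int rpcTimeoutMs)
                    throws SocketTimeoutException;
              }

              /** Tries each DataNode in turn; a stuck node times out instead of hanging the client. */
              static long getReplicaLength(ReplicaLengthFetcher fetcher,
                                           List<InetSocketAddress> datanodes,
                                           int rpcTimeoutMs) throws SocketTimeoutException {
                SocketTimeoutException last = null;
                for (InetSocketAddress dn : datanodes) {
                  try {
                    return fetcher.fetchLength(dn, rpcTimeoutMs);
                  } catch (SocketTimeoutException e) {
                    last = e;   // this DataNode is unresponsive; try the next replica
                  }
                }
                if (last == null) {
                  last = new SocketTimeoutException("no DataNodes to try");
                }
                throw last;     // all replicas timed out
              }
            }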

          Doug Cutting added a comment -

          As I recall, RPC calls used to have a timeout that, combined with the retry proxy, would spiral load on servers. But I think timeouts when retries go to a different server should not have the same problem.

          Todd Lipcon added a comment -

          Hi Hairong. I agree that it should be per-proxy, and also that this is useful. We have a customer who has occasionally run into this with problematic "stuck" datanodes blocking pipeline recovery indefinitely (at least until ops notices the hung machine and powers it down, causing TCP connection to drop).

          Hairong Kuang added a comment -

          On second thought, it would be nice if RPC proxies could be configured differently. For example, one may want a timeout on RPCs to DataNodes but no timeout on RPCs to the NameNode. So I propose to add an additional parameter, maxRpcWaitingTime, to RPC#getProxy, meaning that for every RPC call to this proxy, a SocketTimeoutException is thrown if the client has not received a response within maxRpcWaitingTime. If maxRpcWaitingTime is zero, an RPC call will not time out.

          In this way, a client could configure maxRpcWaitingTime differently for its communications to the NameNode and DataNodes.
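
          For a sense of the per-proxy idea at the plain socket level (a sketch only; the host names, ports, and timeout values are made up, and this is not the RPC#getProxy change itself): Java's SO_TIMEOUT already behaves this way, with 0 meaning "wait forever" and a positive value making reads throw SocketTimeoutException, so one connection can wait indefinitely while another times out.

            import java.io.IOException;
            import java.net.InetSocketAddress;
            import java.net.Socket;

            class PerConnectionTimeoutSketch {
              static Socket connect(String host, int port, int readTimeoutMs) throws IOException {
                Socket s = new Socket();
                s.connect(new InetSocketAddress(host, port));
                // 0 blocks forever on reads; a positive value makes reads throw
                // SocketTimeoutException after that many milliseconds.
                s.setSoTimeout(readTimeoutMs);
                return s;
              }

              static void example() throws IOException {
                // Hypothetical endpoints: no timeout toward the NameNode,
                // a 60-second timeout toward a DataNode.
                try (Socket nn = connect("namenode.example.com", 8020, 0);
                     Socket dn = connect("datanode.example.com", 50020, 60_000)) {
                  // ... issue calls over the sockets ...
                }
              }
            }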


            People

            • Assignee: John George
            • Reporter: Hairong Kuang
            • Votes: 0
            • Watchers: 15
