Hadoop HDFS
HDFS-1787

"Not enough xcievers" error should propagate to client

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.23.0
    • Fix Version/s: None
    • Component/s: datanode
    • Release Note:
      This changes the DataTransferProtocol to return a new error code when a max transfers exceeded message is encountered.

      Description

      We find that users often run into the default transceiver limits in the DN. Putting aside the inherent issues with xceiver threads, it would be nice if the "xceiver limit exceeded" error propagated to the client. Currently, clients simply see an EOFException which is hard to interpret, and have to go slogging through DN logs to find the underlying issue.

      The data transfer protocol should be extended to either have a special error code for "not enough xceivers" or should have some error code for generic errors with which a string can be attached and propagated.
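
      A minimal sketch of the two options (only ERROR_MAX_TRANSFERS(7)
      matches the attached patch; the other names, codes, and framing are
      illustrative):

        // Option 1: a dedicated status code the client can switch on.
        // Option 2: a generic ERROR status followed by a message string
        //           written with Text.writeString(out, errorMessage).
        enum Status {
          SUCCESS(0),
          ERROR(1),                // generic error, message string attached
          ERROR_MAX_TRANSFERS(7);  // dedicated "not enough xceivers" code

          private final int code;
          Status(int code) { this.code = code; }
        }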

      1. hdfs-1787.2.patch
        20 kB
        Jonathan Hsieh
      2. hdfs-1787.3.patch
        19 kB
        Jonathan Hsieh
      3. hdfs-1787.3.patch
        20 kB
        Jonathan Hsieh
      4. hdfs-1787.5.patch
        18 kB
        Jonathan Hsieh
      5. hdfs-1787.patch
        20 kB
        Jonathan Hsieh

          Activity

          Jonathan Hsieh added a comment -

          This patch updates the max transfers/xceivers message so that it gets propagated to the DFS client. I was able to write a reasonable test for the write side, but the read side requires a change to Hadoop Common. FSDataOutputStream on the write side has a getWrappedStream method, but the FSDataInputStream class on the read side does not have or expose an equivalent.
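
          A hedged sketch of the write-side test hook (the cast and the
          counter name come from this patch discussion and may not match
          the final code):

            import static org.junit.Assert.assertTrue;

            // Inside a JUnit test running against a MiniDFSCluster:
            FSDataOutputStream out = fs.create(path);
            // getWrappedStream() exposes the underlying DFSOutputStream,
            // so the test can assert on client-side error state after
            // forcing the xceiver limit.
            DFSOutputStream dfsOut = (DFSOutputStream) out.getWrappedStream();
            // ... write until the DN rejects the transfer ...
            assertTrue(dfsOut.getInternalDNErrorCount() > 0);
            // FSDataInputStream has no such accessor, hence the needed
            // change in Hadoop Common.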

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478976/hdfs-1787.patch
          against trunk revision 1102153.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/499//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/499//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/499//console

          This message is automatically generated.

          Jonathan Hsieh added a comment -

          I will look into these newly failing tests.

          Aaron T. Myers added a comment -

          You're encouraged to look at the test failures, but I'm pretty confident they're unrelated to this patch. Those tests are known to be failing on trunk.

          Jonathan Hsieh added a comment -

          After more investigation, these two may be newly introduced failures.

          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          The other three seem flaky on trunk.

          Aaron T. Myers added a comment -

          Patch looks pretty good, Jon. A few comments:

          1. Looks to me like AtomicInteger maxTransfersErrors in BlockReader.java is unused.
          2. Error message "...exceeding max transfers..." in BlockReader.java should only be indented 4 spaces.
          3. In DFSInputStream.java, I think you should log the "Failed to connect..." error message even if a MaxTransfersExceededException is encountered.
          4. In DFSOutputStream.java, is it really necessary that maxTransferErrors be an AtomicInteger instead of an int? Since this is only used for testing, it doesn't seem like a big deal to me if we miss a few increments, as you do in DFSInputStream.java.
          5. I don't understand the "if (errorIndex >= 0)" check in DFSOutputStream.java. How is this related?
          6. In DFSOutputStream.java, add a space between ">=" and "0".
          7. In DataXceiver.java, please only indent two spaces at "throw new MaxTransfersExceededException", and four spaces for the line continuations.
          8. In DataXceiver.java, please indent two spaces at "sendErrorResponse", and put a space before "t.getMessage()" on the same line.
          9. In DataXceiver.java, indent four spaces at "new DataOutputStream(...)".
          10. In TestDFSClientErrorMessages.java - Hadoop uses commons-logging, not mortbay, for logging output.
            In TestDFSClientErrorMessages.java - there's a bunch of commented-out code and TODOs. Do you intend to implement these as part of this JIRA? Or is this a note for future work?
          Jonathan Hsieh added a comment -

          #2. Some tabs got in there; reformatted to match the surrounding code blocks.

          #3. Ok, just incrementing the counter in this case, and using the previous more verbose message format.

          #4. It needs to be a reference to an integer because it is being incremented in one thread and read by another. A plain Integer is not trustworthy in these situations (it ends up referencing a constant), so I chose AtomicInteger. In the input case, there is only a single thread. Since this should be a rare error condition, I really wouldn't be concerned about its performance.
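
          A hedged sketch of the sharing problem (class and method names
          illustrative):

            import java.util.concurrent.atomic.AtomicInteger;

            class PipelineCounterSketch {
              // Incremented by the ResponseProcessor thread, read by the
              // client/test thread. A boxed Integer cannot be incremented
              // in place, and a plain int write has no cross-thread
              // visibility guarantee; AtomicInteger provides both a
              // mutable cell and safe publication.
              private final AtomicInteger maxTransferErrors = new AtomicInteger(0);

              void onMaxTransferError() { maxTransferErrors.incrementAndGet(); }
              int errorCount() { return maxTransferErrors.get(); }
            }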

          #5. I believe the string slot I use has a different purpose than shipping error messages and normally carries a node name; I hijacked it. If the value is not a node name, this can incur an ArrayIndexOutOfBoundsException (the default index value is initialized to -1).

          More responses pending.

          Jonathan Hsieh added a comment -

          #4 update – I need to point out that the core reason for the integer reference is that it is referenced by a ResponseProcessor, essentially a private inner class that runs as a separate Thread.

          #10 I have filed a new JIRA against Common (HADOOP-7301) and submitted a patch that exposes the necessary methods to run this test. If and when that gets committed, I'll remove the comments.

          Jonathan Hsieh added a comment -

          #5 Rephrase: the string slot in other cases has a different purpose than shipping error messages; normally it carries a datanode name, and I've hijacked it in this particular error case. Without this check, the client would attempt to search the node list, fail to find a matching node name, and leave errorIndex at its initial value of -1, which eventually gets used as an index and in turn incurs an ArrayIndexOutOfBoundsException. This check prevents that problem.
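
          A hedged sketch of that guard (surrounding names illustrative):

            // errorIndex is initialized to -1. When the ERROR payload
            // carries an error message instead of a datanode name, the
            // node lookup never assigns it, so guard before indexing:
            if (errorIndex >= 0) {
              DatanodeInfo badNode = nodes[errorIndex];
              // ... existing pipeline-recovery handling ...
            }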

          Jonathan Hsieh added a comment -

          #6,#7,#8,#9: done.

          #1 is related to #10/HADOOP-7301.

          Jonathan Hsieh added a comment -

          HADOOP-7301 got committed, so I updated the patch to uncomment the TODOs and fix the issues and nits suggested by Aaron Myers.

          Jonathan Hsieh added a comment -

          Uploaded the corrected patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479647/hdfs-1787.2.patch
          against trunk revision 1124364.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/561//console

          This message is automatically generated.

          Jonathan Hsieh added a comment -

          Needed to generate the patch using

          git diff --no-prefix ...

          Tsz Wo Nicholas Sze added a comment -
          +    ERROR_MAX_TRANSFERS(7);
          

          Could we avoid adding a new status in DataTransferProtocol? Otherwise, this is an incompatible change.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479648/hdfs-1787.3.patch
          against trunk revision 1124364.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/562//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/562//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/562//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          @Nicholas: As discussed in some other JIRAs, the wire protocol isn't part of a compatibility guarantee yet. We don't expect HDFS clients to work between 0.21, 0.22, and 0.23 for various other reasons. So, I don't think this should be considered incompatible.

          Tsz Wo Nicholas Sze added a comment -

          @Todd, I have not checked the details, but I think this can be done without adding the new status. If that is true, wouldn't it be better?

          Todd Lipcon added a comment -

          I haven't looked at the patch in detail yet either. Just saying that we shouldn't consider protocol breaks incompatible, since they're pretty much guaranteed between versions (e.g., federation added block pool IDs, which broke every protocol in HDFS).

          Tsz Wo Nicholas Sze added a comment -

          Here is more detail.

          //In DataXceiver.opWriteBlock(..),
          
                  } catch (IOException e) {
                    if (isClient) {
                      ERROR.write(replyOut);
                      Text.writeString(replyOut, mirrorNode);
                      replyOut.flush();
                    }
          

          We may send an error message after ERROR.
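
          For example, a sketch of that idea (the extra field is the
          proposal, not existing code):

                  } catch (IOException e) {
                    if (isClient) {
                      ERROR.write(replyOut);
                      Text.writeString(replyOut, mirrorNode);
                      Text.writeString(replyOut, e.getMessage()); // proposed
                      replyOut.flush();
                    }
                  }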

          Tsz Wo Nicholas Sze added a comment -

          Jonathan, I don't think this is a "newbie" issue but if you are willing to spend time on it, you are welcome to continue working on this.

          Jonathan Hsieh added a comment -

          My concern with reusing one of the existing ERROR statuses is that it and its data fields may already have semantics assigned to it, and that changing it would potentially break existing error handling behavior. The fact that there are several other ERROR_* return codes seems to indicate that each has a different meaning. Because of this, my inclination was to add a new case and new behavior for that case instead of potentially causing compatibility problems by overriding existing cases.

          I'll spend some time looking at the code to see if co-opting the other ERROR message would cause any problems.

          Show
          Jonathan Hsieh added a comment - My concern with reusing one of the existing ERROR statuses is that it and its data fields may already have semantics assigned to it, and that changing it would potentially break existing error handling behavior. The fact that there are several other ERROR_* return codes seems to indicate that each has a different meaning. Because of this, my inclination was to add a new case and new behavior for that case instead of potentially causing compatibility problems by overriding existing cases. I'll spend some time looking at the code to see if co-opting the other ERROR message would cause a problems.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          ERROR is for generic errors (something like an internal server error), so there should be no problem using it here. The other ERROR_* codes are error types that the client should be aware of.

          Jonathan Hsieh added a comment -

          The new patch does not create a new ERROR status in the datanode protocol. The previous patch kept a count of the number of max-xceivers-exceeded errors; due to this change, it can now only count the number of "internal datanode errors".
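
          A plausible shape for that exception type (a sketch; the actual
          patch may differ):

            import java.io.IOException;

            /** Signals a generic internal error reported by a datanode. */
            public class InternalDataNodeException extends IOException {
              public InternalDataNodeException(String msg) {
                super(msg);
              }
            }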

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12480049/hdfs-1787.3.patch
          against trunk revision 1125879.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/608//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/608//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/608//console

          This message is automatically generated.

          Jonathan Hsieh added a comment -

          The test results link seems broken here (it points to results from the changes of a different patch).

          Also, TestDFSStorageStateRecovery seems to have been broken for a long time.

          Jonathan Hsieh added a comment -

          Actually, I had assumed that the tests focused on the patch listed in the 'changes' section of the Jenkins result for build 608. The run did in fact exercise the newly added test case from the HDFS-1787 patch.

          The org.apache.hadoop.hdfs.server.datanode.TestFiDataTransferProtocol2.pipeline_Fi_30 test seems to be failing intermittently. It also isn't reported by Hudson. Is there a reason why?

          Tsz Wo Nicholas Sze added a comment -
          • Why is DFSOutputStream.internalDNErrors an AtomicInteger while DFSInputStream.internalDNErrors is an int?
          • BlockReader.maxTransfersErrors is not used.
          • What if Text.readString(..) throws an exception?
            +      } else if (status == Status.ERROR)  {
            +        String errmsg = Text.readString(in);
            +        throw new InternalDataNodeException("Got error on read due to "
            
          • It is better to add a catch-block for InternalDataNodeException.
                 } catch (Throwable t) {
            +      // Special case where we transfer max transfers error information to client.
            +      try {
            +        if (t instanceof InternalDataNodeException) {
            +          // send error message.
            
          • New public methods need javadoc.
          • Please use junit 4.
          • Please avoid whitespace-only changes here, to ease the review.
          Jonathan Hsieh added a comment -

          Nicholas,

          Thanks for the review!

          Why DFSOutputStream.internalDNErrors is an AtomicInteger but DFSInputStream.internalDNErrors is an int?

          In the internalDNErrors case, the "Integer" is used by the ResponseProcessor, which is a private Daemon/thread class. I generally avoid non-static inner classes (hard to test) and had assumed that it was static. If it were, it would require a final reference. However, in this case it is a non-static inner class, so it can be just an int. Both you and ATM are correct. It has been changed.

          BlockReader.maxTransfersErrors is not used.

          removed

          What if Text.readString(..) throws an exception?
          + } else if (status == Status.ERROR) {
          + String errmsg = Text.readString(in);
          + throw new InternalDataNodeException("Got error on read due to "

          Text.readString can throw IOException. The InternalDataNodeException thrown on the next line is also a subclass of IOException, so behavior-wise it would essentially use the same error-recovery path.

          It is better to add a catch-block for InternalDataNodeException.
          } catch (Throwable t) {
          + // Special case where we transfer max transfers error information to client.
          + try {
          + if (t instanceof InternalDataNodeException) {
          + // send error message.

          Done.

          new public methods need javadoc.

          Ah, I missed one in InternalDataNodeException. Done.

          Please use junit 4.

          Done.

          Please don't make white space change here for easing the review.

          I missed a few, and will clean them up.

          Question: there was one checkstyle-type whitespace change that wasn't associated with a code change. Should that be reverted, or is it a welcome change?

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481376/hdfs-1787.5.patch
          against trunk revision 1130870.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/699//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/699//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/699//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          > Question: There was one checkstyle-type whitespace change that wasn't associated with code change. Should that be reverted or is that a welcome change?

          If whitespace/format changes make the patch harder to review, they are not welcome. If there is only one change, it is fine.

          > Text.readString can throw IOException. The InternalDataNodeException thrown on the next line is also a subclass of IOException. Behaviorwise it would essentially use the same error recovery path.

          However, we will lose information like socket addresses.

          Some comments:

          • Please combine them into one message.
            +          DFSClient.LOG.warn("Failed to connect to" + targetAddr +": "
            +              + ex.getMessage());
            +          DFSClient.LOG.warn(" Adding to deadNodes and continuing");
            
          • It is better to log the exception.
            +      } catch (IOException e) {
            +        // preserve previous semantics, eat the exception.
            +      }
            
          • Do we really need internalDNErrors and getInternalDNErrorCount()? They are only used in the tests.
          Jonathan Hsieh added a comment -

          > Text.readString can throw IOException. The InternalDataNodeException thrown on the next line is also a subclass of IOException. Behaviorwise it would essentially use the same error recovery path.

          However, we will loss the information like socket addresses.

          I believe this is already an error path, but I'll look into this more.

          Some comments:

          Please combine them into one message.

          +          DFSClient.LOG.warn("Failed to connect to" + targetAddr +": "
          +              + ex.getMessage());
          +          DFSClient.LOG.warn(" Adding to deadNodes and continuing");
          

          My plan is to add \n's to the log message.

          It is better to log the exception.
          +      } catch (IOException e) {
          +        // preserve previous semantics, eat the exception.
          +      }
          

          Will add logging.

          Do we really need internalDNErrors and getInternalDNErrorCount()? It is only used in the tests.

          Can you suggest an alternate mechanism for (automated) testing of the changes other than visual inspection of the logs?

          This tests that the error-messaging path was exercised, and it actually provides some information that may be useful in troubleshooting. I believe there are annotations in the works that semantically mean "public for testing but otherwise private/package". I believe the comment I added would make this reasonably easy to find when that gets integrated throughout.
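
          A sketch of that pattern using Guava's annotation (whether this
          exact annotation applies here is an assumption):

            import com.google.common.annotations.VisibleForTesting;

            class DFSOutputStreamSketch {
              private int internalDNErrors;

              /** Exposed only so tests can assert the error path ran. */
              @VisibleForTesting
              int getInternalDNErrorCount() {
                return internalDNErrors;
              }
            }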

          Tsz Wo Nicholas Sze added a comment -

          > I believe this is already an error path, but I'll look into this more.

          Yes, please see if you can test it and what the actual exceptions look like.

          > My plan is to add \n's to the log message.

          If it is separated into two calls, these two log messages may be interleaved with other log messages. How about adding a "\n" within the message and using one call?
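
          For example, a sketch of the combined call:

            DFSClient.LOG.warn("Failed to connect to " + targetAddr + ": "
                + ex.getMessage() + "\nAdding to deadNodes and continuing");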

          > Can you suggest an alternate mechanism for (automated) testing of the changes other than visual inspection of the logs?

          I think you might be able to use Mockito to test it. Not very sure.
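
          A rough sketch of that idea (the handler method name is
          hypothetical, not a real DFSInputStream API):

            import static org.mockito.Mockito.*;

            DFSInputStream in = spy(realInputStream);
            // drive a read against a cluster configured with a tiny
            // xceiver limit, then check the error path was exercised:
            verify(in, atLeastOnce()).handleInternalDataNodeError();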

          Jonathan Hsieh added a comment -

          Unassigning since I haven't had time to work on this and the infrastructure has changed underneath it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481376/hdfs-1787.5.patch
          against trunk revision 7711049.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9814//console

          This message is automatically generated.


            People

            • Assignee: Unassigned
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 11
