Hadoop Common
  1. Hadoop Common
  2. HADOOP-3894

DFSClient chould log errors better, and provide better diagnostics

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.19.1
    • Component/s: None
    • Labels:
      None

      Description

      In my test runs I see a stack trace from DFSClient, because it isn't logging through the log APIs in its close() method. It should use the logger, for better error reporting

      1. hadoop-3894.patch
        1 kB
        steve_l
      2. hadoop-3894.patch
        3 kB
        steve_l
      3. hadoop-3894.patch
        4 kB
        steve_l
      4. hadoop-3894.patch
        4 kB
        steve_l
      5. hadoop-3894.patch
        0.6 kB
        steve_l

        Issue Links

          Activity

          steve_l created issue -
          steve_l made changes -
          Field Original Value New Value
          Assignee Steve Loughran [ steve_l ]
          steve_l made changes -
          Attachment hadoop-3894.patch [ 12387468 ]
          Hide
          steve_l added a comment -

          Patch for the logging. No extra tests. If you run TestLocalDFS, you should no longer see a stack trace in the ant console, it should be in the log.

          Show
          steve_l added a comment - Patch for the logging. No extra tests. If you run TestLocalDFS, you should no longer see a stack trace in the ant console, it should be in the log.
          steve_l made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          steve_l added a comment -

          Adding a log() statement here is enough to trigger log4j to print a message complaining about log4j configuration in TestMiniMRMapRedDebugScript, so breaking its equality test for output. patching that test to look for the string in the output, instead of equality only.

          Show
          steve_l added a comment - Adding a log() statement here is enough to trigger log4j to print a message complaining about log4j configuration in TestMiniMRMapRedDebugScript, so breaking its equality test for output. patching that test to look for the string in the output, instead of equality only.
          steve_l made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          steve_l made changes -
          Summary DFSClient should log errors in close(), instead of printing to stderr DFSClient chould log errors better, and provide better diagnostics
          Hide
          steve_l added a comment -

          This updated patch

          DFDClient

          • routes stack traces through the logger on close()
          • includes the path to the file in question when raising an exception due to unable to get the blocks of a file. Instead of going
            "Could not get block locations. Aborting..." , it now includes the filename:
            Could not get block locations. Source file "/test/filename" - Aborting...
          • tweaks TestMiniMRMapRedDebugScript to
            -timeout on job execution.
            -looks for the target string using contains(), so log4j configuration complaints do not break the test
            -uses constants for the strings and sleep times
          Show
          steve_l added a comment - This updated patch DFDClient routes stack traces through the logger on close() includes the path to the file in question when raising an exception due to unable to get the blocks of a file. Instead of going "Could not get block locations. Aborting..." , it now includes the filename: Could not get block locations. Source file "/test/filename" - Aborting... tweaks TestMiniMRMapRedDebugScript to -timeout on job execution. -looks for the target string using contains(), so log4j configuration complaints do not break the test -uses constants for the strings and sleep times
          steve_l made changes -
          Attachment hadoop-3894.patch [ 12387631 ]
          steve_l made changes -
          Fix Version/s 0.19.0 [ 12313211 ]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387631/hadoop-3894.patch
          against trunk revision 683371.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3021/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3021/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3021/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3021/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12387631/hadoop-3894.patch against trunk revision 683371. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3021/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3021/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3021/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3021/console This message is automatically generated.
          Hide
          Chris Douglas added a comment -
          • This change should be reverted:
            @@ -157,6 +162,7 @@
                 // run the job. It will fail with IOException.
                 try {
                   job = new JobClient(conf).submitJob(conf);
            +      //fail("Expected job submission to fail here");
                 } catch (IOException e) {
                  LOG.info("Running Job failed", e);
                 }
            
          • Should TestMiniMRMapRedDebugScript manage its own timeouts? This seems like ant's responsibility, though removing the busy loop is clearly an improvement. The test takes less than a minute to run, but I'm not sure that taking longer than that necessarily implies failure, neither is it going to take considerably less time without the loop.
          • Why is it preferred to check that the output contains what was expected, rather than being exactly what was expected?
          • This adds a parameter to processDatanodeError, but doesn't reference it.

          I'm not sure I understand what this JIRA addresses...

          Show
          Chris Douglas added a comment - This change should be reverted: @@ -157,6 +162,7 @@ // run the job. It will fail with IOException. try { job = new JobClient(conf).submitJob(conf); + //fail("Expected job submission to fail here"); } catch (IOException e) { LOG.info("Running Job failed", e); } Should TestMiniMRMapRedDebugScript manage its own timeouts? This seems like ant's responsibility, though removing the busy loop is clearly an improvement. The test takes less than a minute to run, but I'm not sure that taking longer than that necessarily implies failure, neither is it going to take considerably less time without the loop. Why is it preferred to check that the output contains what was expected, rather than being exactly what was expected? This adds a parameter to processDatanodeError, but doesn't reference it. I'm not sure I understand what this JIRA addresses...
          Chris Douglas made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          steve_l added a comment -

          Chris, I'll look at this next week; I'm don't have access to my development machine right now.

          -Timeouts in tests are interesting. When Ant times out a test run it kills the JVM; you lose the trace of where things got to. Its much better to have the test routine doing its own timeouts and saying where they failed. But the timeouts should be
          1. shared across all tests for easy maintenance
          2. configurable by whoever runs the test
          which really means shared default timeouts that can be overridden from system properties, and hence in the IDE (useful for stepping through code) or build.properties

          I could make this a separate improvement issue if you like.

          -The reason I switched to using contains() in the text output rather than just equals() was that once log4j starts getting involved in DFSClient output, the output started to look different. I can't recall what, but I can capture it.

          Show
          steve_l added a comment - Chris, I'll look at this next week; I'm don't have access to my development machine right now. -Timeouts in tests are interesting. When Ant times out a test run it kills the JVM; you lose the trace of where things got to. Its much better to have the test routine doing its own timeouts and saying where they failed. But the timeouts should be 1. shared across all tests for easy maintenance 2. configurable by whoever runs the test which really means shared default timeouts that can be overridden from system properties, and hence in the IDE (useful for stepping through code) or build.properties I could make this a separate improvement issue if you like. -The reason I switched to using contains() in the text output rather than just equals() was that once log4j starts getting involved in DFSClient output, the output started to look different. I can't recall what, but I can capture it.
          Hide
          steve_l added a comment -

          This patch removes the spurious commented line.

          This patch does three things
          moves DFSClient to use the log API, rather than stderr when errors occur -and to print the filename that is at fault
          -tweaks the test case to look for the "Bailing out" message anywhere in the output, so if log4j decides to say something, it doesn't break the test.
          -adds timeouts if the tests take too long. Yes, Ant does this but ant just kills the process and destroys any XML content being generated, so tools like hudson dont deal with such deaths well.

          Show
          steve_l added a comment - This patch removes the spurious commented line. This patch does three things moves DFSClient to use the log API, rather than stderr when errors occur -and to print the filename that is at fault -tweaks the test case to look for the "Bailing out" message anywhere in the output, so if log4j decides to say something, it doesn't break the test. -adds timeouts if the tests take too long. Yes, Ant does this but ant just kills the process and destroys any XML content being generated, so tools like hudson dont deal with such deaths well.
          steve_l made changes -
          Attachment hadoop-3894.patch [ 12390379 ]
          steve_l made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Chris Douglas added a comment -
          • This still adds the String parameter to processDatanodeError that is never referenced in the body of that method
          • Does this test time out often? Is there a contract that this should take less than a minute to run? I completely sympathize with the impulse to retain the information that ant discards and something is probably amiss if this test hits the timeout, but if we're going to fail tests because they violate limits then those limits should be based on something quantifiable and intrinsic to the model we're testing.

          If the purpose of this issue is to use log4j instead of System.err to print the error in DFSClient::close, then this should probably be limited to changing that line and updating the unit test.

          Show
          Chris Douglas added a comment - This still adds the String parameter to processDatanodeError that is never referenced in the body of that method Does this test time out often? Is there a contract that this should take less than a minute to run? I completely sympathize with the impulse to retain the information that ant discards and something is probably amiss if this test hits the timeout, but if we're going to fail tests because they violate limits then those limits should be based on something quantifiable and intrinsic to the model we're testing. If the purpose of this issue is to use log4j instead of System.err to print the error in DFSClient::close, then this should probably be limited to changing that line and updating the unit test.
          Chris Douglas made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          steve_l added a comment -

          1. processDataNodeError does use the src parameter when creating an exception to forward up.

          lastException = new IOException("Could not get block locations. " +
          "Source file \""+src
          + "\" - Aborting...");

          Adding the file that caused the error is generally useful, as it tells the caller which file is playing up. That may be immediately obvious if using the command line tools, but not when you are working with an application that is issuing lots of calls. Having that filename in test results is invaluable.

          2. I managed to get the code to timeout while making changes in other parts of the system.

          Show
          steve_l added a comment - 1. processDataNodeError does use the src parameter when creating an exception to forward up. lastException = new IOException("Could not get block locations. " + "Source file \""+src + "\" - Aborting..."); Adding the file that caused the error is generally useful, as it tells the caller which file is playing up. That may be immediately obvious if using the command line tools, but not when you are working with an application that is issuing lots of calls. Having that filename in test results is invaluable. 2. I managed to get the code to timeout while making changes in other parts of the system.
          Hide
          Chris Douglas added a comment -

          The parameter added to processDatanodeError is not named src:

          @@ -2349,7 +2349,7 @@
               // threads and mark stream as closed. Returns true if we should
               // sleep for a while after returning from this call.
               //
          -    private boolean processDatanodeError(boolean hasError) {
          +    private boolean processDatanodeError(boolean hasError, String text) {
                 if (!hasError) {
                   return false;
                 }
          
          Show
          Chris Douglas added a comment - The parameter added to processDatanodeError is not named src: @@ -2349,7 +2349,7 @@ // threads and mark stream as closed. Returns true if we should // sleep for a while after returning from this call. // - private boolean processDatanodeError(boolean hasError) { + private boolean processDatanodeError(boolean hasError, String text) { if (!hasError) { return false; }
          Devaraj Das made changes -
          Fix Version/s 0.19.0 [ 12313211 ]
          steve_l made changes -
          Link This issue is duplicated by HADOOP-4965 [ HADOOP-4965 ]
          Hide
          steve_l added a comment -

          This is an updated version of the patch.

          1. the parameter to processDatanodeError() has been removed. I had intended it to show the source file, but as that is in the this.src field, it is unneeded. The src field is used directly in the error message to tell you which file is unreachable

          This code is a superset of HADOOP-4965; it adds the more detailed exception, and a test case that failed once the logging was added has been tweaked to catch the text and include a timeout on the submission.

          Show
          steve_l added a comment - This is an updated version of the patch. the parameter to processDatanodeError() has been removed. I had intended it to show the source file, but as that is in the this.src field, it is unneeded. The src field is used directly in the error message to tell you which file is unreachable This code is a superset of HADOOP-4965 ; it adds the more detailed exception, and a test case that failed once the logging was added has been tweaked to catch the text and include a timeout on the submission.
          steve_l made changes -
          Attachment hadoop-3894.patch [ 12397795 ]
          steve_l made changes -
          Fix Version/s 0.21.0 [ 12313563 ]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Konstantin Shvachko added a comment -

          Steve, I didn't know you worked on this problem 6 months ago. Would have saved me time...
          I see there is still an unresolved issue with timeouts in TestMiniMRMapRedDebugScript. We usually don't appreciate sleeps in unit tests.
          So I propose that you

          • isolate the TestMiniMRMapRedDebugScript issue into a separate jira if it is still current.
          • incorporate my changes to TestFileAppend3
          • and I'll commit the result so that we could close both issues.

          Let me know if the plan works for you and how soon it can be done.

          Show
          Konstantin Shvachko added a comment - Steve, I didn't know you worked on this problem 6 months ago. Would have saved me time... I see there is still an unresolved issue with timeouts in TestMiniMRMapRedDebugScript. We usually don't appreciate sleeps in unit tests. So I propose that you isolate the TestMiniMRMapRedDebugScript issue into a separate jira if it is still current. incorporate my changes to TestFileAppend3 and I'll commit the result so that we could close both issues. Let me know if the plan works for you and how soon it can be done.
          Hide
          Konstantin Shvachko added a comment -

          Steve, are you still on this?

          Show
          Konstantin Shvachko added a comment - Steve, are you still on this?
          Hide
          steve_l added a comment -

          Yes, I've just been making sure that all the tests are passing

          I will separate the timeout issue,though I do prefer having tests time out with meaningful diagnostics rather than killing the process.

          The logging issue there is probably due to me not having a log4.properties in my own hadoop.jar; it was interfering with other apps. I am probably the only person getting told off from Log4J for not having a config file, which is really something log4j should stop complaining about.

          Show
          steve_l added a comment - Yes, I've just been making sure that all the tests are passing I will separate the timeout issue,though I do prefer having tests time out with meaningful diagnostics rather than killing the process. The logging issue there is probably due to me not having a log4.properties in my own hadoop.jar; it was interfering with other apps. I am probably the only person getting told off from Log4J for not having a config file, which is really something log4j should stop complaining about.
          steve_l made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          steve_l added a comment -

          Patch to DFSClient only

          Show
          steve_l added a comment - Patch to DFSClient only
          steve_l made changes -
          Attachment hadoop-3894.patch [ 12398069 ]
          Hide
          steve_l added a comment -

          This is the patch for DFS Client
          1. Logs the error in log4j
          2. includes the underlying filename when there are problems with blocks. Handy that.

          Show
          steve_l added a comment - This is the patch for DFS Client 1. Logs the error in log4j 2. includes the underlying filename when there are problems with blocks. Handy that.
          steve_l made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Konstantin Shvachko added a comment -

          I just committed this. Thank you Steve.

          Show
          Konstantin Shvachko added a comment - I just committed this. Thank you Steve.
          Konstantin Shvachko made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Fix Version/s 0.21.0 [ 12313563 ]
          Fix Version/s 0.19.1 [ 12313473 ]
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]
          Gavin made changes -
          Reporter Steve Loughran [ steve_l ] Steve Loughran [ stevel@apache.org ]
          Gavin made changes -
          Assignee Steve Loughran [ steve_l ] Steve Loughran [ stevel@apache.org ]

            People

            • Assignee:
              Steve Loughran
              Reporter:
              Steve Loughran
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development