Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: task
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      It's easy to grab the number of millis spent in GC (see JvmMetrics for example). Exposing these as task counters would be handy - occasionally I've seen user jobs where long GC pauses cause big "unexplainable" performance problems, and a large counter would make it obvious to the user what's going on.

      1. MAPREDUCE-1304.patch
        7 kB
        Aaron Kimball
      2. MAPREDUCE-1304.3.patch
        9 kB
        Aaron Kimball
      3. MAPREDUCE-1304.2.patch
        7 kB
        Aaron Kimball

        Issue Links

          Activity

          Hide
          Arun C Murthy added a comment -

          +1, this is useful information.

          Show
          Arun C Murthy added a comment - +1, this is useful information.
          Hide
          Aaron Kimball added a comment -

          Added a new TaskCounter, GC_TIME_MILLIS, which tracks this information. Patch also includes a testcase.

          Show
          Aaron Kimball added a comment - Added a new TaskCounter, GC_TIME_MILLIS, which tracks this information. Patch also includes a testcase.
          Hide
          Todd Lipcon added a comment -

          Two questions:

          • I don't know the Task code quite well enough, but will this still be set in the case that the task fails?
          • Is it worth putting this in a separate thread in the Child interface so it actually gets incremented as the task goes, with each umbilical heartbeat? It would be nice to see this on a per-task basis when a task appears to be "stuck".
          Show
          Todd Lipcon added a comment - Two questions: I don't know the Task code quite well enough, but will this still be set in the case that the task fails? Is it worth putting this in a separate thread in the Child interface so it actually gets incremented as the task goes, with each umbilical heartbeat? It would be nice to see this on a per-task basis when a task appears to be "stuck".
          Hide
          Aaron Kimball added a comment -

          You're right that if a task throws an exception, I think that it will probably not set the counter. I could put the increment in a finally block and that would fix that issue.

          Based on a quick look at the code in Counters and Counter, I think that there wouldn't be major issues with thread safety or performance (every operation on a Counter is already synchronized). An extra thread is pretty heavy-weight, though. I think something like this should actually go into the TaskReporter itself; it could just increment the gc counter itself right before sending a status every 3 seconds.

          I ran a quick benchmark test which got a handle to the GarbageCollectorMXBean list and polled them; I put all of this in a loop and ran it a million times in about a half-second, so I don't think this would negatively impact performance.

          Show
          Aaron Kimball added a comment - You're right that if a task throws an exception, I think that it will probably not set the counter. I could put the increment in a finally block and that would fix that issue. Based on a quick look at the code in Counters and Counter, I think that there wouldn't be major issues with thread safety or performance (every operation on a Counter is already synchronized). An extra thread is pretty heavy-weight, though. I think something like this should actually go into the TaskReporter itself; it could just increment the gc counter itself right before sending a status every 3 seconds. I ran a quick benchmark test which got a handle to the GarbageCollectorMXBean list and polled them; I put all of this in a loop and ran it a million times in about a half-second, so I don't think this would negatively impact performance.
          Hide
          Todd Lipcon added a comment -

          I think something like this should actually go into the TaskReporter itself

          Seems reasonable enough to me.

          Show
          Todd Lipcon added a comment - I think something like this should actually go into the TaskReporter itself Seems reasonable enough to me.
          Hide
          Aaron Kimball added a comment -

          New patch moves this logic into TaskReporter so that it happens every status update interval.

          Show
          Aaron Kimball added a comment - New patch moves this logic into TaskReporter so that it happens every status update interval.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12437301/MAPREDUCE-1304.2.patch
          against trunk revision 916823.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/340/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/340/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/340/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/340/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12437301/MAPREDUCE-1304.2.patch against trunk revision 916823. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/340/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/340/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/340/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/340/console This message is automatically generated.
          Hide
          Aaron Kimball added a comment -

          Test failure is unrelated

          Show
          Aaron Kimball added a comment - Test failure is unrelated
          Hide
          Chris Douglas added a comment -

          Great idea. A few comments on the impl:

          • This has no effect:
            +    job.getConfiguration().set("io.sort.record.pct", "0.50");
            
          • Moving the update of the GC counter to Task::updateCounters makes sense. To capture any FileSystem activity in the committer, the call to updateCounters in Task::done should be after the commit, anyway. This is a bug in the current code.
          • Adding java.lang.management components shouldn't be widening the TaskReporter API and adding fields. There should be an *Updater object tracking the particular statistic being tracked (as in the FileSystem counters) rather than letting the TaskReporter keep this state.
          • Updating the GC counters must be synchronized or it risks reporting incorrect results (moving the update to Task::updateCounters would be sufficient).
          Show
          Chris Douglas added a comment - Great idea. A few comments on the impl: This has no effect: + job.getConfiguration().set("io.sort.record.pct", "0.50"); Moving the update of the GC counter to Task::updateCounters makes sense. To capture any FileSystem activity in the committer, the call to updateCounters in Task::done should be after the commit, anyway. This is a bug in the current code. Adding java.lang.management components shouldn't be widening the TaskReporter API and adding fields. There should be an *Updater object tracking the particular statistic being tracked (as in the FileSystem counters) rather than letting the TaskReporter keep this state. Updating the GC counters must be synchronized or it risks reporting incorrect results (moving the update to Task::updateCounters would be sufficient).
          Hide
          Aaron Kimball added a comment -

          New patch incorporates CR comments.

          I added a call to updateCounters() after the reporter thread is shut down, where the incrementGcCounter() call used to be.

          Show
          Aaron Kimball added a comment - New patch incorporates CR comments. I added a call to updateCounters() after the reporter thread is shut down, where the incrementGcCounter() call used to be.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12439944/MAPREDUCE-1304.3.patch
          against trunk revision 928104.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/65/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/65/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/65/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/65/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12439944/MAPREDUCE-1304.3.patch against trunk revision 928104. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/65/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/65/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/65/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/65/console This message is automatically generated.
          Hide
          Chris Douglas added a comment -

          +1

          I committed this. Thanks, Aaron!

          Show
          Chris Douglas added a comment - +1 I committed this. Thanks, Aaron!

            People

            • Assignee:
              Aaron Kimball
              Reporter:
              Todd Lipcon
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development