Hadoop Common
  1. Hadoop Common
  2. HADOOP-3422

Ganglia counter metrics are all reported with the metric name "value", so the counter values can not be seen

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.2, 0.18.0, 0.18.1, 0.18.2
    • Fix Version/s: 0.20.0
    • Component/s: metrics
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Changed names of ganglia metrics to avoid conflicts and to better identify source function.

      Description

      The JobInProgress class reports all metrics with the name "value". The FileMetrics class puts all of the tags into the name when reporting the individual values, but the Ganglia Context does not put the tags into the name..

      This patch modifies the context to build names for the counter metrics out of the tag values. This enables the user to see the indivdual counter values with the ganglia web tool, on a per job basis

      1. diff-20080520-1025.txt
        3 kB
        Jason
      2. ganglia-patch-3422-and-4137.patch
        4 kB
        Brian Bockelman
      3. ganglia-patch-3422-and-4137.patch
        4 kB
        Brian Bockelman
      4. ganglia-patch-3422-qand-4137-1.patch
        2 kB
        Jason
      5. ganglia-patch-3422-updated.patch
        3 kB
        Brian Bockelman
      6. ganglia-patch-3422-updated-v2.patch
        5 kB
        stack

        Issue Links

          Activity

          Hide
          Jason added a comment -

          A patch against trunk to provide this.
          Modifies JobInProgress and GangliaContext

          Show
          Jason added a comment - A patch against trunk to provide this. Modifies JobInProgress and GangliaContext
          Hide
          Nigel Daley added a comment -

          This is not a blocker for 0.17.0 (for which a release candidate is being voted on). Moving to 0.18

          Show
          Nigel Daley added a comment - This is not a blocker for 0.17.0 (for which a release candidate is being voted on). Moving to 0.18
          Hide
          Doug Cutting added a comment -

          Did these counters work with Ganglia in a prior release? If so, then this is a regression and might be a candidate for 0.17.1...

          Show
          Doug Cutting added a comment - Did these counters work with Ganglia in a prior release? If so, then this is a regression and might be a candidate for 0.17.1...
          Hide
          Jason added a comment -

          They did not work correctly in 0.15.3 or in 0.16.0. It is a very minor change and I think people would like it.

          The only other issue i currently have is that i have to set up my hadoop-metrics.properties files such that the report to node:
          jvm.servers=HOST:8649 ..., where HOST is the head node, if i use localhost, the metrics end up all in a bucket under the psuedo machine localhost.localdomain
          If the HOST is a real host, they show up under the node's statistics

          Show
          Jason added a comment - They did not work correctly in 0.15.3 or in 0.16.0. It is a very minor change and I think people would like it. The only other issue i currently have is that i have to set up my hadoop-metrics.properties files such that the report to node: jvm.servers=HOST:8649 ..., where HOST is the head node, if i use localhost, the metrics end up all in a bucket under the psuedo machine localhost.localdomain If the HOST is a real host, they show up under the node's statistics
          Hide
          Jason added a comment -

          The names being produced for ganglia are long and tend to overflow the bounding box in the graph.
          mapred.job.FQCN.COUNTER_NAME
          where FQCN is the fully qualified class that contains the counter ie: org.apache.hadoop.mapred.XXX.YYY

          And it would be nice to get this from the individual machines as well as the cluster agreggate so you could tell at a glance how the individual machines were doing. I suspect that will be annother patch for that new feature.

          Show
          Jason added a comment - The names being produced for ganglia are long and tend to overflow the bounding box in the graph. mapred.job.FQCN.COUNTER_NAME where FQCN is the fully qualified class that contains the counter ie: org.apache.hadoop.mapred.XXX.YYY And it would be nice to get this from the individual machines as well as the cluster agreggate so you could tell at a glance how the individual machines were doing. I suspect that will be annother patch for that new feature.
          Hide
          dhruba borthakur added a comment -

          +1. This patch is very helpful. Are there any unit tests that tests the HadoopMetrics API?

          Show
          dhruba borthakur added a comment - +1. This patch is very helpful. Are there any unit tests that tests the HadoopMetrics API?
          Hide
          Lohit Vijayarenu added a comment -

          Are there any unit tests that tests the HadoopMetrics API?

          Not as of now. Have opened a JIRA HADOOP-3634 for this. Should be easy if we model something along the lines of FileContext.

          Show
          Lohit Vijayarenu added a comment - Are there any unit tests that tests the HadoopMetrics API? Not as of now. Have opened a JIRA HADOOP-3634 for this. Should be easy if we model something along the lines of FileContext.
          Hide
          Lohit Vijayarenu added a comment -

          Trunk has changed, would it be possible to regenerate this patch? We can make it PA once done.

          Show
          Lohit Vijayarenu added a comment - Trunk has changed, would it be possible to regenerate this patch? We can make it PA once done.
          Hide
          Brian Bockelman added a comment -

          Patch against more recent trunk (rev 701680). Combined the existing patch with the one submitter for #HADOOP-4137; without both, Ganglia metrics appear to be broken.

          Show
          Brian Bockelman added a comment - Patch against more recent trunk (rev 701680). Combined the existing patch with the one submitter for # HADOOP-4137 ; without both, Ganglia metrics appear to be broken.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12391475/ganglia-patch-3422-and-4137.patch
          against trunk revision 701476.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3433/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3433/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3433/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3433/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12391475/ganglia-patch-3422-and-4137.patch against trunk revision 701476. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3433/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3433/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3433/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3433/console This message is automatically generated.
          Hide
          Brian Bockelman added a comment -

          Updated patch: Silly me, Ganglia has no long data type - and I mapped it to int32, which overflows quickly (especially when reporting bytes transferred!).

          This now maps Long to float in Ganglia.

          Show
          Brian Bockelman added a comment - Updated patch: Silly me, Ganglia has no long data type - and I mapped it to int32, which overflows quickly (especially when reporting bytes transferred!). This now maps Long to float in Ganglia.
          Hide
          stack added a comment -

          I took a look at the patch. You have this in a few places '// TODO: Log this.'. Why not log the exception?

          You keep a ThreadLocal StringBuffer to save on construction cost? Could you use a StringBuilder instead of a StringBuffer?

          Otherwise, this patch is great. I tried it. Ganglia graphs work with this in place.

          Show
          stack added a comment - I took a look at the patch. You have this in a few places '// TODO: Log this.'. Why not log the exception? You keep a ThreadLocal StringBuffer to save on construction cost? Could you use a StringBuilder instead of a StringBuffer? Otherwise, this patch is great. I tried it. Ganglia graphs work with this in place.
          Hide
          Brian Bockelman added a comment -

          Hey stack,

          1) I didn't add the log statements because I don't know the "preferred" way to add a logger to classes for this project. I hoped someone who knew would come along and help out.
          2) I don't know the logic behind the ThreadLocal StringBuffer stuff. Adoption from the original patch.
          3) The update to the patch which maps Long to float is bust. The class doesn't have a way to serialize floats with XDR, so I'm not sure what would happen.

          For now, use the original, not the updated patch - having the buffer overflow is better than nothing at all with floats.

          Show
          Brian Bockelman added a comment - Hey stack, 1) I didn't add the log statements because I don't know the "preferred" way to add a logger to classes for this project. I hoped someone who knew would come along and help out. 2) I don't know the logic behind the ThreadLocal StringBuffer stuff. Adoption from the original patch. 3) The update to the patch which maps Long to float is bust. The class doesn't have a way to serialize floats with XDR, so I'm not sure what would happen. For now, use the original, not the updated patch - having the buffer overflow is better than nothing at all with floats.
          Hide
          stack added a comment -

          Good stuff Brian.

          1). Just by way of FYI, see MetricsUtil in same package for how to add logging. Import two classes, then just inside your class opening, add something like:

            private static final Log LOG =
              LogFactory.getLog("org.apache.hadoop.util.GangliaContext");
          

          In code, do something like this: LOG.warn("Type is null"); ... and so on.

          2). Let me write Jason and see what he was up to with the ThreadLocal. I don't understand. Its not keeping any state so why bother with it.
          3). Doesn't your patch also address HADOOP-4137 whereas diff-20080520-1025.txt does not?

          Show
          stack added a comment - Good stuff Brian. 1). Just by way of FYI, see MetricsUtil in same package for how to add logging. Import two classes, then just inside your class opening, add something like: private static final Log LOG = LogFactory.getLog( "org.apache.hadoop.util.GangliaContext" ); In code, do something like this: LOG.warn("Type is null"); ... and so on. 2). Let me write Jason and see what he was up to with the ThreadLocal. I don't understand. Its not keeping any state so why bother with it. 3). Doesn't your patch also address HADOOP-4137 whereas diff-20080520-1025.txt does not?
          Hide
          Jason added a comment -

          This patch removes the thread local code and uses StringBuilder instead of StringBuffer.

          The core ganglia pieces are unchanged.

          At this point an outstanding issue is that it is not possible to distinguish the metrics from different jvms.

          Show
          Jason added a comment - This patch removes the thread local code and uses StringBuilder instead of StringBuffer. The core ganglia pieces are unchanged. At this point an outstanding issue is that it is not possible to distinguish the metrics from different jvms.
          Hide
          Chris Douglas added a comment -

          What is the status of this issue and HADOOP-4137? The latter patch has been absorbed into the one posted here, but the two fixes appear independent (and the latest conflict). I'm canceling both patches for now, since it is unclear whether either is to be committed.

          Show
          Chris Douglas added a comment - What is the status of this issue and HADOOP-4137 ? The latter patch has been absorbed into the one posted here, but the two fixes appear independent (and the latest conflict). I'm canceling both patches for now, since it is unclear whether either is to be committed.
          Hide
          Brian Bockelman added a comment -

          The attached patch does the following:
          1) Removes the TODO lines by adding the logger.
          2) Incorporates HADOOP-4137. I'll mark that one as a duplicate of this.
          3) Includes Jason's updated patch.

          Builds cleanly against trunk.

          Show
          Brian Bockelman added a comment - The attached patch does the following: 1) Removes the TODO lines by adding the logger. 2) Incorporates HADOOP-4137 . I'll mark that one as a duplicate of this. 3) Includes Jason's updated patch. Builds cleanly against trunk.
          Hide
          stack added a comment -

          Couple of nitpicks: Use StringBuilder instead of StringBuffer and the set of StringBuffer length to zero on line #47 is not needed since you just created a new instance. Also, make comment on line #45 < 80 characters. Otherwise, I tried the patch against 0.19.0. All works.

          What you think we should do about rpc and jvm 'record' names? They both use 'metrics'. Its kinda useless. Means all rpc and jvm metrics get lumped together. Other metrics records are better named: e.g. 'datanode', 'namenode', etc. Should we fix this as part of this patch or in a new issue?

          The jvm metrics record might be named for the application its hosting. And rpc the same?

          Show
          stack added a comment - Couple of nitpicks: Use StringBuilder instead of StringBuffer and the set of StringBuffer length to zero on line #47 is not needed since you just created a new instance. Also, make comment on line #45 < 80 characters. Otherwise, I tried the patch against 0.19.0. All works. What you think we should do about rpc and jvm 'record' names? They both use 'metrics'. Its kinda useless. Means all rpc and jvm metrics get lumped together. Other metrics records are better named: e.g. 'datanode', 'namenode', etc. Should we fix this as part of this patch or in a new issue? The jvm metrics record might be named for the application its hosting. And rpc the same?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12394071/ganglia-patch-3422-updated.patch
          against trunk revision 719152.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3611/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3611/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3611/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3611/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12394071/ganglia-patch-3422-updated.patch against trunk revision 719152. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3611/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3611/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3611/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3611/console This message is automatically generated.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Seems that the failed test, TestFsck, is not robust. There was another failing in HADOOP-4647. I believe both of the failing cases are not related to the patches posted.

          Show
          Tsz Wo Nicholas Sze added a comment - Seems that the failed test, TestFsck, is not robust. There was another failing in HADOOP-4647 . I believe both of the failing cases are not related to the patches posted.
          Hide
          stack added a comment -

          I agree the test failure is unrelated but would like to post another patch anyways to address comments above.

          Show
          stack added a comment - I agree the test failure is unrelated but would like to post another patch anyways to address comments above.
          Hide
          stack added a comment -

          Version of patch that addresses the above comments (Uses StringBuilder instead of StringBuffer, removes extraneous setLength(0), etc.). Also adds overload of JVMMetrics#init method so can pass name of record to use. Default is to use the old 'metrics' name. Overload makes it so can pass in alternate record name. This makes it so can have metrics per jvm rather than have all jvm metrics lumped together. For example, in hbase, we can now have graphs for both the master and regionserver's jvms show in ganglia. Doing similar for rpc didn't make as much sense since method names are shown (and override of RpcMetrics creation is not possible currently).

          Show
          stack added a comment - Version of patch that addresses the above comments (Uses StringBuilder instead of StringBuffer, removes extraneous setLength(0), etc.). Also adds overload of JVMMetrics#init method so can pass name of record to use. Default is to use the old 'metrics' name. Overload makes it so can pass in alternate record name. This makes it so can have metrics per jvm rather than have all jvm metrics lumped together. For example, in hbase, we can now have graphs for both the master and regionserver's jvms show in ganglia. Doing similar for rpc didn't make as much sense since method names are shown (and override of RpcMetrics creation is not possible currently).
          Hide
          stack added a comment -

          Try hudson w/ new version of patch.

          Show
          stack added a comment - Try hudson w/ new version of patch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12394440/ganglia-patch-3422-updated-v2.patch
          against trunk revision 719787.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3635/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3635/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3635/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3635/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12394440/ganglia-patch-3422-updated-v2.patch against trunk revision 719787. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3635/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3635/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3635/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3635/console This message is automatically generated.
          Hide
          Brian Bockelman added a comment -

          I set HADOOP-4675 to depend on this, as we can't start supporting Ganglia 3.1 until these basic Ganglia issues are resolved.

          I like stack's refinements. Am I allowed to vote +1 on it if I helped push the patch along? I guess probably not...

          Show
          Brian Bockelman added a comment - I set HADOOP-4675 to depend on this, as we can't start supporting Ganglia 3.1 until these basic Ganglia issues are resolved. I like stack's refinements. Am I allowed to vote +1 on it if I helped push the patch along? I guess probably not...
          Hide
          Doug Cutting added a comment -

          > Am I allowed to vote +1 on it if I helped push the patch along? I guess probably not...

          Yes, of course you are allowed to vote. In the case of a dispute, PMC votes trump contributors, but everyone is encouraged to vote, and their opinions are taken seriously.

          Show
          Doug Cutting added a comment - > Am I allowed to vote +1 on it if I helped push the patch along? I guess probably not... Yes, of course you are allowed to vote. In the case of a dispute, PMC votes trump contributors, but everyone is encouraged to vote, and their opinions are taken seriously.
          Hide
          stack added a comment -

          Thanks for the review Brian.

          Unless objection, I'll commit to TRUNK in the next day or so.

          Should I commit to branch-0.19 too? I'd like to but its only a 'major' bug, not critical nor a blocker.

          Show
          stack added a comment - Thanks for the review Brian. Unless objection, I'll commit to TRUNK in the next day or so. Should I commit to branch-0.19 too? I'd like to but its only a 'major' bug, not critical nor a blocker.
          Hide
          stack added a comment -

          Committed. Thanks for the patch Jason and Brian.

          Show
          stack added a comment - Committed. Thanks for the patch Jason and Brian.
          Hide
          Robert Chansler added a comment -

          Edit release note for publication.

          jvm.metrics.*
          mapred.job.Job Counters .*
          mapred.job.Map-Reduce Framework.*
          mapred.job.Map-Reduce Framework.*
          mapred.job.Map-Reduce Framework.*
          mapred.jobtracker.*
          mapred.shuffleInput.*
          mapred.tasktracker.*

          instead of just using the last token name, or in the case of Per Job Counters, the string value, for all counters.

          The net result of this change is that the metrics names make it clear which part of hadoop is reporting, and the counter values each have their own RRD file instead of all the counters being aggregated under the name "value"

          Show
          Robert Chansler added a comment - Edit release note for publication. jvm.metrics.* mapred.job.Job Counters .* mapred.job.Map-Reduce Framework.* mapred.job.Map-Reduce Framework.* mapred.job.Map-Reduce Framework.* mapred.jobtracker.* mapred.shuffleInput.* mapred.tasktracker.* instead of just using the last token name, or in the case of Per Job Counters, the string value, for all counters. The net result of this change is that the metrics names make it clear which part of hadoop is reporting, and the counter values each have their own RRD file instead of all the counters being aggregated under the name "value"

            People

            • Assignee:
              stack
              Reporter:
              Jason
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 1h
                1h
                Remaining:
                Remaining Estimate - 1h
                1h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development