Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: metrics
    • Labels: None

      Description

      For HDFS-5350 we'd like to report the last few fsimage transfer times as a health metric. This would mean, for example, a sliding window of the last 10 transfer times, plus when the window was last updated and the total count of transfers. It'd be nice to have a metrics class that did this.
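To make the first request concrete, here is a minimal sketch of a count-based sliding window that keeps the last N values along with a last-updated timestamp and a running total. The class name `RecentValues` and its methods are hypothetical; nothing here is an existing metrics2 class, and timestamps are passed in explicitly as an assumption to keep the example easy to test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: remembers the most recent N values. */
class RecentValues {
  private final int capacity;
  private final Deque<Long> window = new ArrayDeque<>();
  private long totalCount = 0;
  private long lastUpdatedMillis = -1; // -1 means "never updated"

  RecentValues(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Records one value, evicting the oldest once the window is full. */
  synchronized void add(long value, long nowMillis) {
    if (window.size() == capacity) {
      window.removeFirst();
    }
    window.addLast(value);
    totalCount++;
    lastUpdatedMillis = nowMillis;
  }

  /** Returns the retained values, oldest first. */
  synchronized long[] snapshot() {
    long[] out = new long[window.size()];
    int i = 0;
    for (long v : window) {
      out[i++] = v;
    }
    return out;
  }

  /** Number of values ever recorded, including evicted ones. */
  synchronized long getTotalCount() {
    return totalCount;
  }

  synchronized long getLastUpdatedMillis() {
    return lastUpdatedMillis;
  }
}
```

For the fsimage case, something like `new RecentValues(10)` with `add(transferMillis, System.currentTimeMillis())` after each transfer would cover the three pieces of information mentioned above.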

      It'd also be interesting to have some kind of time-based sliding window for statistics like counts and averages. This would let us answer questions like "how many RPCs happened in the last 10 seconds? Minute? 5 minutes? 10 minutes?". Metrics like counts and averages (kept as a sum plus a count) are easy to aggregate in this fashion, since partial results can be combined in any order.
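One possible shape for the time-based window is a ring of fixed-width buckets, each holding a count and a sum, so that any window up to the ring's span can be answered by adding buckets together. This is only a sketch under assumptions: `TimeWindowStats` and its methods are hypothetical names, timestamps are non-negative milliseconds supplied by the caller, and answers are only as precise as the bucket width.

```java
import java.util.Arrays;

/** Hypothetical helper: per-bucket counts and sums over a ring of time buckets. */
class TimeWindowStats {
  private final long bucketMillis;
  private final long[] bucketIds; // which time bucket each slot currently holds
  private final long[] counts;
  private final double[] sums;

  TimeWindowStats(int numBuckets, long bucketMillis) {
    if (numBuckets <= 0 || bucketMillis <= 0) {
      throw new IllegalArgumentException("sizes must be positive");
    }
    this.bucketMillis = bucketMillis;
    this.bucketIds = new long[numBuckets];
    this.counts = new long[numBuckets];
    this.sums = new double[numBuckets];
    Arrays.fill(bucketIds, Long.MIN_VALUE); // marks slots as unused
  }

  /** Adds one sample; a slot is reset when it is reused for a newer bucket. */
  synchronized void record(double value, long nowMillis) {
    long id = nowMillis / bucketMillis;
    int slot = (int) (id % bucketIds.length);
    if (bucketIds[slot] != id) {
      bucketIds[slot] = id;
      counts[slot] = 0;
      sums[slot] = 0.0;
    }
    counts[slot]++;
    sums[slot] += value;
  }

  /** Number of samples in the last windowMillis, at bucket granularity. */
  synchronized long count(long windowMillis, long nowMillis) {
    long newest = nowMillis / bucketMillis;
    long oldest = oldestBucket(windowMillis, newest);
    long total = 0;
    for (int i = 0; i < bucketIds.length; i++) {
      if (bucketIds[i] >= oldest && bucketIds[i] <= newest) {
        total += counts[i];
      }
    }
    return total;
  }

  /** Mean of the samples in the last windowMillis, or 0.0 if there are none. */
  synchronized double mean(long windowMillis, long nowMillis) {
    long newest = nowMillis / bucketMillis;
    long oldest = oldestBucket(windowMillis, newest);
    long n = 0;
    double sum = 0.0;
    for (int i = 0; i < bucketIds.length; i++) {
      if (bucketIds[i] >= oldest && bucketIds[i] <= newest) {
        n += counts[i];
        sum += sums[i];
      }
    }
    return n == 0 ? 0.0 : sum / n;
  }

  // The window can never cover more buckets than the ring holds.
  private long oldestBucket(long windowMillis, long newest) {
    long covered = Math.min(Math.max(windowMillis / bucketMillis, 1), bucketIds.length);
    return newest - covered + 1;
  }
}
```

With 600 one-second buckets, a single instance could answer the 10 second, 1 minute, 5 minute, and 10 minute questions from the same data.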

          Activity

          Steve Loughran added a comment -

          Some general moving average classes would be convenient for a lot of things.

          For stats collection on openstack HTTP operations I added one to calculate the ongoing mean & variance: https://github.com/apache/hadoop-common/blob/trunk/hadoop-tools/hadoop-openstack/src/main/java/org/apache/hadoop/fs/swift/util/DurationStats.java
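For readers who have not seen the technique, an ongoing mean and variance can be kept in constant space with Welford's online update. The sketch below illustrates that general method; it is not the code of the linked DurationStats class, and the name `RunningStats` is hypothetical.

```java
/** Hypothetical helper: running mean and sample variance via Welford's method. */
class RunningStats {
  private long n = 0;
  private double mean = 0.0;
  private double m2 = 0.0; // sum of squared deviations from the current mean

  void add(double x) {
    n++;
    double delta = x - mean;
    mean += delta / n;
    // Uses the updated mean, which keeps the update numerically stable.
    m2 += delta * (x - mean);
  }

  long getCount() {
    return n;
  }

  double getMean() {
    return mean;
  }

  /** Sample variance (n - 1 denominator); 0.0 with fewer than two samples. */
  double getVariance() {
    return n > 1 ? m2 / (n - 1) : 0.0;
  }
}
```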

          For Hoya I'm doing something more complicated, where I want some kind of half-life on failure rates to assess the reliability of nodes in the cluster. For long-lived clusters, cumulative counts are the wrong approach. All long-lived YARN services are going to need this.
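A half-life on failure counts can be sketched as an exponentially decaying counter: each failure adds 1, and the stored value halves for every half-life that passes. This is an illustration of the idea only, not the Hoya implementation; `DecayingCounter` and its methods are hypothetical, and the caller is assumed to supply monotonically increasing timestamps.

```java
/** Hypothetical helper: a counter whose value halves every halfLifeMillis. */
class DecayingCounter {
  private final double halfLifeMillis;
  private double value = 0.0;
  private long lastMillis;

  DecayingCounter(long halfLifeMillis, long startMillis) {
    if (halfLifeMillis <= 0) {
      throw new IllegalArgumentException("half-life must be positive");
    }
    this.halfLifeMillis = halfLifeMillis;
    this.lastMillis = startMillis;
  }

  /** Records one event (for example, a node failure) at the given time. */
  synchronized void increment(long nowMillis) {
    decayTo(nowMillis);
    value += 1.0;
  }

  /** Returns the decayed value as of the given time. */
  synchronized double get(long nowMillis) {
    decayTo(nowMillis);
    return value;
  }

  // Applies value *= 0.5^(elapsed / halfLife); ignores non-advancing clocks.
  private void decayTo(long nowMillis) {
    long elapsed = nowMillis - lastMillis;
    if (elapsed > 0) {
      value *= Math.pow(0.5, elapsed / halfLifeMillis);
      lastMillis = nowMillis;
    }
  }
}
```

Unlike a cumulative count, a node that failed often last month but has been healthy since drifts back toward zero, which is the property a long-lived service would want when ranking node reliability.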

          Andrew Wang added a comment -

          Thanks for the pointer, Steve. That DurationStats class is basically what I want to do, except with a sliding window.

          For failures, I'd personally first try to do the simple-stupid thing and log all of them. Hopefully failures aren't such a common occurrence that this is an issue.

          My current issue for this JIRA though is getting metrics2 to output something that is not a Number. I really want to output a Long[] or Double[], and I could see utility in String too. JMXJsonServlet seems like it'll handle it okay, but this might not be a compatible change since Number is part of AbstractMetric.


            People

            • Assignee:
              Andrew Wang
              Reporter:
              Andrew Wang
            • Votes:
              0
              Watchers:
              3
