Details

      Description

      Off the top of my head, I can think of:

      NN metrics:

      • A binary metric for active or standby
      • The size of the pending DN message queues
      • A timestamp for when the standby NN last read from shared edit log
      • The difference between highest generation stamp seen from the shared edit log and the highest generation stamp seen from any DN
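As a rough illustration of the first three NN metrics in the list above, here is a minimal plain-Java sketch. The class `HaMetricsSketch` and all of its method names are hypothetical and are not taken from the patch; the choice to report 0 on an active NN is also an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical holder for NN-side HA gauges; not the Hadoop implementation. */
public class HaMetricsSketch {
    public enum HaState { ACTIVE, STANDBY }

    private volatile HaState state = HaState.STANDBY;
    private final AtomicLong pendingDnMessages = new AtomicLong();
    private volatile long lastLoadedEditsMillis;

    public void setState(HaState s) { state = s; }
    public void dnMessageQueued() { pendingDnMessages.incrementAndGet(); }
    public void dnMessageProcessed() { pendingDnMessages.decrementAndGet(); }
    public void editsLoadedAt(long nowMillis) { lastLoadedEditsMillis = nowMillis; }

    /** Binary gauge: 1 when active, 0 when standby. */
    public int getActiveGauge() { return state == HaState.ACTIVE ? 1 : 0; }

    /** Number of DN messages queued but not yet processed. */
    public long getPendingDnMessageCount() { return pendingDnMessages.get(); }

    /** Age of the last edit-log read; assumed to be 0 on an active NN. */
    public long getMillisSinceLastLoadedEdits(long nowMillis) {
        return state == HaState.STANDBY ? nowMillis - lastLoadedEditsMillis : 0;
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a real gauge would presumably read the clock itself.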

      It would probably also be useful to have a DN metric that describes which active/standby NNs it's talking to, e.g. "time since last communicated with standby/active NNs."

      I'm sure there are others as well. Comments strongly encouraged.

      1. HDFS-2510.HDFS-1623.patch
        7 kB
        Aaron T. Myers
      2. HDFS-2510-HDFS-1623.patch
        7 kB
        Aaron T. Myers

        Activity

        hudson Hudson added a comment -

        Integrated in Hadoop-Hdfs-HAbranch-build #74 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/74/)
        HDFS-2510. Add HA-related metrics. Contributed by Aaron T. Myers.

        atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1242410
        Files :

        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestHAMetrics.java
        atm Aaron T. Myers added a comment -

        Thanks a lot for the reviews, Todd. I've just committed this to the branch.

        tlipcon Todd Lipcon added a comment -

        Sorry, missed the comment above:

        Similarly, I couldn't think of anything useful an operator could get from this. It also doesn't help the situation that currently all DN metrics are per-DN-daemon, not per BP offer service. Thus, it's not obvious how to get meaningful DN-side metrics for just a single namespace.

        I think a useful metric which could be exposed is max(time since last successful communication). This would help diagnose the case where one of the racks gets partitioned off from one of the NNs, for example: the metric would start to rise on all of the DNs in that rack.
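To make the max(time since last successful communication) idea concrete, here is a minimal sketch of how a DN could track it across the NNs it reports to. `LastContactSketch`, `recordSuccess` and the NN identifiers are hypothetical names used only for illustration; nothing here is part of the committed patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical DN-side tracker for the suggested metric. */
public class LastContactSketch {
    // Last successful contact time per NN, keyed by an assumed NN identifier.
    private final Map<String, Long> lastSuccessMillis = new ConcurrentHashMap<>();

    /** Record a successful exchange (e.g. a heartbeat) with the given NN. */
    public void recordSuccess(String nnId, long nowMillis) {
        lastSuccessMillis.put(nnId, nowMillis);
    }

    /** Largest gap since last successful contact across all known NNs; 0 if none are known. */
    public long maxMillisSinceLastContact(long nowMillis) {
        long max = 0;
        for (long last : lastSuccessMillis.values()) {
            max = Math.max(max, nowMillis - last);
        }
        return max;
    }
}
```

If a rack loses contact with one NN only, the entry for that NN stops advancing, so the maximum grows on every DN in the rack even while the other NN is still reachable.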

        That said, the ones you've implemented here are fine and the most crucial, so +1 to the current patch and we can discuss adding some more DN-side metrics separately.

        atm Aaron T. Myers added a comment -

        Thanks a lot for the review, Todd. Here's a patch which addresses your feedback.

        tlipcon Todd Lipcon added a comment -
        +  public long getMillisSinceLastLoadedEdits() {
        +    if (haContext.getState().getServiceState() == HAServiceState.STANDBY) {
        

        Could this code get called early during start-up, before the HA context state has been set (i.e. before the first start*Service)?
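One defensive way to handle the start-up window raised above is to read the state once and treat "not yet set" like "not standby". The sketch below is an assumption about how such a guard could look, not what the patch does; `StartupSafeGaugeSketch` and its members are hypothetical, and a null field stands in for an HA state that has not been set yet.

```java
/** Hypothetical guard against reading the HA state before it is initialized. */
public class StartupSafeGaugeSketch {
    public enum ServiceState { ACTIVE, STANDBY }

    // Stays null until the first state transition in this sketch.
    private volatile ServiceState state;
    private volatile long lastLoadedEditsMillis;

    public void enterState(ServiceState s) { state = s; }
    public void editsLoadedAt(long nowMillis) { lastLoadedEditsMillis = nowMillis; }

    public long getMillisSinceLastLoadedEdits(long nowMillis) {
        // Read the volatile field once so the check and the use see the same value.
        ServiceState current = state;
        if (current == ServiceState.STANDBY) {
            return nowMillis - lastLoadedEditsMillis;
        }
        // Covers both the active case and the unset (null) case.
        return 0;
    }
}
```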


        • in EditLogTailer, the new javadoc is redundant - just keep the @return bit
        atm Aaron T. Myers added a comment -

        Here's a patch which addresses the issue. In addition to the provided test, I also tested this manually on a cluster by hitting the /jmx URL and observing the values shown there for the new metrics.

        I implemented all the metrics above, except for the following:

        The difference between highest generation stamp seen from the shared edit log and the highest generation stamp seen from any DN

        I couldn't think of any legitimate use for this. It seems to serve only as a proxy for the size of the pending DN message queues.

        It would probably also be useful to have a DN metric that describes which active/standby NNs it's talking to, e.g. "time since last communicated with standby/active NNs."

        Similarly, I couldn't think of anything useful an operator could get from this. It also doesn't help the situation that currently all DN metrics are per-DN-daemon, not per BP offer service. Thus, it's not obvious how to get meaningful DN-side metrics for just a single namespace.

        I'm certainly open to suggestions for other metrics that people think might be useful.


          People

          • Assignee:
            atm Aaron T. Myers
            Reporter:
            atm Aaron T. Myers