[HDFS-17055] Export HAState as a metric from Namenode for monitoring - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.4.0, 3.3.9
Fix Version/s: 3.4.0
Component/s: hdfs
Labels:
- pull-request-available

Target Version/s: 3.4.0, 3.3.9
Hadoop Flags: Reviewed

Description

We'd like to measure the uptime of Namenodes: the percentage of time during which the active/standby/observer node is available (up and running). We could monitor the namenode from an external service, such as ZKFC, but that would require the external service itself to be available 100% of the time. When this external monitoring service is down, we would have no information on whether our Namenodes are still up.

We propose a different approach: emit the Namenode state directly from the namenode itself. Whenever a data point for this metric is missing, we consider the corresponding namenode to be down/not available. In other words, we assume the metric collection/monitoring infrastructure to be 100% reliable.
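To illustrate how uptime could be derived from this metric, here is a minimal sketch under stated assumptions. The class `UptimeSketch`, the method `availability`, and the fixed collection interval are hypothetical and not part of the patch. The sketch divides a time window into expected collection slots and treats every slot without a data point as downtime.

```java
import java.util.List;

// Hypothetical sketch: none of these names come from Hadoop or the patch.
final class UptimeSketch {

    /**
     * Returns the fraction of expected collection slots in
     * [windowStartSec, windowEndSec) that contain at least one data point.
     * A slot with no data point counts as "namenode down".
     */
    static double availability(List<Long> sampleTimesSec,
                               long windowStartSec,
                               long windowEndSec,
                               long intervalSec) {
        long expected = (windowEndSec - windowStartSec) / intervalSec;
        if (intervalSec <= 0 || expected <= 0) {
            throw new IllegalArgumentException("window must span at least one interval");
        }
        boolean[] seen = new boolean[(int) expected];
        for (long t : sampleTimesSec) {
            if (t < windowStartSec || t >= windowEndSec) {
                continue; // data point outside the window
            }
            int slot = (int) ((t - windowStartSec) / intervalSec);
            if (slot < expected) {
                seen[slot] = true; // several points in one slot count once
            }
        }
        int up = 0;
        for (boolean present : seen) {
            if (present) {
                up++;
            }
        }
        return (double) up / expected;
    }

    public static void main(String[] args) {
        // Four expected 60-second slots; the data point for 120..179 is missing.
        System.out.println(availability(List.of(0L, 60L, 180L), 0, 240, 60)); // prints 0.75
    }
}
```

The same calculation could be run per state (active, standby, observer) by filtering the data points on the emitted state value before counting.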

One implementation detail: in Hadoop, we have the NameNodeMetrics class, which is currently used to emit all metrics for NameNode.java. However, we don't think that is a good place to emit the NameNode HAState. HAState is stored in NameNode.java, and we should emit it directly from NameNode.java. Otherwise, we would duplicate this information in two classes and have to keep them in sync. Besides, the NameNodeMetrics class does not hold a reference to the NameNode object it belongs to: a NameNodeMetrics instance is created by the static function initMetrics() in NameNode.java.

We shouldn't emit the HA state from FSNamesystem.java either, as it is initialized from NameNode.java and all state transitions are implemented in NameNode.java.
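The design argument above (emit the state from the class that owns it, instead of copying it into a separate metrics object) can be sketched without any Hadoop dependency. Everything here is an assumption made for illustration: `HaStateGaugeSketch`, `Node`, the `HaState` enum, and its numeric encoding are hypothetical and do not reflect Hadoop's actual classes, metric names, or values.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch: a stand-in for "the class that owns the HA state".
final class HaStateGaugeSketch {

    // Illustrative states; the ordinal encoding is an assumption, not Hadoop's.
    enum HaState { INITIALIZING, ACTIVE, STANDBY, OBSERVER }

    static final class Node {
        // Single source of truth for the state; no copy lives elsewhere.
        private volatile HaState state = HaState.STANDBY;

        void transitionTo(HaState next) {
            state = next;
        }

        // The gauge reads the owner's field on every poll, so there is
        // nothing to keep in sync after a state transition.
        IntSupplier stateGauge() {
            return () -> state.ordinal();
        }
    }

    public static void main(String[] args) {
        Node node = new Node();
        IntSupplier gauge = node.stateGauge();
        System.out.println(gauge.getAsInt()); // prints 2 (STANDBY)
        node.transitionTo(HaState.ACTIVE);
        System.out.println(gauge.getAsInt()); // prints 1 (ACTIVE)
    }
}
```

Because the gauge is created by the owning object, a metrics collector only needs the supplier and never a second copy of the state, which is the duplication the description wants to avoid.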

Issue Links

links to

GitHub Pull Request #5764

GitHub Pull Request #5790

People

Assignee: Xing Lin

Reporter: Xing Lin

Votes: 0

Watchers: 2

Dates

Created: 21/Jun/23 21:47

Updated: 03/Jul/23 16:51

Resolved: 26/Jun/23 22:53