DataNode Process |
HDFS |
This host-level alert is triggered if an individual DataNode process cannot be confirmed to be up and listening on the network for the configured critical threshold. |
NameNode Process |
HDFS |
This host-level alert is triggered if the NameNode process cannot be confirmed to be up and listening on the network for the configured critical threshold. |
NameNode Host CPU Utilization |
HDFS |
This host-level alert is triggered if CPU utilization of the NameNode exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the SystemCPULoad property. |
NameNode Blocks Health |
HDFS |
This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical threshold. |
DataNode Storage |
HDFS |
This host-level alert is triggered if storage capacity is full on the DataNode. It checks the DataNode JMX Servlet for the Capacity and Remaining properties. |
NameNode Web UI |
HDFS |
This host-level alert is triggered if the NameNode Web UI is unreachable. |
Percent DataNodes With Available Space |
HDFS |
This service-level alert is triggered if the percentage of DataNodes whose storage is full exceeds the configured warning and critical thresholds. |
Percent DataNodes Available |
HDFS |
This alert is triggered if the number of down DataNodes in the cluster is greater than the configured critical threshold. It aggregates the results of DataNode process checks. |
NameNode RPC Latency |
HDFS |
This host-level alert is triggered if the NameNode operations RPC latency exceeds the configured critical threshold. Typically, an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for NameNode operations. |
HDFS Capacity Utilization |
HDFS |
This service-level alert is triggered if the HDFS capacity utilization exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties. |
DataNode Web UI |
HDFS |
This host-level alert is triggered if the DataNode Web UI is unreachable. |
Secondary NameNode Process |
HDFS |
This host-level alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the network for the configured critical threshold. |
JournalNode Process |
HDFS |
This host-level alert is triggered if the JournalNode process cannot be confirmed to be up and listening on the network for the configured critical threshold. |
ZooKeeper Failover Controller Process |
HDFS |
This host-level alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network for the configured critical threshold. |
Percent JournalNodes Available |
HDFS |
This alert is triggered if the number of down JournalNodes in the cluster is greater than the configured critical threshold. It aggregates the results of JournalNode process checks. |
NameNode High Availability Health |
HDFS |
This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running. |
History Server Process |
MAPREDUCE2 |
This host-level alert is triggered if the HistoryServer process cannot be confirmed to be up and listening on the network for the configured critical threshold. |
History Server RPC Latency |
MAPREDUCE2 |
This host-level alert is triggered if the HistoryServer operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for operations. |
History Server CPU Utilization |
MAPREDUCE2 |
This host-level alert is triggered if the percent of CPU utilization on the HistoryServer exceeds the configured critical threshold. |
History Server Web UI |
MAPREDUCE2 |
This host-level alert is triggered if the HistoryServer Web UI is unreachable. |
ZooKeeper Server Process |
ZOOKEEPER |
This host-level alert is triggered if the ZooKeeper server process cannot be determined to be up and listening on the network for the configured critical threshold. |
Percent ZooKeeper Servers Available |
ZOOKEEPER |
This service-level alert is triggered if the configured percentage of ZooKeeper processes cannot be determined to be up and listening on the network for the configured critical threshold. It aggregates the results of ZooKeeper process checks. |
ResourceManager RPC Latency |
YARN |
This host-level alert is triggered if the ResourceManager operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for ResourceManager operations. |
ResourceManager CPU Utilization |
YARN |
This host-level alert is triggered if CPU utilization of the ResourceManager exceeds the configured warning and critical thresholds. It checks the ResourceManager JMX Servlet for the SystemCPULoad property. |
NodeManager Health |
YARN |
This host-level alert checks the node health property available from the NodeManager component. |
Percent NodeManagers Available |
YARN |
This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical threshold. It aggregates the results of NodeManager process checks. |
ResourceManager Web UI |
YARN |
This host-level alert is triggered if the ResourceManager Web UI is unreachable. |
App Timeline Web UI |
YARN |
This host-level alert is triggered if the App Timeline Server Web UI is unreachable. |
NodeManager Web UI |
YARN |
This host-level alert is triggered if the NodeManager Web UI is unreachable. |
NameNode Last Checkpoint |
HDFS |
This alert checks the last time that the NameNode performed a checkpoint. The script also checks the number of uncommitted transactions. |
NameNode Directory Status |
HDFS |
This alert checks the NameNode JMX Servlet for the NameDirStatuses metric to see if any directories report a failure. |
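
Many of the alerts above follow one of two patterns: a measured value is compared against warning and critical thresholds, or the results of per-host process checks are aggregated into a percentage. The following is a minimal sketch of both patterns, not the product's actual implementation. The function names, the threshold values, and the formula used (CapacityUsed + CapacityRemaining) as the total are assumptions made for illustration only.

```python
from typing import List


def classify(value: float, warning: float, critical: float) -> str:
    """Map a measured value to an alert state; higher values are worse.

    Hypothetical helper: the thresholds are supplied by the caller and
    are not defaults taken from any product.
    """
    if value > critical:
        return "CRITICAL"
    if value > warning:
        return "WARNING"
    return "OK"


def capacity_utilization_percent(capacity_used: int, capacity_remaining: int) -> float:
    """Percent of capacity in use, assuming total = used + remaining."""
    total = capacity_used + capacity_remaining
    if total == 0:
        return 0.0  # no capacity reported; treat as 0% used in this sketch
    return 100.0 * capacity_used / total


def percent_down(process_up: List[bool]) -> float:
    """Aggregate per-host process checks (True = up) into a percent-down value."""
    if not process_up:
        return 0.0
    down = sum(1 for up in process_up if not up)
    return 100.0 * down / len(process_up)


# Threshold-style alert: 850 units used, 150 remaining, thresholds 75% / 80%.
used_pct = capacity_utilization_percent(850, 150)
capacity_state = classify(used_pct, warning=75.0, critical=80.0)

# Aggregate-style alert: one of four hosts is down, thresholds 10% / 30%.
down_pct = percent_down([True, True, False, True])
aggregate_state = classify(down_pct, warning=10.0, critical=30.0)
```

With these illustrative inputs, the capacity check lands in the critical range and the aggregate check in the warning range. Keeping the classification step separate from the measurement step mirrors how the table describes the alerts: the same threshold comparison applies whether the value comes from a JMX property or from aggregated process checks.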