Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.4.0
-
Tested on multiple environments:
A. Docker Environment:
- Base OS: Ubuntu 20.04
- Java 8 installed from OpenJDK.
- Docker image includes Hadoop binaries, user configurations, and ports for YARN services.
- Verified behavior using a Hadoop snapshot in a containerized environment.
- Performed Namenode formatting and validated service interactions through exposed ports.
- Repo reference: arjunmohnot/hadoop-yarn-docker
B. Bare-metal Distributed Setup (RedHat Linux):
- Running Java 8 in a High-Availability (HA) configuration with Zookeeper for locking mechanism.
- Two ResourceManagers (RM) in HA: Failover tested between HA1 and HA2 RM node, including state retention and proper node state transitions.
- Verified node state transitions during RM failover, ensuring nodes moved between LOST, ACTIVE, and other states as expected.
Tested on multiple environments: A. Docker Environment : Base OS: Ubuntu 20.04 Java 8 installed from OpenJDK. Docker image includes Hadoop binaries, user configurations, and ports for YARN services. Verified behavior using a Hadoop snapshot in a containerized environment. Performed Namenode formatting and validated service interactions through exposed ports. Repo reference: arjunmohnot/hadoop-yarn-docker B. Bare-metal Distributed Setup (RedHat Linux) : Running Java 8 in a High-Availability (HA) configuration with Zookeeper for locking mechanism. Two ResourceManagers (RM) in HA: Failover tested between HA1 and HA2 RM node, including state retention and proper node state transitions. Verified node state transitions during RM failover, ensuring nodes moved between LOST, ACTIVE, and other states as expected.
-
Reviewed
Description
Issue Overview
When the ResourceManager (RM) starts, nodes listed in the "include" file are not immediately reported until their corresponding NodeManagers (NMs) send their first heartbeat. However, nodes in the "exclude" file are instantly reflected in the "Decommissioned Hosts" section with a port value -1.
This design creates several challenges:
- Untracked Nodemanagers: During Resourcemanager HA failover or RM standalone restart, some nodes may not report back, even though they are listed in the "include" file. These nodes neither appear in the LOST state nor are they represented in the RM's JMX metrics. This results in an untracked state, making it difficult to monitor their status. While in HDFS similar behaviour exists and is marked as "DEAD".
- Monitoring Gaps: Nodes in the "include" file are not visible until they send their first heartbeat. This delay impacts real-time cluster monitoring, leading to a lack of immediate visibility for these nodes in Resourcemanager's state on the total no. of nodes.
- Operational Impact: These unreported nodes cause operational difficulties, particularly in automated workflows such as OS Upgrade Automation (OSUA), node recovery automation, and others where validation depends on nodes being reflected in JMX as LOST, UNHEALTHY, or DECOMMISSIONED, etc. Nodes that don't report, however, require hacky workarounds to determine their accurate status.
Proposed Solution
To address these issues, we propose automatically assigning the LOST state to any node listed in the "include" file that are not registered and not part of the exclude file by default at the RM startup or HA failover. This can be done by marking the node with a special port value -2, signaling that the node is considered LOST but has not yet been reported. Whenever a heartbeat is received for that nodeID, it will be transitioned from LOST to RUNNING, UNHEALTHY, or any other required desired state.
Key implementation points
- Mark Unreported Nodes as LOST: Nodes in the "include" file not part of the RM active node context should be automatically marked as LOST. This can be achieved by modifying the NodesListManager under the refreshHostsReader method, invoked during failover, or manual node refresh operations. This logic should ensure that all unregistered nodes are moved to the LOST state, with port -2 indicating the node is untracked.
- For non-HA setups, this process can be triggered during RM service startup to mark nodes as LOST initially, and they will gradually transition to their desired state when the heartbeat is received.
- Handle Node Heartbeat and Transition: When a node sends its first heartbeat, the system should verify if the node is listed in getInactiveRMNodes(). If the node exists in the LOST state, the RM should remove it from the inactive list, decrement the LOST node count, and handle the transition back to the active node set.
- This logic can be placed in the state transition method within RMNodeImpl.java, ensuring that nodes transitioned from NEW to LOST state, and recover gracefully from the LOST state upon receiving their heartbeat.
Benefits
- Improved Cluster Monitoring: Automatically assigning a LOST state to nodes listed in the "include" file but not reporting ensures that every node in the cluster has a well-defined state (ACTIVE, LOST, DECOMMISSIONED, UNHEALTHY, etc). This eliminates any potential gaps in cluster node visibility and simplifies operational monitoring.
- Better Recovery Management: By marking unreported nodes as LOST, automation can quickly identify which nodes require attention during recovery efforts to restore cluster health. This prevents confusion between unreachable nodes and untracked nodes, improving recovery accuracy.
- Enhanced Cluster Stability: This approach improves overall stability by preventing nodes from slipping into an untracked or unknown state. It guarantees that the system remains aware of all nodes, reducing issues during RM failover or restart scenarios.
Additional Considerations
- Feature Flag Control: This feature will be enabled/disabled via a configuration flag, allowing users to adjust behavior based on their requirements. By default, it is marked as False.
- Enough Validations: The approach has been well-tested on non-HA and HA setups, and a dummy docker-based setup has been created to replicate the behavior. Added the required unit test cases to validate the code behavior. Demo video for this change.
Any thoughts/suggestions/feedback are welcome!
Attachments
Issue Links
- links to