Description
NodeManagers (NMs) that the ResourceManager (RM) has marked as lost fail to rejoin the cluster.
When the NM is lost, the RM log reads:
INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:<host:port> Timed out after 600 secs
INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing <host:port> of type EXPIRE
INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Removed Node <host:port>
INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: <host:port> Node Transitioned from RUNNING to LOST
When the NM tries to join back, the RM log reads:
INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node not found rebooting <host:port>
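The sequence in the two log excerpts can be illustrated with a toy model. This is only a sketch of the behavior the logs suggest, not Hadoop code: the class NodeTrackerSketch, its methods, and the Action enum are hypothetical names, and the only value taken from the report is the 600-second expiry. It assumes the RM forgets an expired node entirely and answers any later heartbeat from it with a reboot instruction, so the node can presumably rejoin only if it acts on that instruction and registers again.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; every name in this class is hypothetical, not a Hadoop API.
public class NodeTrackerSketch {
    public enum Action { NORMAL, REBOOT }

    // Expiry interval taken from the log line "Timed out after 600 secs".
    public static final long EXPIRY_MS = 600_000L;

    // nodeId -> timestamp (ms) of the last heartbeat or registration
    private final Map<String, Long> lastSeen = new HashMap<>();

    // A node becomes known to the tracker only through registration.
    public void register(String nodeId, long nowMs) {
        lastSeen.put(nodeId, nowMs);
    }

    // Forget nodes that have been silent longer than EXPIRY_MS (RUNNING -> LOST).
    public void expireStale(long nowMs) {
        lastSeen.values().removeIf(last -> nowMs - last > EXPIRY_MS);
    }

    // A heartbeat from an unknown node is answered with REBOOT
    // ("Node not found rebooting <host:port>"); it does not re-add the node.
    public Action heartbeat(String nodeId, long nowMs) {
        if (!lastSeen.containsKey(nodeId)) {
            return Action.REBOOT;
        }
        lastSeen.put(nodeId, nowMs);
        return Action.NORMAL;
    }
}
```

In this model, a node that keeps sending heartbeats after expiry receives REBOOT every time and never returns to the known set, which matches the symptom reported here. Calling register again is the only way back in.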
Issue Links
- relates to MAPREDUCE-3034 NM should act on a REBOOT command from RM (Resolved)