[YARN-9506] Node Managers fail to update cached IP entries of Resource Managers - ASF JIRA

Hi,

We are running a YARN cluster (for Samza jobs) on AWS in HA mode, with yarn.resourcemanager.ha.automatic-failover.enabled=true.

To reproduce the issue:

1. Start a cluster with 2 NodeManagers and 2 ResourceManagers in HA mode, with automatic failover enabled.
   - The ResourceManagers must have DNS entries, and those host names must be set in the config.
   - Example: yarnrm1.me.local and yarnrm2.me.local
2. Stop the active ResourceManager (yarnrm1.me.local) and retire its instance. The NodeManagers fall back to the standby, yarnrm2.me.local.
3. Provision a new ResourceManager with a new IP, and make sure the DNS entry yarnrm1.me.local points to it.
4. Stop the now-active ResourceManager (yarnrm2.me.local).
5. Check the NodeManager logs: the NodeManagers fail to reach the newly provisioned ResourceManager because they keep trying to connect to the old IP.
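
For illustration only, here is a minimal Java sketch of how a cached socket address can go stale and how it could be re-resolved. It is not taken from the NodeManager code; the class StaleAddressCheck, the helpers refreshIfChanged and pinned, and the port number are all hypothetical. It relies only on the standard behaviour that a java.net.InetSocketAddress built from a host name resolves that name once, at construction, and keeps the result.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class StaleAddressCheck {

    // Hypothetical helper: returns a freshly resolved address if the host
    // name now maps to a different IP than the cached one, otherwise
    // returns the cached instance unchanged.
    static InetSocketAddress refreshIfChanged(InetSocketAddress cached) {
        // Constructing a new InetSocketAddress from a host name triggers a
        // new lookup; an existing instance never re-resolves by itself.
        // Note: the JVM also caches lookups (networkaddress.cache.ttl), so a
        // changed DNS record may not be visible immediately.
        InetSocketAddress fresh =
            new InetSocketAddress(cached.getHostString(), cached.getPort());
        if (fresh.isUnresolved()) {
            return cached; // lookup failed, keep what we have
        }
        if (cached.isUnresolved()
                || !fresh.getAddress().equals(cached.getAddress())) {
            return fresh; // the name now points somewhere else
        }
        return cached;
    }

    // Hypothetical helper: builds an address whose host name is "pinned" to
    // a given IP without any lookup, to simulate a stale cached entry.
    static InetSocketAddress pinned(String host, byte[] ip, int port) {
        try {
            return new InetSocketAddress(InetAddress.getByAddress(host, ip), port);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("bad IP length", e);
        }
    }

    public static void main(String[] args) {
        // Pretend "localhost" was cached as 10.0.0.1 (an assumed old IP).
        InetSocketAddress stale = pinned("localhost", new byte[] {10, 0, 0, 1}, 8032);
        InetSocketAddress current = refreshIfChanged(stale);
        System.out.println("cached:  " + stale);
        System.out.println("current: " + current);
    }
}
```

Whether the NodeManager's RPC and failover proxy layers actually hold on to an address in this way is an assumption here; the sketch only shows the general mechanism that would produce the symptom in step 5.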

I can provide the config files (yarn-site.xml and core-site.xml) if needed.