  1. Hadoop YARN
  2. YARN-9823

NodeManager cannot get the right ResourceTracker address in Federation mode

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.9.2
    • Fix Version/s: None
    • Component/s: federation, nodemanager
    • Labels: None

    Description

      The NM retries indefinitely against the wrong RM's resource tracker port:

      INFO [main:RetryInvocationHandler@411] - java.net.ConnectException: Call From standby.rm.server/10.122.138.139 to standby.rm.server:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over dev1 after 19 failover attempts. Trying to failover after sleeping for 40497ms.

       

      After changing yarn.client.failover-proxy-provider to org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider, the NodeManager cannot find the right ResourceTracker address. The call stack at the point of the lookup:

      getRMHAId:233, HAUtil (org.apache.hadoop.yarn.conf)
      getConfKeyForRMInstance:294, HAUtil (org.apache.hadoop.yarn.conf)
      getConfValueForRMInstance:302, HAUtil (org.apache.hadoop.yarn.conf)
      getConfValueForRMInstance:314, HAUtil (org.apache.hadoop.yarn.conf)
      getSocketAddr:3341, YarnConfiguration (org.apache.hadoop.yarn.conf)
      getRMAddress:77, ServerRMProxy (org.apache.hadoop.yarn.server.api)
      run:144, FederationRMFailoverProxyProvider$1 (org.apache.hadoop.yarn.server.federation.failover)
      doPrivileged:-1, AccessController (java.security)
      doAs:422, Subject (javax.security.auth)
      doAs:1893, UserGroupInformation (org.apache.hadoop.security)
      getProxyInternal:141, FederationRMFailoverProxyProvider (org.apache.hadoop.yarn.server.federation.failover)
      performFailover:192, FederationRMFailoverProxyProvider (org.apache.hadoop.yarn.server.federation.failover)
      failover:217, RetryInvocationHandler$ProxyDescriptor (org.apache.hadoop.io.retry)
      processRetryInfo:149, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)
      processWaitTimeAndRetryInfo:142, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)
      invokeOnce:107, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)
      invoke:359, RetryInvocationHandler (org.apache.hadoop.io.retry)
      registerNodeManager:-1, $Proxy85 (com.sun.proxy)
      registerWithRM:378, NodeStatusUpdaterImpl (org.apache.hadoop.yarn.server.nodemanager)
      serviceStart:252, NodeStatusUpdaterImpl (org.apache.hadoop.yarn.server.nodemanager)
      start:194, AbstractService (org.apache.hadoop.service)
      serviceStart:121, CompositeService (org.apache.hadoop.service)
      start:194, AbstractService (org.apache.hadoop.service)
      initAndStartNodeManager:864, NodeManager (org.apache.hadoop.yarn.server.nodemanager)
      main:931, NodeManager (org.apache.hadoop.yarn.server.nodemanager)

      The provider tries to resolve the main RM address through HAUtil.getRMHAId (line 233), but it cannot find the right one, because that method only ever returns the RM id whose address is local to this host:

      if (!s.isUnresolved() && NetUtils.isLocalAddress(s.getAddress())) {
        currentRMId = rmId.trim();
        found++;
      }

      If the NM and an RM run on the same node and that RM is in standby state, the NM keeps sending RPCs to the standby RM indefinitely.
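
The behaviour described above can be reproduced with a small standalone model. This is only a sketch, not Hadoop code: the class and method names, the RM ids rm1 and rm2, and the host active.rm.server are hypothetical, and a plain string comparison stands in for NetUtils.isLocalAddress. Treating more than one local match as an error is also an assumption about what the found counter is for.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of the local-address RM id lookup. */
class LocalRmIdSketch {

    /**
     * Selects an RM id only when its configured host equals the local host,
     * mirroring the quoted check. Returns null when no RM is local.
     */
    static String pickRmIdByLocalAddress(Map<String, String> rmHostsById, String localHost) {
        String currentRmId = null;
        int found = 0;
        for (Map.Entry<String, String> entry : rmHostsById.entrySet()) {
            // Stand-in for NetUtils.isLocalAddress(s.getAddress()).
            if (entry.getValue().equals(localHost)) {
                currentRmId = entry.getKey().trim();
                found++;
            }
        }
        if (found > 1) {
            // Assumption: an ambiguous match is treated as a configuration error.
            throw new IllegalStateException("more than one RM id matches the local host");
        }
        return currentRmId;
    }

    /** Repeats the lookup once per failover attempt and records each target. */
    static List<String> simulateFailovers(Map<String, String> rmHostsById,
                                          String localHost, int attempts) {
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            // Nothing in the lookup depends on which RM is active,
            // so every attempt resolves to the same id.
            targets.add(pickRmIdByLocalAddress(rmHostsById, localHost));
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, String> rmHostsById = new LinkedHashMap<>();
        rmHostsById.put("rm1", "active.rm.server");   // hypothetical active RM
        rmHostsById.put("rm2", "standby.rm.server");  // standby RM, co-located with the NM
        System.out.println(simulateFailovers(rmHostsById, "standby.rm.server", 3));
    }
}
```

In this model, a NodeManager co-located with the standby RM resolves to the standby's id on every attempt, so failing over never moves it to the active RM. A NodeManager on a host with no RM gets no id at all from the same lookup.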

          People

            Assignee: Unassigned
            Reporter: qiwei huang (yzzjjyy)