Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
2.9.2
-
None
-
None
-
Hadoop:
Hadoop 2.9.2 (some line numbers may be off because we have merged some 3.0+ patches)
Security with Kerberos
Configured following https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html
Java:
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
Kerberos:
Description
The NM retries indefinitely to connect to the wrong RM's resource tracker port:
INFO [main:RetryInvocationHandler@411] - java.net.ConnectException: Call From standby.rm.server/10.122.138.139 to standby.rm.server:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over dev1 after 19 failover attempts. Trying to failover after sleeping for 40497ms.
After changing yarn.client.failover-proxy-provider to org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider, the NodeManager cannot find the right ResourceTracker address:
getRMHAId:233, HAUtil (org.apache.hadoop.yarn.conf)
getConfKeyForRMInstance:294, HAUtil (org.apache.hadoop.yarn.conf)
getConfValueForRMInstance:302, HAUtil (org.apache.hadoop.yarn.conf)
getConfValueForRMInstance:314, HAUtil (org.apache.hadoop.yarn.conf)
getSocketAddr:3341, YarnConfiguration (org.apache.hadoop.yarn.conf)
getRMAddress:77, ServerRMProxy (org.apache.hadoop.yarn.server.api)
run:144, FederationRMFailoverProxyProvider$1 (org.apache.hadoop.yarn.server.federation.failover)
doPrivileged:-1, AccessController (java.security)
doAs:422, Subject (javax.security.auth)
doAs:1893, UserGroupInformation (org.apache.hadoop.security)
getProxyInternal:141, FederationRMFailoverProxyProvider (org.apache.hadoop.yarn.server.federation.failover)
performFailover:192, FederationRMFailoverProxyProvider (org.apache.hadoop.yarn.server.federation.failover)
failover:217, RetryInvocationHandler$ProxyDescriptor (org.apache.hadoop.io.retry)
processRetryInfo:149, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)
processWaitTimeAndRetryInfo:142, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)
invokeOnce:107, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)
invoke:359, RetryInvocationHandler (org.apache.hadoop.io.retry)
registerNodeManager:-1, $Proxy85 (com.sun.proxy)
registerWithRM:378, NodeStatusUpdaterImpl (org.apache.hadoop.yarn.server.nodemanager)
serviceStart:252, NodeStatusUpdaterImpl (org.apache.hadoop.yarn.server.nodemanager)
start:194, AbstractService (org.apache.hadoop.service)
serviceStart:121, CompositeService (org.apache.hadoop.service)
start:194, AbstractService (org.apache.hadoop.service)
initAndStartNodeManager:864, NodeManager (org.apache.hadoop.yarn.server.nodemanager)
main:931, NodeManager (org.apache.hadoop.yarn.server.nodemanager)
The provider tries to find the active RM's address in getRMHAId:233, but it cannot find the right one because that check can only return the RM whose address is local:
if (!s.isUnresolved() && NetUtils.isLocalAddress(s.getAddress())) {
  currentRMId = rmId.trim();
  found++;
}
If the NM and an RM are on the same node and that RM is in standby state, the NM will keep sending RPC calls to that RM indefinitely.
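The selection behavior described above can be illustrated with a minimal, self-contained sketch. It does not use the Hadoop classes: LocalRmIdSketch, pickLocalRmId, the rm1/rm2 ids and the host name active.rm.server are hypothetical names chosen for illustration, and host matching is reduced to a plain string comparison instead of NetUtils.isLocalAddress. It only shows that a lookup based on "which RM address is local" returns the co-located RM regardless of whether that RM is active.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LocalRmIdSketch {

    // Simplified stand-in for the local-address lookup: return the id of the
    // RM whose configured host equals the local host, or null if none matches.
    // The real code resolves socket addresses; a string comparison is assumed here.
    static String pickLocalRmId(Map<String, String> rmHostsById, String localHost) {
        String currentRmId = null;
        int found = 0;
        for (Map.Entry<String, String> entry : rmHostsById.entrySet()) {
            if (entry.getValue().equals(localHost)) {
                currentRmId = entry.getKey().trim();
                found++;
            }
        }
        if (found > 1) {
            throw new IllegalArgumentException("multiple RM addresses match the local host");
        }
        return currentRmId;
    }

    public static void main(String[] args) {
        Map<String, String> rmHostsById = new LinkedHashMap<>();
        rmHostsById.put("rm1", "active.rm.server");  // hypothetical active RM host
        rmHostsById.put("rm2", "standby.rm.server"); // standby RM, co-located with the NM
        String activeRmId = "rm1";                   // assumed HA state for this example

        // The NM runs on standby.rm.server, so the lookup picks the local RM.
        String chosen = pickLocalRmId(rmHostsById, "standby.rm.server");
        System.out.println("chosen=" + chosen
                + ", reachesActive=" + activeRmId.equals(chosen));
    }
}
```

Because the lookup never consults the HA state, every failover attempt in this sketch would resolve to the same local id, which is consistent with the repeated connection attempts to standby.rm.server:8031 in the log above.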