  1. Hadoop Common
  2. HADOOP-14982

Clients using FailoverOnNetworkExceptionRetry can go into a failover loop if they are used without authenticating with Kerberos in an HA environment

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.0, 2.10.0
    • Component/s: common
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      If HA is configured for the Resource Manager in a secure environment, the mapred client goes into a loop if the user is not authenticated with Kerberos.

      [root@pb6sec-1 ~]# mapred job -list
      17/10/25 06:37:43 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm36
      17/10/25 06:37:43 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
      17/10/25 06:37:43 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host_redacted/IP_redacted"; destination host is: "com.host2.redacted:8032; , while invoking ApplicationClientProtocolPBClientImpl.getApplications over rm36 after 1 failover attempts. Trying to failover after sleeping for 160ms.
      17/10/25 06:37:43 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm25
      17/10/25 06:37:43 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From host_redacted/IP_redacted to com.host.redacted:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getApplications over rm25 after 2 failover attempts. Trying to failover after sleeping for 582ms.
      17/10/25 06:37:44 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm36
      17/10/25 06:37:44 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
      17/10/25 06:37:44 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host_redacted/IP_redacted"; destination host is: "com.host2.redacted:8032; , while invoking ApplicationClientProtocolPBClientImpl.getApplications over rm36 after 3 failover attempts. Trying to failover after sleeping for 977ms.
      17/10/25 06:37:45 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm25
      17/10/25 06:37:45 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From host_redacted/IP_redacted to com.host.redacted:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getApplications over rm25 after 4 failover attempts. Trying to failover after sleeping for 1667ms.
      17/10/25 06:37:46 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm36
      17/10/25 06:37:46 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
      17/10/25 06:37:46 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host_redacted/IP_redacted"; destination host is: "com.host2.redacted:8032; , while invoking ApplicationClientProtocolPBClientImpl.getApplications over rm36 after 5 failover attempts. Trying to failover after sleeping for 2776ms.
      17/10/25 06:37:49 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm25
      17/10/25 06:37:49 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From host_redacted/IP_redacted to com.host.redacted:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getApplications over rm25 after 6 failover attempts. Trying to failover after sleeping for 1055ms.
      17/10/25 06:37:50 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm36
      17/10/25 06:37:50 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
      17/10/25 06:37:50 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host_redacted/IP_redacted"; destination host is: "com.host2.redacted:8032; , while invoking ApplicationClientProtocolPBClientImpl.getApplications over rm36 after 7 failover attempts. Trying to failover after sleeping for 2608ms.
      ...
      

      The reason is that the retry handler sees a ConnectException and fails over to the inactive RM. That cannot succeed, so it comes back to the active RM and the whole process starts again. The RetryHandler should examine whether the ConnectException is actually caused by a GSSException (and probably check for the "No valid credentials provided" message); if so, it should not perform a failover.
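
      The proposed check can be sketched as follows. This is a minimal illustration using only JDK classes, not the code from the attached patches: the class FailoverDecisionSketch, the Decision enum, and the methods isAuthenticationFailure and decide are hypothetical names, and the sketch assumes that matching on the exception type anywhere in the cause chain is sufficient (it does not inspect the "No valid credentials provided" message).

```java
import java.io.IOException;
import java.net.ConnectException;
import javax.security.sasl.SaslException;
import org.ietf.jgss.GSSException;

// Hypothetical sketch; this is NOT Hadoop's RetryPolicies implementation.
class FailoverDecisionSketch {

    // Assumed outcome names, loosely modelled on the behaviour in the report.
    enum Decision { FAIL, FAILOVER_AND_RETRY }

    // Walks the cause chain looking for a SASL or GSS failure.
    // The depth limit guards against a cyclic cause chain.
    static boolean isAuthenticationFailure(Throwable t) {
        int depth = 0;
        while (t != null && depth++ < 32) {
            if (t instanceof SaslException || t instanceof GSSException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    // Decides what to do with a failed call.
    static Decision decide(Exception e) {
        if (isAuthenticationFailure(e)) {
            // Failing over cannot help: the client has no valid credentials.
            return Decision.FAIL;
        }
        if (e instanceof IOException) {
            // Network-level problems (ConnectException is an IOException)
            // remain candidates for failover.
            return Decision.FAILOVER_AND_RETRY;
        }
        return Decision.FAIL;
    }

    public static void main(String[] args) {
        // Mirrors the nesting in the log: IOException -> IOException
        // -> SaslException -> GSSException.
        Exception noTgt = new IOException("Failed on local exception",
                new IOException(new SaslException("GSS initiate failed",
                        new GSSException(GSSException.NO_CRED))));
        Exception refused = new ConnectException("Connection refused");

        System.out.println("no TGT:  " + decide(noTgt));
        System.out.println("refused: " + decide(refused));
    }
}
```

      With a check of this kind the client would stop at the first SASL failure instead of alternating between rm36 and rm25, while a plain "Connection refused" would still trigger a failover.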

      Attachments

        1. HADOOP-14892-001.patch
          7 kB
          Peter Bacsko
        2. HADOOP-14892-002.patch
          6 kB
          Peter Bacsko
        3. HADOOP-14982-003.patch
          6 kB
          Peter Bacsko

            People

              Assignee: Peter Bacsko (pbacsko)
              Reporter: Peter Bacsko (pbacsko)
              Votes: 0
              Watchers: 7