Hadoop YARN / YARN-10301

"DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      We observed the "Mismatched response." error in the RM's log when an NM becomes network-partitioned after an RM failover. Here is how it happens:

       

      Initially, a sleeper YARN service is running in a cluster with two RMs (an active RM1 and a standby RM2) and one NM. At some point, we perform an RM failover from RM1 to RM2.

      RM1's log:

      2020-06-01 16:29:20,387 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to standby state

      RM2's log:

      2020-06-01 16:29:27,818 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to active state

       

      After the RM failover, the NM encounters a network partition and fails to register with RM2: no "NodeManager from node *** registered" message appears in RM2's log.
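
One way to confirm the missing registration is to scan RM2's log for that message. The sketch below is illustrative only: the helper name `find_nm_registrations` is hypothetical, and the regular expression assumes nothing beyond the wording quoted above (the node name is matched with a wildcard, not filled in).

```python
import re

# Assumed pattern, derived only from the message quoted in this report.
# The elided node name ("***") is matched by a wildcard.
NM_REGISTERED = re.compile(r"NodeManager from node .*registered")


def find_nm_registrations(log_lines):
    """Return the log lines that report a NodeManager registration."""
    return [line for line in log_lines if NM_REGISTERED.search(line)]


# RM2 log line copied from this report; no registration line follows it.
rm2_log = [
    "2020-06-01 16:29:27,818 INFO org.apache.hadoop.yarn.server."
    "resourcemanager.ResourceManager: Transitioned to active state",
]

registrations = find_nm_registrations(rm2_log)
print("NM registered" if registrations else "no NM registration found")
```

In practice the list would be read from RM2's log file; an empty result after the "Transitioned to active state" line matches the situation described here.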

       

      This does not affect the sleeper YARN service, which recovers successfully after the RM failover, as RM2's log shows:

      2020-06-01 16:30:06,703 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_6_0001_000001 State change from LAUNCHED to RUNNING on event = REGISTERED

       

      Then, we stop the sleeper service. In RM2's log, we can see that:

      2020-06-01 16:30:12,157 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: application_6_0001 unregistered successfully.
      ...
      2020-06-01 16:31:09,861 INFO org.apache.hadoop.yarn.service.webapp.ApiServer: Successfully stopped service sleeper1

      And in the AM's log, we can see that:

      2020-06-01 16:30:12,651 [shutdown-hook-0] INFO  service.ServiceMaster - SHUTDOWN_MSG:

       

      About 13 minutes after the application unregistered, we observe the "Mismatched response." error in RM2's log:

      2020-06-01 16:43:20,699 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server 
      org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
        at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:376)
        at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:623)
        at org.apache.hadoop.ipc.Client$Connection.access$2400(Client.java:414)        
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:827)              
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:823)              
        at java.security.AccessController.doPrivileged(Native Method)                  
        at javax.security.auth.Subject.doAs(Subject.java:422)                          
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:823)     
        at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:414)        
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1667)                
        at org.apache.hadoop.ipc.Client.call(Client.java:1483)                         
        at org.apache.hadoop.ipc.Client.call(Client.java:1436)                         
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy102.stopContainers(Unknown Source)                      
        at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:147)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)                 
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)                            
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy103.stopContainers(Unknown Source)                      
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:153)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:354)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)                                       
      2020-06-01 16:43:20,700 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error cleaning master 
      javax.security.sasl.SaslException: DIGEST-MD5: digest response format violation. Mismatched response. [Caused by org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)       
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)             
        at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)    
        at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)  
        at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
        at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:150)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)                 
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)                            
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy103.stopContainers(Unknown Source)                      
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:153)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:354)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)                                       
      Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1593)               
        at org.apache.hadoop.ipc.Client.call(Client.java:1539)                         
        at org.apache.hadoop.ipc.Client.call(Client.java:1436)                         
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy102.stopContainers(Unknown Source)                      
        at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:147)
        ... 15 more                                                                    
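
To make the timing of the sequence explicit, the following sketch computes the gaps between the events reported above. It is only an illustration: the event labels and the helper names `parse_ts` and `seconds_between` are made up for this example, while the timestamps are copied from the log excerpts in this report.

```python
from datetime import datetime

# Timestamps copied from the log excerpts above; labels are illustrative.
EVENTS = [
    ("RM1 transitioned to standby", "2020-06-01 16:29:20,387"),
    ("RM2 transitioned to active", "2020-06-01 16:29:27,818"),
    ("AM attempt RUNNING on RM2", "2020-06-01 16:30:06,703"),
    ("application unregistered", "2020-06-01 16:30:12,157"),
    ("AM shutdown hook", "2020-06-01 16:30:12,651"),
    ("service stopped (ApiServer)", "2020-06-01 16:31:09,861"),
    ("'Mismatched response' warning", "2020-06-01 16:43:20,699"),
]


def parse_ts(text):
    """Parse a log timestamp of the form 'YYYY-MM-DD HH:MM:SS,mmm'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def seconds_between(events, start_label, end_label):
    """Return the elapsed seconds between two labelled events."""
    times = {label: parse_ts(ts) for label, ts in events}
    return (times[end_label] - times[start_label]).total_seconds()


failover_gap = seconds_between(
    EVENTS, "RM1 transitioned to standby", "RM2 transitioned to active")
cleanup_delay = seconds_between(
    EVENTS, "application unregistered", "'Mismatched response' warning")

print(f"failover gap: {failover_gap:.3f} s")
print(f"unregister to warning: {cleanup_delay:.3f} s")
```

With these timestamps, the warning is logged roughly 13 minutes after the application unregistered, and the stack trace shows it being raised from AMLauncher.cleanup while calling stopContainers.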

            People

            • Assignee:
              Unassigned
            • Reporter:
              YCozy