Hadoop Map/Reduce
MAPREDUCE-4235

Killing app can lead to inconsistent app status between RM and HS

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: 0.23.3
    • Fix Version/s: None
    • Component/s: mrv2
    • Labels: None

      Description

      If a client tries to kill an application that is about to complete, the application status reported by the ResourceManager's web UI and by the history server can be inconsistent. When the problem occurs, the ResourceManager shows the Status/FinalStatus as KILLED/KILLED, and the history link is broken because it still points at the ApplicationMaster, which is now gone. The history server entry shows the application state as SUCCEEDED.

        Activity

        Jason Lowe added a comment -

        This is an inherent race between the RM and the AM: the AM can succeed or fail just as the RM tries to kill it, and whether the RM reports the job as successful or killed depends on which way the race is resolved. In practice, clients should not be surprised if a kill request ends up with the application in a non-killed terminal state such as FAILED or SUCCEEDED because of this race.
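
        A minimal client-side sketch of tolerating this race is below. It is written against the Hadoop 2.x YarnClient API rather than the 0.23 client this issue was filed against, and the class name is illustrative; the point is only that any terminal state, not just KILLED, should be treated as a normal outcome of a kill request:

        import java.util.EnumSet;

        import org.apache.hadoop.yarn.api.records.ApplicationId;
        import org.apache.hadoop.yarn.api.records.ApplicationReport;
        import org.apache.hadoop.yarn.api.records.YarnApplicationState;
        import org.apache.hadoop.yarn.client.api.YarnClient;

        public class TolerantKill {

          private static final EnumSet<YarnApplicationState> TERMINAL =
              EnumSet.of(YarnApplicationState.FINISHED,
                         YarnApplicationState.FAILED,
                         YarnApplicationState.KILLED);

          /**
           * Ask the RM to kill the application, then wait for any terminal state.
           * Because the kill races with AM completion, the application may
           * legitimately end up FINISHED/SUCCEEDED or FAILED instead of KILLED.
           */
          public static ApplicationReport killAndWait(YarnClient client, ApplicationId appId)
              throws Exception {
            client.killApplication(appId);
            while (true) {
              ApplicationReport report = client.getApplicationReport(appId);
              if (TERMINAL.contains(report.getYarnApplicationState())) {
                return report; // do not assert that the state is KILLED here
              }
              Thread.sleep(1000L);
            }
          }
        }

        Even with that tolerance on the client side, the RM's KILLED/KILLED report and the history server's SUCCEEDED record can still disagree, which is the inconsistency this issue describes.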

        Jason Lowe added a comment -

        The ApplicationMaster log will have this exception when it shuts down:

        2012-05-08 16:19:34,666 ERROR [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Exception while unregistering 
        RemoteTrace: 
         at LocalTrace: 
        	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: RemoteTrace: 
         at LocalTrace: 
        	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Application doesn't exist in cache appattempt_1336511902223_0001_000001
        	at org.apache.hadoop.yarn.factories.impl.pb.YarnRemoteExceptionFactoryPBImpl.createYarnRemoteException(YarnRemoteExceptionFactoryPBImpl.java:39)
        	at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:47)
        	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:222)
        	at org.apache.hadoop.yarn.api.impl.pb.service.AMRMProtocolPBServiceImpl.finishApplicationMaster(AMRMProtocolPBServiceImpl.java:69)
        	at org.apache.hadoop.yarn.proto.AMRMProtocol$AMRMProtocolService$2.callBlockingMethod(AMRMProtocol.java:85)
        	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
        	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
        	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
        	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at javax.security.auth.Subject.doAs(Subject.java:396)
        	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
        
        	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
        	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
        	at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:124)
        	at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.finishApplicationMaster(AMRMProtocolPBClientImpl.java:85)
        	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.unregister(RMCommunicator.java:190)
        	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.stop(RMCommunicator.java:216)
        	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.stop(RMContainerAllocator.java:226)
        	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.stop(MRAppMaster.java:668)
        	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
        	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
        	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1036)
        	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
        

        I believe the following sequence of events leads to the problem:

        1. AM sees all tasks complete, changes internal job state to SUCCEEDED, triggers job finished event (which currently waits 5 seconds and enlarges the race window)
        2. kill client command connects to the AM, sees that the job state != RUNNING, then tells RM to kill application
        3. RM fields kill request, transitions app state from RUNNING to KILLED/KILLED and unregisters app. Leaves tracking URL unchanged (probably should null it out as it does for AMs that exit unexpectedly)
        4. AM starts shutdown, tries to unregister with RM, and RM claims it doesn't know about the app (because it already unregistered it internally)
        5. HS reports the app status as SUCCEEDED because the jhist file shows the job completed successfully.

        If the RM fields an unregister request for an application that was killed, we may want to consider updating the application's status and tracking URL based on the unregister request since it is likely to be more accurate (e.g., SUCCEEDED instead of KILLED, with the tracking URL pointing to the history server).
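
        For reference, a minimal sketch of the AM-side unregister is below, written against the Hadoop 2.x AMRMClient API (the 0.23 code in the stack trace above goes through AMRMProtocol.finishApplicationMaster directly); the tracking URL is a placeholder and the snippet only makes sense when run inside an application's AM container. It shows that the unregister request already carries the final status and tracking URL the RM could adopt for an application it has marked KILLED:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
        import org.apache.hadoop.yarn.client.api.AMRMClient;
        import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
        import org.apache.hadoop.yarn.conf.YarnConfiguration;

        public class AmUnregisterSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = new YarnConfiguration();

            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(conf);
            rmClient.start();

            // Register with the RM; running the job and releasing containers is elided.
            rmClient.registerApplicationMaster("", 0, "");

            // The unregister request carries the AM's view of the outcome: a final
            // status, a diagnostics message, and a tracking URL (typically the job
            // history server page once the AM exits). Per the proposal above, the RM
            // could use this information to replace a KILLED status recorded while
            // the kill raced with AM completion.
            rmClient.unregisterApplicationMaster(
                FinalApplicationStatus.SUCCEEDED,
                "Job completed successfully",
                "http://jhs-host:19888/jobhistory/job/<job-id>"); // placeholder URL

            rmClient.stop();
          }
        }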


          People

          • Assignee: Unassigned
          • Reporter: Jason Lowe
          • Votes: 0
          • Watchers: 3
