Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-4428

A failed job is not available under job history if the job is killed right around the time job is notified as failed

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: None
    • Labels:
      None

      Description

      We have observed this issue consistently running hadoop CDH4 version (based upon 2.0 alpha release):

      In case our hadoop client code gets a notification for a completed job ( using RunningJob object job, with (job.isComplete() && job.isSuccessful()==false)
      the hadoop client code does an unconditional job.killJob() to terminate the job.

      With earlier hadoop versions (verified on hadoop 0.20.2 version), we still have full access to job logs afterwards through hadoop console. However, when using MapReduceV2, the failed hadoop job no longer shows up under jobhistory server. Also, the tracking URL of the job still points to the non-existent Application master http port.

      Once we removed the call to job.killJob() for failed jobs from our hadoop client code, we were able to access the job in job history with mapreduce V2 as well. Therefore this appears to be a race condition in the job management wrt. job history for failed jobs.

      We do have the application master and node manager logs collected for this scenario if that'll help isolate the problem and the fix better.

      1. am_failed_counter_limits.txt
        2.40 MB
        Rahul Jain
      2. resrcmgr_bad.txt
        202 kB
        Rahul Jain
      3. appMaster_good.txt
        578 kB
        Rahul Jain
      4. appMaster_bad.txt
        512 kB
        Rahul Jain

        Issue Links

          Activity

          Hide
          Rahul Jain added a comment -

          MAPREDUCE-4443 created to track the AM reliability for counters limit exceeded issue.

          Show
          Rahul Jain added a comment - MAPREDUCE-4443 created to track the AM reliability for counters limit exceeded issue.
          Hide
          Rahul Jain added a comment -

          OK, will create a separate one for the counter limit exceeded issue.

          BTW, I did open MAPREDUCE-4442 for a related issue: we are unable to access job counters for the period AM is possibly shutting down as well, it may be a good idea to consider that issue in the final fix.

          Show
          Rahul Jain added a comment - OK, will create a separate one for the counter limit exceeded issue. BTW, I did open MAPREDUCE-4442 for a related issue: we are unable to access job counters for the period AM is possibly shutting down as well, it may be a good idea to consider that issue in the final fix.
          Hide
          Robert Joseph Evans added a comment -

          OK I think I have an idea, but Sid I would like your opinion on this. If you want to pull in Arun on this too I am happy for his opinion too.

          What if we augment the ContainerLaunchContext to have something like a cleanup on kill boolean and a cleanup on bad exit boolean. If cleanup on kill is set and the container is forcibly killed or if cleanup on bad exit is set and the container exits with a non-zero status, the NM would try to rerun the container, but with an environment variable set saying that it is being rerun for cleanup. The NM would give it a configurable amount of time, say 20 seconds, to do the cleanup, and then if it has not already exited it will shoot it.

          The RM would need a new variable when the AM is submitted to indicate that this should happen, and then if that is set it would turn on cleanup on kill for the AM when it is launched, and it would turn on cleanup on bad exit, when it is launching the AM for the last retry.

          The MR AM would have to be modified to look for the environment variable and only do cleanup if it sees it. The MR client would have to be modified to set this boolean variable.

          Show
          Robert Joseph Evans added a comment - OK I think I have an idea, but Sid I would like your opinion on this. If you want to pull in Arun on this too I am happy for his opinion too. What if we augment the ContainerLaunchContext to have something like a cleanup on kill boolean and a cleanup on bad exit boolean. If cleanup on kill is set and the container is forcibly killed or if cleanup on bad exit is set and the container exits with a non-zero status, the NM would try to rerun the container, but with an environment variable set saying that it is being rerun for cleanup. The NM would give it a configurable amount of time, say 20 seconds, to do the cleanup, and then if it has not already exited it will shoot it. The RM would need a new variable when the AM is submitted to indicate that this should happen, and then if that is set it would turn on cleanup on kill for the AM when it is launched, and it would turn on cleanup on bad exit, when it is launching the AM for the last retry. The MR AM would have to be modified to look for the environment variable and only do cleanup if it sees it. The MR client would have to be modified to set this boolean variable.
          Hide
          Robert Joseph Evans added a comment -

          You should not need to restart all of yarn to update the counters max. You should be able to set it on a per application basis assuming that you do not have it marked as final in mapred-site.xml, although you may get similar errors in the History Server if you do that.

          Could you please file a separate JIRA for the counter's limit issue. We should have a cleaner way to deal with the counter's limit being exceeded.

          I agree with you that this is a fix that needs to happen, Sadly it is just not a simple fix. I will talk with some co-workers about this to see that we can come up with.

          Show
          Robert Joseph Evans added a comment - You should not need to restart all of yarn to update the counters max. You should be able to set it on a per application basis assuming that you do not have it marked as final in mapred-site.xml, although you may get similar errors in the History Server if you do that. Could you please file a separate JIRA for the counter's limit issue. We should have a cleaner way to deal with the counter's limit being exceeded. I agree with you that this is a fix that needs to happen, Sadly it is just not a simple fix. I will talk with some co-workers about this to see that we can come up with.
          Hide
          Rahul Jain added a comment -

          Am attaching another case (not directly related to the ticket) that illustrates the point on more reliable way of getting job history.

          Our map-reduce jobs generate counters beyond the standard 120 limit imposed by yarn. We increased the counters to 1000 but the jobs went over that limit as well.

          As you can see in attached file am_failed_counter_limits.txt, the AM encountered exceptions trying to handle the large # of counters; at the end no job history was accessible either.

          Note that we did increase counters for the job submitter task to 2000, however the only way to have AM use a higher # of counters is to change the global yarn settings and restart yarn.

          Show
          Rahul Jain added a comment - Am attaching another case (not directly related to the ticket) that illustrates the point on more reliable way of getting job history. Our map-reduce jobs generate counters beyond the standard 120 limit imposed by yarn. We increased the counters to 1000 but the jobs went over that limit as well. As you can see in attached file am_failed_counter_limits.txt, the AM encountered exceptions trying to handle the large # of counters; at the end no job history was accessible either. Note that we did increase counters for the job submitter task to 2000, however the only way to have AM use a higher # of counters is to change the global yarn settings and restart yarn.
          Hide
          Rahul Jain added a comment -

          Robert,

          To make the user perspective clear here:

          This grid is a single user managed grid, no other process was running at the time and no one else tried to do kill on the job here.

          The sequence is:

          a) The job creator application submitted the job to hadoop grid.

          b) Max retry count was set to 1 for both mappers and reducers; so as soon as a task failed, the system (AM?) decided to kill all other tasks.

          c) The submitter application is waiting in a sleep loop, waking up every 1 second to check the status of the task
          calling: JobClient.getJob()

          d) When the above condition happens, the application receives the running job status as completed , failed (isSuccessful()=false, isComplete()=true on RunningJob object)

          e) The application issues a killJob() on the running job object at this time

          f) As a result, nothing is accessible in job history from hadoop console, even the AM container logs cannot be accessed.

          Removing (e) from the above sequence make logs accessible again. As I mentioned, with older version of map-reduce, we never encountered the issues of logs getting lost. I believe we need to handle the case of user initiated 'KILL' of the job better in MapReduceV2; 90% of the time we look at map-reduce logs only for failed and killed jobs; so this functionality should work reliably as much as possible.

          Show
          Rahul Jain added a comment - Robert, To make the user perspective clear here: This grid is a single user managed grid, no other process was running at the time and no one else tried to do kill on the job here. The sequence is: a) The job creator application submitted the job to hadoop grid. b) Max retry count was set to 1 for both mappers and reducers; so as soon as a task failed, the system (AM?) decided to kill all other tasks. c) The submitter application is waiting in a sleep loop, waking up every 1 second to check the status of the task calling: JobClient.getJob() d) When the above condition happens, the application receives the running job status as completed , failed (isSuccessful()=false, isComplete()=true on RunningJob object) e) The application issues a killJob() on the running job object at this time f) As a result, nothing is accessible in job history from hadoop console, even the AM container logs cannot be accessed. Removing (e) from the above sequence make logs accessible again. As I mentioned, with older version of map-reduce, we never encountered the issues of logs getting lost. I believe we need to handle the case of user initiated 'KILL' of the job better in MapReduceV2; 90% of the time we look at map-reduce logs only for failed and killed jobs; so this functionality should work reliably as much as possible.
          Hide
          Robert Joseph Evans added a comment -

          Just as an FYI if the user does an mapreduce job -kill and the AM reports back that it is no running, then it will ask the RM to kill the job instead of asking the AM to commit suicide. So if you lost two races then it could be caused by someone running mapreduce job -kill.

          Show
          Robert Joseph Evans added a comment - Just as an FYI if the user does an mapreduce job -kill and the AM reports back that it is no running, then it will ask the RM to kill the job instead of asking the AM to commit suicide. So if you lost two races then it could be caused by someone running mapreduce job -kill.
          Hide
          Robert Joseph Evans added a comment -

          It looks like someone killed your application

          2012-07-11 03:04:28,481 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop	IP=10.202.50.180	OPERATION=Kill Application Request	TARGET=ClientRMService	RESULT=SUCCESS	APPID=application_1341894680756_0017
          2012-07-11 03:04:28,481 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1341894680756_0017 State change from RUNNING to KILLED
          

          This caused the RM to forget about the application, and it happened just as your application was about to fail, so the AM asked to unregister, but the RM said I don't know who you are, when in reality it should have said didn't I try to kill you? I don't know who tried to kill this application or really why it went to the RM instead of the AM. The issue here is that normally for mapreduce job -kill the client is first going to request that the AM commit suicide. That way it can put the logs where they are supposed to be, before it tries to ask the RM to kill the application. If you do a yarn application kill there is no guarantee what the AM will or will not be able to do before it is killed. If the AM had been slower the NodeManager would have just sent a kill -9 to the AM, and then it would not have had any chance at putting the logs in the correct place. You should probably look at who was on 10.202.50.180 and what they were doing that might have asked the RM to kill this AM.

          Fixing this in the general case so that the job history logs always are copied to the correct place is going to be difficult. This is because we have to insert something that will always run after the AM has exited, it is probably best to make it so it will only run after the AM has exited badly, even for a kill. It is possible, just not that simple of a fix. It is even more difficult if we want to handle the case where the node appears to go down just as the AM is crashing. there are lots of corner cases that potentially make this very difficult to get right.

          Show
          Robert Joseph Evans added a comment - It looks like someone killed your application 2012-07-11 03:04:28,481 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop IP=10.202.50.180 OPERATION=Kill Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1341894680756_0017 2012-07-11 03:04:28,481 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1341894680756_0017 State change from RUNNING to KILLED This caused the RM to forget about the application, and it happened just as your application was about to fail, so the AM asked to unregister, but the RM said I don't know who you are, when in reality it should have said didn't I try to kill you? I don't know who tried to kill this application or really why it went to the RM instead of the AM. The issue here is that normally for mapreduce job -kill the client is first going to request that the AM commit suicide. That way it can put the logs where they are supposed to be, before it tries to ask the RM to kill the application. If you do a yarn application kill there is no guarantee what the AM will or will not be able to do before it is killed. If the AM had been slower the NodeManager would have just sent a kill -9 to the AM, and then it would not have had any chance at putting the logs in the correct place. You should probably look at who was on 10.202.50.180 and what they were doing that might have asked the RM to kill this AM. Fixing this in the general case so that the job history logs always are copied to the correct place is going to be difficult. This is because we have to insert something that will always run after the AM has exited, it is probably best to make it so it will only run after the AM has exited badly, even for a kill. It is possible, just not that simple of a fix. It is even more difficult if we want to handle the case where the node appears to go down just as the AM is crashing. there are lots of corner cases that potentially make this very difficult to get right.
          Hide
          Rahul Jain added a comment -

          Here are the resource manager logs appended for failure case. Note that resource manager was not restarted any time; and the same stack trace can be found on the resource manager when the application attempts to unregister

          org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl:     Application doesn't exist in cache appattempt_1341894680756_0017_000001....
          
          Show
          Rahul Jain added a comment - Here are the resource manager logs appended for failure case. Note that resource manager was not restarted any time; and the same stack trace can be found on the resource manager when the application attempts to unregister org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Application doesn't exist in cache appattempt_1341894680756_0017_000001....
          Hide
          Robert Joseph Evans added a comment -

          From the logs it looks like you had many tasks failing because of the following exception.

          2012-07-11 03:04:27,122 FATAL [IPC Server handler 4 on 37900] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1341894680756_0017_m_000012_0 - exited : java.io.IOException: Exception occured validating bulk loader output
          	at com.carrieriq.m2m.platform.mmp3.output.db.BulkLoaderOutputWriter.doFinish(BulkLoaderOutputWriter.java:408)
          	at com.carrieriq.m2m.platform.mmp3.output.db.BulkLoaderOutputWriter.finish(BulkLoaderOutputWriter.java:331)
          	at com.carrieriq.m2m.platform.mmp3.output.db.DfsAndDbOutputWriter.finish(DfsAndDbOutputWriter.java:89)
          	at com.carrieriq.m2m.platform.mmp3.output.fact2db.LoadFactsToDatamartMapper.onClose(LoadFactsToDatamartMapper.java:769)
          	at com.carrieriq.m2m.platform.util.hadoop.AbstractReportingMapper.close(AbstractReportingMapper.java:94)
          	at org.apache.hadoop.mapred.lib.Chain.close(Chain.java:283)
          	at org.apache.hadoop.mapred.lib.ChainMapper.close(ChainMapper.java:179)
          	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
          	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:399)
          	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
          	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:396)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
          	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
          Caused by: java.io.IOException: Error(s) found in bulk loader output files.
          	at com.carrieriq.m2m.platform.mmp3.output.db.BulkLoaderOutputWriter.doFinish(BulkLoaderOutputWriter.java:405)
          	... 14 more
          

          This caused the job itself to fail, and as the AM was trying to tell the RM that it was exiting the RM said who are you, I've never heard of appattempt_1341894680756_0017_000001. It looks almost like the RM somehow was restarted in the middle of the job running, or that it somehow forgot about this particular Application. Having the RM logs for around the time this was running would help trace down what happened in the RM.

          2012-07-11 03:04:28,574 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job diagnostics to Task failed task_1341894680756_0017_m_000014
          Job failed as tasks failed. failedMaps:1 failedReduces:0
          
          2012-07-11 03:04:28,575 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: History url is sjc1-ciq-ibm-grid06.carrieriq.com:19888/jobhistory/job/job_1341894680756_0017
          2012-07-11 03:04:28,580 ERROR [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Exception while unregistering 
          RemoteTrace: 
           at LocalTrace: 
          	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: RemoteTrace: 
           at LocalTrace: 
          	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Application doesn't exist in cache appattempt_1341894680756_0017_000001
          	at org.apache.hadoop.yarn.factories.impl.pb.YarnRemoteExceptionFactoryPBImpl.createYarnRemoteException(YarnRemoteExceptionFactoryPBImpl.java:39)
          	at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:47)
          	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:222)
          	at org.apache.hadoop.yarn.api.impl.pb.service.AMRMProtocolPBServiceImpl.finishApplicationMaster(AMRMProtocolPBServiceImpl.java:69)
          	at org.apache.hadoop.yarn.proto.AMRMProtocol$AMRMProtocolService$2.callBlockingMethod(AMRMProtocol.java:85)
          	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
          	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:396)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
          	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
          
          	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
          	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
          	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
          	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
          	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
          	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
          	at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:123)
          	at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.finishApplicationMaster(AMRMProtocolPBClientImpl.java:85)
          	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.unregister(RMCommunicator.java:190)
          	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.stop(RMCommunicator.java:216)
          	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.stop(RMContainerAllocator.java:226)
          	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.stop(MRAppMaster.java:668)
          	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
          	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
          	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1036)
          	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
          
          Show
          Robert Joseph Evans added a comment - From the logs it looks like you had many tasks failing because of the following exception. 2012-07-11 03:04:27,122 FATAL [IPC Server handler 4 on 37900] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1341894680756_0017_m_000012_0 - exited : java.io.IOException: Exception occured validating bulk loader output at com.carrieriq.m2m.platform.mmp3.output.db.BulkLoaderOutputWriter.doFinish(BulkLoaderOutputWriter.java:408) at com.carrieriq.m2m.platform.mmp3.output.db.BulkLoaderOutputWriter.finish(BulkLoaderOutputWriter.java:331) at com.carrieriq.m2m.platform.mmp3.output.db.DfsAndDbOutputWriter.finish(DfsAndDbOutputWriter.java:89) at com.carrieriq.m2m.platform.mmp3.output.fact2db.LoadFactsToDatamartMapper.onClose(LoadFactsToDatamartMapper.java:769) at com.carrieriq.m2m.platform.util.hadoop.AbstractReportingMapper.close(AbstractReportingMapper.java:94) at org.apache.hadoop.mapred.lib.Chain.close(Chain.java:283) at org.apache.hadoop.mapred.lib.ChainMapper.close(ChainMapper.java:179) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:399) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147) Caused by: java.io.IOException: Error(s) found in bulk loader output files. at com.carrieriq.m2m.platform.mmp3.output.db.BulkLoaderOutputWriter.doFinish(BulkLoaderOutputWriter.java:405) ... 14 more This caused the job itself to fail, and as the AM was trying to tell the RM that it was exiting the RM said who are you, I've never heard of appattempt_1341894680756_0017_000001. It looks almost like the RM somehow was restarted in the middle of the job running, or that it somehow forgot about this particular Application. Having the RM logs for around the time this was running would help trace down what happened in the RM. 2012-07-11 03:04:28,574 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job diagnostics to Task failed task_1341894680756_0017_m_000014 Job failed as tasks failed. failedMaps:1 failedReduces:0 2012-07-11 03:04:28,575 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: History url is sjc1-ciq-ibm-grid06.carrieriq.com:19888/jobhistory/job/job_1341894680756_0017 2012-07-11 03:04:28,580 ERROR [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Exception while unregistering RemoteTrace: at LocalTrace: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: RemoteTrace: at LocalTrace: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Application doesn't exist in cache appattempt_1341894680756_0017_000001 at org.apache.hadoop.yarn.factories.impl.pb.YarnRemoteExceptionFactoryPBImpl.createYarnRemoteException(YarnRemoteExceptionFactoryPBImpl.java:39) at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:47) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:222) at org.apache.hadoop.yarn.api.impl.pb.service.AMRMProtocolPBServiceImpl.finishApplicationMaster(AMRMProtocolPBServiceImpl.java:69) at org.apache.hadoop.yarn.proto.AMRMProtocol$AMRMProtocolService$2.callBlockingMethod(AMRMProtocol.java:85) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57) at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:123) at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.finishApplicationMaster(AMRMProtocolPBClientImpl.java:85) at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.unregister(RMCommunicator.java:190) at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.stop(RMCommunicator.java:216) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.stop(RMContainerAllocator.java:226) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.stop(MRAppMaster.java:668) at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99) at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1036) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
          Hide
          Rahul Jain added a comment -

          Included both a good case from web interface (appMaster_good.txt) where no kill was done on the job;
          And the bad case logs collected from hdfs sifting (appMaster_bad.txt)

          Show
          Rahul Jain added a comment - Included both a good case from web interface (appMaster_good.txt) where no kill was done on the job; And the bad case logs collected from hdfs sifting (appMaster_bad.txt)
          Hide
          Rahul Jain added a comment -

          Robert,

          The link to AM logs for failed + killed case does not work , instead the error goes like:

          Failed while trying to construct the redirect url to the log server. Log Server url may not be configured
          Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.
          

          This is what makes the problem diagnostics quite hard for users.

          However I did sifting through hdfs (log aggregation was enabled) to search for AM container logs; I believe I have the right one which I will upload next.

          Show
          Rahul Jain added a comment - Robert, The link to AM logs for failed + killed case does not work , instead the error goes like: Failed while trying to construct the redirect url to the log server. Log Server url may not be configured Unknown container. Container either has not started or has already completed or doesn't belong to this node at all. This is what makes the problem diagnostics quite hard for users. However I did sifting through hdfs (log aggregation was enabled) to search for AM container logs; I believe I have the right one which I will upload next.
          Hide
          Robert Joseph Evans added a comment -

          This can be caused by a number of things. Most likely it is caused by a bug in the AM that is making it crash before it can move the job history log to where the history server will pick it up. If that is the case you need to get access to the AM logs so that we can look at what is happening there. I am not familiar with what CDH4 has in the UI, but if you click on the application id in the main RM web page you should see a link to the AM's logs.

          Show
          Robert Joseph Evans added a comment - This can be caused by a number of things. Most likely it is caused by a bug in the AM that is making it crash before it can move the job history log to where the history server will pick it up. If that is the case you need to get access to the AM logs so that we can look at what is happening there. I am not familiar with what CDH4 has in the UI, but if you click on the application id in the main RM web page you should see a link to the AM's logs.

            People

            • Assignee:
              Robert Joseph Evans
              Reporter:
              Rahul Jain
            • Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:

                Development