Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.7.0
-
Reviewed
Description
Using yarn logs command for a long running application which has been running longer than the configured timeline service ttl yarn.timeline-service.ttl-ms fails with the following exception.
Exception in thread "main" org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: The entity for application application_152347939332_00001 doesn't exist in the timeline store at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplication(ApplicationHistoryManagerOnTimelineStore.java:670) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainers(ApplicationHistoryManagerOnTimelineStore.java:219) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getContainers(ApplicationHistoryClientService.java:211) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationHistoryProtocolPBServiceImpl.getContainers(ApplicationHistoryProtocolPBServiceImpl.java:172) at org.apache.hadoop.yarn.proto.ApplicationHistoryProtocol$ApplicationHistoryProtocolService$2.callBlockingMethod(ApplicationHistoryProtocol.java:201) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2309) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getContainers(ApplicationHistoryProtocolPBClientImpl.java:183) at org.apache.hadoop.yarn.client.api.impl.AHSClientImpl.getContainers(AHSClientImpl.java:151) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getContainers(YarnClientImpl.java:720) at org.apache.hadoop.yarn.client.cli.LogsCLI.getContainerReportsFromRunningApplication(LogsCLI.java:1089) at org.apache.hadoop.yarn.client.cli.LogsCLI.getContainersLogRequestForRunningApplication(LogsCLI.java:1064) at org.apache.hadoop.yarn.client.cli.LogsCLI.fetchApplicationLogs(LogsCLI.java:976) at org.apache.hadoop.yarn.client.cli.LogsCLI.runCommand(LogsCLI.java:300) at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:107) at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:327)
AHSClientImpl#getContainers failed because the application entity got deleted as it exceeded yarn.timeline-service.ttl-ms .
I checked in the debug logs that ClientRMService#getContainers is successful since the application is still running and is present in the ResourceManager.
We seem to be only catching IOException here. Ideally we should catch YarnException also in this case so that the response from RM is at least returned if the application is found in RM. Attaching a patch for the same.