Description
In some cases, a container launch may fail without the container being properly returned to the RM.
This can happen when the AM fails while preparing the container launch context, before the context is sent to the NM. (Once the container launch context has been sent to the NM, the NM reports the failed container to the RM.)
An example exception:
java.io.FileNotFoundException: File does not exist: hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
    at org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
    at org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
    at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
    at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Even after preparing the container launch context has failed, the AM still tries to monitor the container's readiness:
2018-07-17 18:42:57,518 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: primary-worker-0: IP is not available yet" ...
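One possible way to handle this failure is sketched below. It is an illustration only, not the actual YARN service code: the interfaces `LaunchContextBuilder`, `NodeManagerClient`, `ResourceManagerClient` and `ReadinessMonitor`, and the `launch` method, are hypothetical stand-ins introduced for this example. The idea is that if building the launch context throws before anything reaches the NM, the AM itself releases the container to the RM and stops the readiness check for that instance.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LaunchFailureSketch {

    // All four interfaces are hypothetical stand-ins, not real YARN types.
    interface LaunchContextBuilder {
        // May touch the file system (e.g. HDFS) and therefore fail with an IOException.
        String build(String containerId) throws IOException;
    }

    interface NodeManagerClient {
        void startContainer(String containerId, String launchContext);
    }

    interface ResourceManagerClient {
        void releaseContainer(String containerId);
    }

    interface ReadinessMonitor {
        void stopMonitoring(String instanceName);
    }

    /**
     * Tries to launch a container. Returns true if the launch context was
     * built and handed to the NM, false if preparation failed and the
     * container was released back to the RM instead.
     */
    static boolean launch(String containerId, String instanceName,
                          LaunchContextBuilder builder,
                          NodeManagerClient nm,
                          ResourceManagerClient rm,
                          ReadinessMonitor monitor) {
        String launchContext;
        try {
            launchContext = builder.build(containerId);
        } catch (IOException e) {
            // The NM never saw this container, so it cannot report the failure.
            // The AM has to return the container to the RM itself.
            rm.releaseContainer(containerId);
            // No process will ever start, so a readiness probe can never succeed.
            monitor.stopMonitoring(instanceName);
            return false;
        }
        nm.startContainer(containerId, launchContext);
        return true;
    }

    public static void main(String[] args) {
        List<String> released = new ArrayList<>();
        List<String> unmonitored = new ArrayList<>();

        // Simulate the failure from the report: the launch script is missing.
        boolean launched = launch("container_01", "primary-worker-0",
            id -> { throw new FileNotFoundException("File does not exist"); },
            (id, ctx) -> { },
            released::add,
            unmonitored::add);

        System.out.println("launched=" + launched
            + " released=" + released
            + " unmonitored=" + unmonitored);
    }
}
```

Whether the instance should then be retried or the component marked as failed is a separate policy decision that this sketch does not address.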