Hadoop YARN / YARN-8545

YARN native service should return container if launch failed

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.0, 3.1.2
    • Component/s: None
    • Labels:
      None

      Description

      In some cases a container launch fails but the container is not returned to the RM.

      This can happen when the AM fails while preparing the container launch context and therefore never sends it to the NM. (Once the container launch context has been sent to the NM, the NM reports a failed container to the RM.)

      An example of such an exception:

      java.io.FileNotFoundException: File does not exist: hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
      	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
      	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
      	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
      	at org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
      	at org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
      	at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
      	at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)

      Even after preparation of the container launch context has failed, the AM still tries to monitor the container's readiness:

      2018-07-17 18:42:57,518 [pool-7-thread-1] INFO  monitor.ServiceMonitor - Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: primary-worker-0: IP is not available yet"
      
      ...
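
      The requested behavior can be pictured with a small, self-contained Java sketch. It is not the actual patch: LaunchContextBuilder, Cluster, ContainerLauncher and RecordingCluster are hypothetical stand-ins invented for illustration, and container IDs are plain strings. The idea is only that a failure while building the launch context, before anything reaches the NM, should hand the container back to the RM. In a real AM the release would presumably go through the AM-RM client (for example AMRMClientAsync.releaseAssignedContainer), which is an assumption here.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LaunchFailureSketch {

    /** Hypothetical stand-in for the step that prepares the launch context. */
    interface LaunchContextBuilder {
        String build(String containerId) throws IOException;
    }

    /** Hypothetical stand-in for the AM's connections to the NM and the RM. */
    interface Cluster {
        void startOnNodeManager(String containerId, String launchContext);
        void releaseToResourceManager(String containerId);
    }

    /** Launches one container, releasing it if preparation fails. */
    static final class ContainerLauncher implements Runnable {
        private final String containerId;
        private final LaunchContextBuilder builder;
        private final Cluster cluster;

        ContainerLauncher(String containerId, LaunchContextBuilder builder, Cluster cluster) {
            this.containerId = containerId;
            this.builder = builder;
            this.cluster = cluster;
        }

        @Override
        public void run() {
            String launchContext;
            try {
                launchContext = builder.build(containerId);
            } catch (IOException e) {
                // Nothing was sent to the NM, so the NM cannot report this
                // failure. The AM has to return the container itself.
                cluster.releaseToResourceManager(containerId);
                return;
            }
            cluster.startOnNodeManager(containerId, launchContext);
        }
    }

    /** Records calls so the outcome can be inspected. */
    static final class RecordingCluster implements Cluster {
        final List<String> started = new ArrayList<>();
        final List<String> released = new ArrayList<>();

        @Override
        public void startOnNodeManager(String containerId, String launchContext) {
            started.add(containerId);
        }

        @Override
        public void releaseToResourceManager(String containerId) {
            released.add(containerId);
        }
    }

    public static void main(String[] args) {
        RecordingCluster cluster = new RecordingCluster();

        // Preparation fails, as with the missing run script in the trace above.
        new ContainerLauncher("container_01", id -> {
            throw new FileNotFoundException("File does not exist");
        }, cluster).run();

        // Preparation succeeds, so the container is started normally.
        new ContainerLauncher("container_02", id -> "context-for-" + id, cluster).run();

        System.out.println("released=" + cluster.released + " started=" + cluster.started);
        // prints: released=[container_01] started=[container_02]
    }
}
```

      In this sketch a released container is never started. A real implementation would likely also stop readiness monitoring for that component instance, which the sketch does not model.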

            People

            • Assignee: Chandni Singh (csingh)
            • Reporter: Wangda Tan (leftnoteasy)