YARN native service should return container if launch failed

Details

    • Type: Task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.0, 3.1.2
    • Component/s: None
    • Labels: None

    Description

      In some cases, a container launch may fail without the container being properly returned to the RM.

      This can happen when the AM fails while preparing the container launch context, before it has sent that context to the NM. (Once the container launch context has been sent to the NM, the NM reports the failed container to the RM.)

      An example exception:

      java.io.FileNotFoundException: File does not exist: hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
      	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
      	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
      	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
      	at org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
      	at org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
      	at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
      	at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)

      Even after preparation of the container launch context has failed, the AM still tries to monitor the container's readiness:

      2018-07-17 18:42:57,518 [pool-7-thread-1] INFO  monitor.ServiceMonitor - Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: primary-worker-0: IP is not available yet"
      
      ...
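
      The behavior asked for here can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the attached patch: the class LaunchFailureSketch and the interfaces LaunchContextBuilder, NodeManagerClient and ResourceManagerClient are hypothetical stand-ins for the AM's real collaborators. In a real AM the release step would presumably go through something like AMRMClientAsync#releaseAssignedContainer, but that mapping is an assumption.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LaunchFailureSketch {

    /** Hypothetical stand-in for the code that prepares the launch context. */
    interface LaunchContextBuilder {
        String build(String containerId) throws IOException;
    }

    /** Hypothetical stand-in for the AM's client to the NM. */
    interface NodeManagerClient {
        void startContainer(String containerId, String launchContext);
    }

    /** Hypothetical stand-in for the AM's client to the RM. */
    interface ResourceManagerClient {
        void releaseContainer(String containerId);
    }

    /**
     * Tries to launch a container. Returns true if the launch context was
     * sent to the NM, false if preparation failed and the container was
     * handed back to the RM instead.
     */
    static boolean launch(String containerId,
                          LaunchContextBuilder builder,
                          NodeManagerClient nm,
                          ResourceManagerClient rm) {
        String launchContext;
        try {
            launchContext = builder.build(containerId);
        } catch (IOException e) {
            // The NM never saw this container, so it cannot report the
            // failure to the RM. The AM has to return the container itself.
            rm.releaseContainer(containerId);
            return false;
        }
        // From here on, the NM is responsible for reporting launch failures.
        nm.startContainer(containerId, launchContext);
        return true;
    }
}
```

      A caller could use the boolean result to skip readiness monitoring for a container that was never handed to the NM, which would avoid the repeated "IP is not available yet" probe failures shown above.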

      Attachments

        1. YARN-8545.001.patch
          26 kB
          Chandni Singh

          People

            csingh Chandni Singh
            leftnoteasy Wangda Tan