[YARN-8545] YARN native service should return container if launch failed - ASF JIRA


Details

Type: Task
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.2.0, 3.1.2
Component/s: None
Labels: None
Target Version/s: 3.2.0, 3.1.2

Description

In some cases a container launch fails but the container is not returned to the RM.

This can happen when the AM fails while preparing the container launch context, before the context has been sent to the NM. (Once the container launch context is sent to the NM, the NM will report the failed container to the RM.)

Example exception:

java.io.FileNotFoundException: File does not exist: hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
	at org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
	at org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
	at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
	at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
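
The failure mode above can be illustrated with a minimal Java sketch. It is not the code from the attached patch: `LaunchFailureSketch`, `ContainerReleaser`, `LaunchTask`, and the container IDs are hypothetical stand-ins, and the real AM would go through its RM client rather than a plain callback. The point is only that when preparing the launch context throws, the NM never learns about the container, so the launcher itself has to hand it back.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these types are the real YARN service classes.
class LaunchFailureSketch {

    /** Stand-in for whatever the AM uses to return a container to the RM. */
    interface ContainerReleaser {
        void release(String containerId);
    }

    /** Launch task: prepare the launch context, then "send" it to the NM. */
    static final class LaunchTask implements Runnable {
        private final String containerId;
        private final Callable<String> prepareContext; // may throw, e.g. FileNotFoundException
        private final ContainerReleaser releaser;
        private final List<String> sentToNm;           // records what reached the NM

        LaunchTask(String containerId, Callable<String> prepareContext,
                   ContainerReleaser releaser, List<String> sentToNm) {
            this.containerId = containerId;
            this.prepareContext = prepareContext;
            this.releaser = releaser;
            this.sentToNm = sentToNm;
        }

        @Override
        public void run() {
            String context;
            try {
                context = prepareContext.call();
            } catch (Exception e) {
                // The NM never saw this container, so it cannot report the
                // failure to the RM. Release the container from the AM side.
                releaser.release(containerId);
                return;
            }
            sentToNm.add(containerId + ":" + context);
        }
    }

    public static void main(String[] args) {
        List<String> released = new ArrayList<>();
        List<String> sent = new ArrayList<>();
        // First container: preparing the context fails before anything is sent.
        new LaunchTask("container_01",
                () -> { throw new FileNotFoundException("run-PRIMARY_WORKER.sh"); },
                released::add, sent).run();
        // Second container: preparation succeeds and the context is sent.
        new LaunchTask("container_02", () -> "ctx", released::add, sent).run();
        System.out.println("released=" + released + " sent=" + sent);
    }
}
```

In this sketch only the container whose preparation failed is released; the one that was sent on is left for the NM to report.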

Even after preparing the container launch context has failed, the AM still tries to monitor the container's readiness:

2018-07-17 18:42:57,518 [pool-7-thread-1] INFO  monitor.ServiceMonitor - Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: primary-worker-0: IP is not available yet"
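
One way to avoid readiness probes like the one logged above is to track a launch state per component instance and only probe instances that were actually launched. The Java sketch below is an illustration under that assumption, not a description of how `ServiceMonitor` works or of what the patch changes; `ReadinessMonitorSketch`, its `State` enum, and the instance name `worker-0` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the real ServiceMonitor.
class ReadinessMonitorSketch {

    enum State { LAUNCH_PENDING, LAUNCHED, LAUNCH_FAILED }

    // Insertion-ordered so probe order is deterministic.
    private final Map<String, State> instances = new LinkedHashMap<>();

    void track(String name) { instances.put(name, State.LAUNCH_PENDING); }

    void markLaunched(String name) { instances.put(name, State.LAUNCHED); }

    void markLaunchFailed(String name) { instances.put(name, State.LAUNCH_FAILED); }

    /** Names that a monitor pass would run readiness probes against. */
    List<String> instancesToProbe() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, State> e : instances.entrySet()) {
            // Skip instances that never reached the NM; they can never get an IP.
            if (e.getValue() == State.LAUNCHED) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ReadinessMonitorSketch monitor = new ReadinessMonitorSketch();
        monitor.track("primary-worker-0");
        monitor.track("worker-0");
        monitor.markLaunchFailed("primary-worker-0"); // launch context preparation failed
        monitor.markLaunched("worker-0");
        System.out.println(monitor.instancesToProbe()); // prints [worker-0]
    }
}
```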

...

Attachments


YARN-8545.001.patch
23/Jul/18 21:31
26 kB
Chandni Singh

People

Assignee: Chandni Singh
Reporter: Wangda Tan
Votes: 0
Watchers: 6

Dates

Created: 17/Jul/18 18:42
Updated: 30/Jul/18 18:53
Resolved: 26/Jul/18 22:42