[YARN-1284] LCE: Race condition leaves dangling cgroups entries for killed containers - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.3.0
Component/s: nodemanager
Labels: None
Hadoop Flags: Reviewed

Description

When LCE and cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the delivery of the SIGTERM/SIGKILL and the OS completing all the necessary cleanup.

After sending the SIGTERM/SIGKILL and getting the exit code, the LCE code immediately attempts to clean up the cgroups entry for the container. This fails with an error like:

2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_000011 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_000011 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_000011
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_000011

CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers.

Still, waiting an extra 500 ms for every container seems too expensive.

We should look for a more time-efficient way of doing this, perhaps by retrying deleteCgroup() with a minimal sleep between attempts until it succeeds or a timeout expires.
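The retry idea above can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patch: the class name `CgroupDeleteRetry`, the method `deleteWithRetry`, and the parameters `timeoutMs` and `sleepMs` are all hypothetical, and the sketch assumes that removing the cgroup directory simply fails until the kernel has finished detaching the killed container's tasks.

```java
import java.io.File;

// Hypothetical helper for illustration only; names and parameters are assumptions.
class CgroupDeleteRetry {

    /**
     * Retries File.delete() on the given directory until it succeeds or
     * timeoutMs has elapsed, sleeping sleepMs between failed attempts.
     * Returns true if the directory was deleted, false if the timeout expired.
     */
    static boolean deleteWithRetry(File dir, long timeoutMs, long sleepMs)
            throws InterruptedException {
        final long start = System.nanoTime();
        final long timeoutNanos = timeoutMs * 1_000_000L;
        while (true) {
            // For a cgroup directory, this is expected to fail while tasks
            // are still attached to the cgroup.
            if (dir.delete()) {
                return true;
            }
            // Give up once the deadline has passed.
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            // Short pause before the next attempt.
            Thread.sleep(sleepMs);
        }
    }
}
```

With illustrative values such as `deleteWithRetry(new File(cgroupPath), 1000, 20)`, the call would return as soon as the delete succeeds, so the common case should cost much less than a fixed 500 ms wait, while the timeout bounds the worst case.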

Attachments

- YARN-1284.patch (08/Oct/13 14:37, 9 kB, Alejandro Abdelnur)
- YARN-1284.patch (08/Oct/13 16:10, 9 kB, Alejandro Abdelnur)
- YARN-1284.patch (08/Oct/13 17:14, 9 kB, Alejandro Abdelnur)
- YARN-1284.patch (09/Oct/13 03:07, 9 kB, Alejandro Abdelnur)
- YARN-1284.patch (09/Oct/13 03:18, 9 kB, Alejandro Abdelnur)

People

Assignee: Alejandro Abdelnur
Reporter: Alejandro Abdelnur
Votes: 0
Watchers: 6

Dates

Created: 08/Oct/13 00:01
Updated: 24/Feb/14 20:57
Resolved: 09/Oct/13 05:13