[YARN-8515] container-executor can crash with SIGPIPE after nodemanager restart - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 2.8.5, 3.0.4
Component/s: None
Labels:
- Docker

Hadoop Flags: Reviewed

Description

When running with docker on large clusters, we have noticed that sometimes docker containers are not removed - they remain in the exited state, and the corresponding container-executor is no longer running. Upon investigation, we noticed that this always seemed to happen after a nodemanager restart. The sequence leading to the stranded docker containers is:

1. Nodemanager restarts.
2. Containers are recovered and then run for a while.
3. Containers are killed for some (legitimate) reason.
4. Container-executor exits without removing the docker container.

After reproducing this on a test cluster, we found that the container-executor was exiting due to a SIGPIPE.

The shell command executor that is used to start container-executor has threads reading from container-executor's stdout and stderr. When the NM is restarted, these threads are killed, so nothing is left reading from those streams. When the container later exits with an error and the container-executor continues executing, it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is not handled, its default action terminates the container-executor before it can remove the docker container.

We ran into this in branch 2.8. The way docker containers are removed has been completely redesigned in trunk, so I don't think it will lead to this exact failure, but after an NM restart, potentially any write to stderr or stdout in the container-executor could cause it to crash.

Attachments

- YARN-8515.001.patch (1 kB), Jim Brennan, 12/Jul/18 14:13

People

Assignee: Jim Brennan
Reporter: Jim Brennan
Votes: 0
Watchers: 6

Dates

Created: 10/Jul/18 21:51
Updated: 13/Jul/18 15:21
Resolved: 13/Jul/18 15:21