When a container is killed by its AM, we get an error message similar to this:
2019-06-30 12:09:04,412 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 143. Privileged Execution Operation Stderr: Stdout: main : command provided 1 main : run as user is systest main : requested yarn user is systest Getting exit code file... Creating script paths... Writing pid file... Writing to tmp file /yarn/nm/nmPrivate/application_1561921629886_0001/container_e84_1561921629886_0001_01_000019/container_e84_1561921629886_0001_01_000019.pid.tmp Writing to cgroup task files... Creating local dirs... Launching container... Getting exit code file... Creating script paths...
In container-executor.c the fork point is right after the "Creating script paths..." message, yet the Stdout log clearly shows that message, together with "Getting exit code file...", printed twice. After consulting with pbacsko, it seems that a flush is missing in container-executor.c before the fork: the unflushed stdio buffer is copied into the child process, and both processes eventually write it out, which causes the duplication.
I suggest adding a flush there so that the output won't be duplicated. It is misleading that the child process appears to write "Getting exit code file..." and "Creating script paths..." even though it is clearly not performing those steps.
A more appealing solution could be to revisit the fprintf-fflush pairs in the code and replace each pair with a single call, so that the fflush cannot be forgotten accidentally. (A missing fflush can cause problems wherever this pattern is used.)
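One possible shape for such a single call is sketched below. The name fprintf_flush is hypothetical and does not exist in container-executor; it only shows how formatting and flushing could be combined in one helper.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical helper (name assumed for illustration): formats like
 * fprintf and flushes the stream immediately afterwards.
 * Returns the number of characters written, or a negative value on error.
 */
int fprintf_flush(FILE *stream, const char *format, ...) {
    va_list args;
    va_start(args, format);
    int written = vfprintf(stream, format, args);
    va_end(args);

    if (written < 0) return written;        /* formatting or write error */
    if (fflush(stream) != 0) return -1;     /* flush failed */
    return written;
}
```

A call such as fprintf_flush(LOGFILE, "Creating script paths...\n") would then replace an fprintf followed by a separate fflush; LOGFILE here stands for whatever stream the existing code writes to.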
Note: this issue probably affects every call to fork(), not just the one from launch_container_as_user in main.c.
Related to: YARN-9717 Add more logging to container-executor about issues with directory creation or permissions