Description
I encountered an interesting issue in one of our YARN clusters: containers are stuck in the LOCALIZING phase.
Our AM requests a container, and starts a process using the NMClient.
According to the NM the container is in LOCALIZING state:
2017-01-09 22:06:18,362 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_e03_1481261762048_0541_02_000060 transitioned from NEW to LOCALIZING
2017-01-09 22:06:18,363 [INFO] [AsyncDispatcher event handler] localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:711) - Created localizer for container_e03_1481261762048_0541_02_000060
2017-01-09 22:06:18,364 [INFO] [LocalizerRunner for container_e03_1481261762048_0541_02_000060] localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1191) - Writing credentials to the nmPrivate file /../..//.nmPrivate/container_e03_1481261762048_0541_02_000060.tokens. Credentials list:
According to the RM the container is in RUNNING state:
2017-01-09 22:06:17,110 [INFO] [IPC Server handler 19 on 8030] rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410) - container_e03_1481261762048_0541_02_000060 Container Transitioned from ALLOCATED to ACQUIRED
2017-01-09 22:06:19,084 [INFO] [ResourceManager Event Processor] rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410) - container_e03_1481261762048_0541_02_000060 Container Transitioned from ACQUIRED to RUNNING
When I click through the YARN RM UI to view the logs for the container, I get this error:
No logs were found. state is LOCALIZING
The NodeManager's stack trace seems to indicate that the NM's LocalizerRunner is stuck waiting to read from the sub-process's output stream.
"LocalizerRunner for container_e03_1481261762048_0541_02_000060" #27007081 prio=5 os_prio=0 tid=0x00007fa518849800 nid=0x15f7 runnable [0x00007fa5076c3000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:255)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    - locked <0x00000000c6dc9c50> (a java.lang.UNIXProcess$ProcessPipeInputStream)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    - locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.read1(BufferedReader.java:212)
    at java.io.BufferedReader.read(BufferedReader.java:286)
    - locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)
    at org.apache.hadoop.util.Shell.run(Shell.java:479)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:237)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
I ran ps aux and confirmed that no container-executor process with INITIALIZE_CONTAINER (the one the localizer starts) was running. It seems that the output stream pipe of the process has still not been closed, even though the localizer process is no longer present.
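
For illustration only, here is a minimal Java sketch of one way a reader can stay blocked on a child's stdout after the child itself has exited: a descendant process that inherited the pipe's write end keeps it open. This is an assumption about a possible mechanism, not something confirmed for this cluster. The class name PipeHeldOpenDemo, the Result type, and the sh/sleep command line are all hypothetical and unrelated to the actual container-executor; the sketch assumes a POSIX system with sh and sleep on the PATH.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hypothetical demo class; not part of Hadoop.
final class PipeHeldOpenDemo {

    /** Timings for one run, in milliseconds since the child was started. */
    static final class Result {
        final long childExitMillis; // when Process.waitFor() returned for the direct child
        final long eofMillis;       // when the stdout reader finally saw end-of-stream

        Result(long childExitMillis, long eofMillis) {
            this.childExitMillis = childExitMillis;
            this.eofMillis = eofMillis;
        }

        boolean childExitedBeforeEof() {
            return childExitMillis < eofMillis;
        }
    }

    static Result run(int childSeconds, int grandchildSeconds) throws Exception {
        // The shell backgrounds a sleep, which inherits the shell's stdout
        // (the pipe back to this JVM), and then the shell itself exits sooner.
        String script = "sleep " + grandchildSeconds + " & sleep " + childSeconds + "; exit 0";
        Process p = new ProcessBuilder("sh", "-c", script).start();
        final long start = System.nanoTime();

        // Watcher thread: records when the direct child (the shell) has exited.
        final long[] childExit = new long[1];
        Thread watcher = new Thread(() -> {
            try {
                p.waitFor();
                childExit[0] = System.nanoTime();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        watcher.start();

        // Blocking read until EOF, similar in shape to reading a sub-process's
        // output through a BufferedReader. EOF only arrives once every holder
        // of the pipe's write end has closed it.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            while (reader.read() != -1) {
                // discard output
            }
        }
        long eof = System.nanoTime();
        watcher.join();
        return new Result((childExit[0] - start) / 1_000_000, (eof - start) / 1_000_000);
    }

    public static void main(String[] args) throws Exception {
        Result r = run(1, 3);
        System.out.println("direct child exited after (ms): " + r.childExitMillis);
        System.out.println("reader saw EOF after (ms):      " + r.eofMillis);
    }
}
```

In this sketch the shell is gone after about one second, yet the read only returns once the background sleep finishes and releases the pipe. If a descendant never exited, the read would never return, and a ps for the direct child would show nothing, which would be consistent with the symptoms above.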