Hadoop YARN / YARN-6078

Containers stuck in Localizing state

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: None
    • Labels: None

      Description

      I encountered an interesting issue in one of our YARN clusters: containers get stuck in the LOCALIZING state.

      Our AM requests a container and starts a process in it using the NMClient.

      According to the NM, the container is in the LOCALIZING state:

      2017-01-09 22:06:18,362 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_e03_1481261762048_0541_02_000060 transitioned from NEW to LOCALIZING
      2017-01-09 22:06:18,363 [INFO] [AsyncDispatcher event handler] localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:711) - Created localizer for container_e03_1481261762048_0541_02_000060
      2017-01-09 22:06:18,364 [INFO] [LocalizerRunner for container_e03_1481261762048_0541_02_000060] localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1191) - Writing credentials to the nmPrivate file /../..//.nmPrivate/container_e03_1481261762048_0541_02_000060.tokens. Credentials list:
      

      According to the RM, the container is in the RUNNING state:

      2017-01-09 22:06:17,110 [INFO] [IPC Server handler 19 on 8030] rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410) - container_e03_1481261762048_0541_02_000060 Container Transitioned from ALLOCATED to ACQUIRED
      2017-01-09 22:06:19,084 [INFO] [ResourceManager Event Processor] rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410) - container_e03_1481261762048_0541_02_000060 Container Transitioned from ACQUIRED to RUNNING
      

      When I open the YARN RM UI to view the logs for the container, I get this error:

      No logs were found. state is LOCALIZING
      

      The NodeManager's stack trace seems to indicate that the NM's LocalizerRunner is stuck waiting to read from the sub-process's output stream.

      "LocalizerRunner for container_e03_1481261762048_0541_02_000060" #27007081 prio=5 os_prio=0 tid=0x00007fa518849800 nid=0x15f7 runnable [0x00007fa5076c3000]
         java.lang.Thread.State: RUNNABLE
      	at java.io.FileInputStream.readBytes(Native Method)
      	at java.io.FileInputStream.read(FileInputStream.java:255)
      	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
      	- locked <0x00000000c6dc9c50> (a java.lang.UNIXProcess$ProcessPipeInputStream)
      	at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
      	at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
      	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
      	- locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
      	at java.io.InputStreamReader.read(InputStreamReader.java:184)
      	at java.io.BufferedReader.fill(BufferedReader.java:161)
      	at java.io.BufferedReader.read1(BufferedReader.java:212)
      	at java.io.BufferedReader.read(BufferedReader.java:286)
      	- locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
      	at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
      	at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)
      	at org.apache.hadoop.util.Shell.run(Shell.java:479)
      	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:237)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
      
      

      I did a

      ps aux

      and confirmed that no container-executor process was running with INITIALIZE_CONTAINER, the command used to start the localizer. It seems that the process's output pipe has still not been closed, even though the localizer process is no longer present.
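
One way a read on a child's output pipe can outlive the child is when a descendant process inherits the write end of that pipe. The sketch below only illustrates that general behaviour; it is not the NodeManager code, and the report does not confirm that this is what happened here. The class name `PipeHeldOpenDemo`, the method `measure`, and the `sleep`-based shell script are all hypothetical, and the sketch assumes a Unix-like host with `sh` and `sleep` on the PATH.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class, not part of Hadoop.
class PipeHeldOpenDemo {

    /**
     * Starts a shell that lives for shellSeconds and leaves behind a
     * background child that lives for childSeconds and inherits the shell's
     * stdout. Returns { millis until the shell exited,
     *                   millis until the reader saw end-of-stream }.
     */
    static long[] measure(int shellSeconds, int childSeconds)
            throws IOException, InterruptedException {
        String script = "sleep " + childSeconds + " & sleep " + shellSeconds;
        long start = System.nanoTime();
        Process shell = new ProcessBuilder("sh", "-c", script).start();

        AtomicLong eofNanos = new AtomicLong();
        Thread reader = new Thread(() -> {
            try (InputStream out = shell.getInputStream()) {
                // Blocks until every process holding the write end of the
                // pipe has closed it, not merely until the shell exits.
                while (out.read() != -1) {
                    // discard output
                }
            } catch (IOException e) {
                // treat an I/O error as end of stream for this demo
            }
            eofNanos.set(System.nanoTime());
        }, "demo-pipe-reader");
        reader.start();

        shell.waitFor();                    // the direct child is gone here
        long exitNanos = System.nanoTime();
        reader.join();                      // the read ends only after the background child exits

        return new long[] {
            (exitNanos - start) / 1_000_000,
            (eofNanos.get() - start) / 1_000_000
        };
    }

    public static void main(String[] args) throws Exception {
        long[] r = measure(1, 3);
        System.out.println("shell exited after " + r[0] + " ms");
        System.out.println("reader saw EOF after " + r[1] + " ms");
    }
}
```

If the sketch behaves as intended, the second figure should trail the first by roughly the difference between the two lifetimes, which matches the pattern in the thread dump above: the reader is still blocked in a pipe read although the process it launched no longer shows up in `ps`.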

        Attachments

        1. YARN-6078-branch-2.001.patch
          10 kB
          Billie Rinaldi
        2. YARN-6078.003.patch
          10 kB
          Billie Rinaldi
        3. YARN-6078.002.patch
          9 kB
          Billie Rinaldi
        4. YARN-6078.001.patch
          2 kB
          Billie Rinaldi

              People

              • Assignee: Billie Rinaldi (billie.rinaldi)
              • Reporter: Jagadish (jagadish1989@gmail.com)
              • Votes: 0
              • Watchers: 16
