  1. Hadoop YARN
  2. YARN-6078

Containers stuck in Localizing state

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 3.0.0
    • None
    • None

    Description

      I encountered an interesting issue in one of our YARN clusters: containers are stuck in the localizing phase.

      Our AM requests a container and starts a process using the NMClient.

      According to the NM the container is in LOCALIZING state:

      2017-01-09 22:06:18,362 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_e03_1481261762048_0541_02_000060 transitioned from NEW to LOCALIZING
      2017-01-09 22:06:18,363 [INFO] [AsyncDispatcher event handler] localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:711) - Created localizer for container_e03_1481261762048_0541_02_000060
      2017-01-09 22:06:18,364 [INFO] [LocalizerRunner for container_e03_1481261762048_0541_02_000060] localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1191) - Writing credentials to the nmPrivate file /../..//.nmPrivate/container_e03_1481261762048_0541_02_000060.tokens. Credentials list:
      

      According to the RM the container is in RUNNING state:

      2017-01-09 22:06:17,110 [INFO] [IPC Server handler 19 on 8030] rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410) - container_e03_1481261762048_0541_02_000060 Container Transitioned from ALLOCATED to ACQUIRED
      2017-01-09 22:06:19,084 [INFO] [ResourceManager Event Processor] rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410) - container_e03_1481261762048_0541_02_000060 Container Transitioned from ACQUIRED to RUNNING
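
The two log excerpts above disagree about the container's state. As a rough aid for spotting that kind of mismatch, the following Java sketch extracts the most recent transition per container from NM or RM log lines. The class name `ContainerTransitions`, the method `latestStates`, and the regular expression are illustrative assumptions derived only from the two line formats quoted in this report; none of them is part of Hadoop.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContainerTransitions {
    // Assumed to cover the two message shapes quoted in this report:
    //   NM: "Container <id> transitioned from NEW to LOCALIZING"
    //   RM: "<id> Container Transitioned from ACQUIRED to RUNNING"
    private static final Pattern TRANSITION = Pattern.compile(
        "(container_\\w+)\\s+(?:Container\\s+)?[Tt]ransitioned from (\\w+) to (\\w+)");

    /** Returns the last target state seen for each container id, in first-seen order. */
    public static Map<String, String> latestStates(List<String> logLines) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = TRANSITION.matcher(line);
            if (m.find()) {
                latest.put(m.group(1), m.group(3)); // later lines overwrite earlier ones
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String> nmLines = Arrays.asList(
            "Container container_e03_1481261762048_0541_02_000060 transitioned from NEW to LOCALIZING");
        List<String> rmLines = Arrays.asList(
            "container_e03_1481261762048_0541_02_000060 Container Transitioned from ALLOCATED to ACQUIRED",
            "container_e03_1481261762048_0541_02_000060 Container Transitioned from ACQUIRED to RUNNING");
        System.out.println("NM view: " + latestStates(nmLines));
        System.out.println("RM view: " + latestStates(rmLines));
    }
}
```

Running the same method over the NM log and the RM log separately and comparing the two maps highlights containers that the RM considers RUNNING while the NM still reports LOCALIZING.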
      

      When I use the YARN RM UI to view the logs for the container, I get the following error:

      No logs were found. state is LOCALIZING
      

      The NodeManager's stack trace seems to indicate that the NM's LocalizerRunner is stuck waiting to read from the sub-process's output stream.

      "LocalizerRunner for container_e03_1481261762048_0541_02_000060" #27007081 prio=5 os_prio=0 tid=0x00007fa518849800 nid=0x15f7 runnable [0x00007fa5076c3000]
         java.lang.Thread.State: RUNNABLE
      	at java.io.FileInputStream.readBytes(Native Method)
      	at java.io.FileInputStream.read(FileInputStream.java:255)
      	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
      	- locked <0x00000000c6dc9c50> (a java.lang.UNIXProcess$ProcessPipeInputStream)
      	at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
      	at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
      	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
      	- locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
      	at java.io.InputStreamReader.read(InputStreamReader.java:184)
      	at java.io.BufferedReader.fill(BufferedReader.java:161)
      	at java.io.BufferedReader.read1(BufferedReader.java:212)
      	at java.io.BufferedReader.read(BufferedReader.java:286)
      	- locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
      	at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
      	at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)
      	at org.apache.hadoop.util.Shell.run(Shell.java:479)
      	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:237)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
      
      

      I ran ps aux and confirmed that no container-executor process with the INITIALIZE_CONTAINER argument (the process that the localizer starts) was running. It seems that the output stream pipe of the process has still not been closed, even though the localizer process is no longer present.
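
A read on a pipe returns end-of-stream only after the write end has been closed, so a reader that waits for EOF can hang indefinitely if that never happens. The Java sketch below shows one defensive pattern: drain the stream on a helper thread and wait for it with a deadline. It is an illustration under stated assumptions, not the change made in the attached patches; the class `BoundedStreamRead` and the method `readFully` are hypothetical names. Interrupting a thread blocked in a native file or pipe read is generally not enough to unblock it, so a caller would still have to clean up the underlying process or stream after a timeout.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedStreamRead {
    /**
     * Reads the stream to EOF on a helper thread, waiting at most timeoutMillis.
     * Throws TimeoutException if EOF has not arrived by the deadline.
     */
    public static byte[] readFully(InputStream in, long timeoutMillis)
            throws IOException, InterruptedException, TimeoutException {
        ExecutorService reader = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-stream-reader");
            t.setDaemon(true); // a stuck read must not keep the JVM alive
            return t;
        });
        Future<byte[]> result = reader.submit(() -> {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            // For a pipe, -1 (EOF) arrives only once the write end is closed.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        });
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof IOException) {
                throw (IOException) cause;
            }
            throw new IOException(cause);
        } catch (TimeoutException e) {
            result.cancel(true); // best effort: interrupts the helper thread
            throw e;
        } finally {
            reader.shutdownNow();
        }
    }
}
```

With a sub-process, the argument would be the stream returned by `Process.getInputStream()`; on `TimeoutException` the caller could log the stuck read and destroy the process instead of blocking the calling thread forever.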

      Attachments

        1. YARN-6078.001.patch
          2 kB
          Billie Rinaldi
        2. YARN-6078.002.patch
          9 kB
          Billie Rinaldi
        3. YARN-6078.003.patch
          10 kB
          Billie Rinaldi
        4. YARN-6078-branch-2.001.patch
          10 kB
          Billie Rinaldi

            People

              billie Billie Rinaldi
              jagadish1989@gmail.com Jagadish