  1. Hadoop YARN
  2. YARN-8508

On NodeManager, container gets cleaned up before its pid file is created

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.0, 3.1.1
    • Component/s: None
    • Labels: None

    Description

      A GPU fails to be released even though the container using it is being killed:

      2018-07-06 05:22:26,201 INFO  container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000001 transitioned from RUNNING to KILLING
      2018-07-06 05:22:26,250 INFO  container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000002 transitioned from RUNNING to KILLING
      2018-07-06 05:22:26,251 INFO  application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_1530854311763_0006 transitioned from RUNNING to FINISHING_CONTAINERS_WAIT
      2018-07-06 05:22:26,251 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container container_e20_1530854311763_0006_01_000002
      2018-07-06 05:22:31,358 INFO  launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for container_e20_1530854311763_0006_01_000002. Waited for 5000 ms.
      2018-07-06 05:22:31,358 WARN  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid file created container_e20_1530854311763_0006_01_000002
      2018-07-06 05:22:31,359 INFO  launcher.ContainerLaunch (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, but docker container request detected. Attempting to reap container container_e20_1530854311763_0006_01_000002
      2018-07-06 05:22:31,494 INFO  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_000002/launch_container.sh
      2018-07-06 05:22:31,500 INFO  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_000002/container_tokens
      2018-07-06 05:22:31,510 INFO  container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000001 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
      2018-07-06 05:22:31,510 INFO  container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000002 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
      2018-07-06 05:22:31,512 INFO  container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000001 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
      2018-07-06 05:22:31,513 INFO  container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000002 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
      2018-07-06 05:22:38,955 INFO  container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0007_01_000002 transitioned from NEW to SCHEDULED
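
The log above shows cleanup waiting 5000 ms for a pid file, giving up, and continuing without a pid. A minimal sketch of that wait-with-timeout pattern follows, assuming a plain-text pid file. The class `PidFileWait` and its methods are hypothetical and are not the actual `ContainerLaunch` code; the `finally` block only illustrates one way to make sure resources are released whether or not a pid was found.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical illustration; not part of Hadoop YARN.
class PidFileWait {

  /** Polls for a non-empty pid file until timeoutMs has elapsed. */
  static Optional<String> waitForPid(Path pidFile, long timeoutMs, long pollMs) {
    long start = System.nanoTime();
    long timeoutNs = timeoutMs * 1_000_000L;
    while (true) {
      try {
        if (Files.exists(pidFile)) {
          String pid = new String(Files.readAllBytes(pidFile), StandardCharsets.UTF_8).trim();
          if (!pid.isEmpty()) {
            return Optional.of(pid);
          }
        }
      } catch (IOException e) {
        // File may be mid-write or unreadable; fall through and retry.
      }
      if (System.nanoTime() - start >= timeoutNs) {
        return Optional.empty();
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return Optional.empty();
      }
    }
  }

  /** Cleans up a container; resources are released even if no pid is found. */
  static void cleanup(Path pidFile, long timeoutMs, Runnable releaseResources) {
    try {
      Optional<String> pid = waitForPid(pidFile, timeoutMs, 50);
      if (pid.isPresent()) {
        System.out.println("Would signal process " + pid.get());
      } else {
        System.out.println("Container clean up before pid file created");
      }
    } finally {
      // Runs on both paths, so a missing pid cannot leak resources.
      releaseResources.run();
    }
  }

  public static void main(String[] args) {
    // Assumption: this path does not exist, which mimics the missing pid file.
    Path missing = Paths.get("no-such-dir", "container.pid");
    cleanup(missing, 200, () -> System.out.println("Resources released"));
  }
}
```

In this sketch the release step does not depend on the pid lookup succeeding, which is the property the log suggests was missing.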
      
      

      A new container requesting GPUs then fails to launch:

      2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - ResourceHandlerChain.preStart() failed!
      org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Failed to find enough GPUs, requestor=container_e20_1530854311763_0007_01_000002, #RequestedGPUs=2, #availableGpus=1
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75)
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:509)
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:494)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:306)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      2018-07-06 05:22:39,049 WARN  launcher.ContainerLaunch (ContainerLaunch.java:call(331)) - Failed to launch container.
      java.io.IOException: ResourceHandlerChain.preStart() failed!
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:551)
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:494)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:306)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Failed to find enough GPUs, requestor=container_e20_1530854311763_0007_01_000002, #RequestedGPUs=2, #availableGpus=1
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75)
      	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:509)
      	... 8 more
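
The exception above reports `#RequestedGPUs=2, #availableGpus=1`, which is consistent with a GPU still being recorded as held by the killed container. The sketch below is a simplified, hypothetical ledger (`GpuLedger` is not the real `GpuResourceAllocator`) showing how a skipped release produces that message. The node size of 2 GPUs and the shortened container ids are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not part of Hadoop YARN.
class GpuLedger {
  private final Set<Integer> free = new TreeSet<>();
  private final Map<Integer, String> usedBy = new HashMap<>();

  GpuLedger(int totalGpus) {
    for (int i = 0; i < totalGpus; i++) {
      free.add(i);
    }
  }

  int available() {
    return free.size();
  }

  /** Assigns the requested number of GPUs or fails if too few are free. */
  void assign(String containerId, int requested) {
    if (requested > free.size()) {
      throw new IllegalStateException("Failed to find enough GPUs, requestor="
          + containerId + ", #RequestedGPUs=" + requested
          + ", #availableGpus=" + free.size());
    }
    Iterator<Integer> it = free.iterator();
    for (int i = 0; i < requested; i++) {
      int gpu = it.next();
      it.remove();
      usedBy.put(gpu, containerId);
    }
  }

  /** Returns every GPU held by the given container to the free set. */
  void release(String containerId) {
    Iterator<Map.Entry<Integer, String>> it = usedBy.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<Integer, String> entry = it.next();
      if (entry.getValue().equals(containerId)) {
        free.add(entry.getKey());
        it.remove();
      }
    }
  }

  public static void main(String[] args) {
    GpuLedger ledger = new GpuLedger(2); // assumption: node with 2 GPUs
    ledger.assign("container_0006_01_000002", 1);

    // The killed container's release was skipped, so only 1 GPU is free.
    try {
      ledger.assign("container_0007_01_000002", 2);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }

    // Once the old container's GPU is released, the same request succeeds.
    ledger.release("container_0006_01_000002");
    ledger.assign("container_0007_01_000002", 2);
    System.out.println("GPUs still free: " + ledger.available());
  }
}
```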
      

      Attachments

        1. YARN-8505.002.patch
          5 kB
          Chandni Singh
        2. YARN-8505.001.patch
          3 kB
          Chandni Singh

          People

            Assignee: Chandni Singh (csingh)
            Reporter: Sumana Sathish (ssathish@hortonworks.com)
            Votes: 0
            Watchers: 10
