Hadoop YARN / YARN-4549

Containers stuck in KILLING state

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.7.1

    Description

      We are running Samza 0.8 on YARN 2.7.1, using LinuxContainerExecutor as the container executor with cgroups configured. NM recovery is also enabled.

      We observe many containers that get stuck in the KILLING state after the NM tries to kill them. The containers remain running indefinitely, which causes duplication because new containers are brought up to replace them. Judging from the logs, the NM cannot obtain the container PID.

      16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping container with container Id: container_1448454866800_0023_01_000005
      16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user IP=10.51.111.243        OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1448454866800_0023    CONTAINERID=container_1448454866800_0023_01_000005
      16/01/05 05:16:44 INFO container.ContainerImpl: Container container_1448454866800_0023_01_000005 transitioned from RUNNING to KILLING
      16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container container_1448454866800_0023_01_000005
      16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for container_1448454866800_0023_01_000005. Waited for 2000 ms.
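
To find every affected container rather than inspecting the log by hand, the two messages above can be matched mechanically. The following is a minimal sketch, not part of the original report: the regular expressions are derived only from the log lines quoted here, and the function name `find_stuck_candidates` is hypothetical.

```python
import re

# Patterns derived from the NodeManager log lines quoted in this report.
# They are assumptions about the message format and may differ in other
# Hadoop versions.
KILLING_RE = re.compile(
    r"Container (container_\w+) transitioned from RUNNING to KILLING"
)
NO_PID_RE = re.compile(
    r"Could not get pid for (container_\w+)\. Waited for \d+ ms"
)


def find_stuck_candidates(log_lines):
    """Return IDs of containers that entered KILLING and whose PID lookup failed."""
    killing = set()
    no_pid = set()
    for line in log_lines:
        match = KILLING_RE.search(line)
        if match:
            killing.add(match.group(1))
            continue
        match = NO_PID_RE.search(line)
        if match:
            no_pid.add(match.group(1))
    # A container is a candidate only if both events were logged for it.
    return sorted(killing & no_pid)
```

Passing the lines of a NodeManager log file (for example `open(path).readlines()`, where `path` is wherever the NM log lives on the host) returns the container IDs worth checking first.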
      

      The PID files for containers in the KILLING state are missing, and a few other containers that have been in the RUNNING state for a few weeks are also missing them. We weren't able to reproduce this consistently and are hoping that someone has come across it before.
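
A check for missing PID files could be scripted along these lines. This is a sketch under stated assumptions, not something taken from the report: it assumes the NodeManager writes each PID file to `<local-dir>/nmPrivate/<application id>/<container id>/<container id>.pid` under one of the directories in `yarn.nodemanager.local-dirs`, which should be verified against the Hadoop version in use. All function names are hypothetical.

```python
import os


def app_id_for(container_id):
    """Derive the application ID from a container ID.

    container_1448454866800_0023_01_000005 -> application_1448454866800_0023
    """
    parts = container_id.split("_")
    # Some IDs carry an epoch field (container_e17_<ts>_<app>_...); skip it.
    if parts[1].startswith("e"):
        parts.pop(1)
    return "application_{}_{}".format(parts[1], parts[2])


def pid_file_candidates(local_dirs, container_id):
    """List the paths where the PID file is ASSUMED to live, one per local dir."""
    app_id = app_id_for(container_id)
    return [
        os.path.join(d, "nmPrivate", app_id, container_id, container_id + ".pid")
        for d in local_dirs
    ]


def containers_missing_pid_file(local_dirs, container_ids):
    """Return the container IDs with no PID file in any of the local dirs."""
    return [
        cid
        for cid in container_ids
        if not any(os.path.isfile(p) for p in pid_file_candidates(local_dirs, cid))
    ]
```

Running `containers_missing_pid_file` over the IDs of all live containers on a node would show whether the long-running containers mentioned above are affected as well.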

          People

            Assignee: Unassigned
            Reporter: Danil Serdyuchenko
            Votes: 1
            Watchers: 9