  1. Hadoop YARN
  2. YARN-7278

LinuxContainer in docker mode fails when nodemanager restarts, because the timeout for docker is too short.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: nodemanager
    • Environment: CentOS

      Description

      In our cluster, nodemanager recovery is turned on, and we use LinuxContainer in docker mode.
      A container may fail when the nodemanager restarts. The exception is below:

      [2017-09-29T15:47:14.433+08:00] [INFO] containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java 472) [Container Monitor] : Memory usage of ProcessTree 120523 for container-id container_1506600355508_0023_01_000004: -1B of 10 GB physical memory used; -1B of 31 GB virtual memory used
      [2017-09-29T15:47:15.219+08:00] [ERROR] containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java 93) [ContainersLauncher #1] : Unable to recover container container_1506600355508_0023_01_000004
      java.io.IOException: Timeout while waiting for exit code from container_1506600355508_0023_01_000004
      [2017-09-29T15:47:15.220+08:00] [INFO] containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142) [AsyncDispatcher event handler] : Container container_1506600355508_0023_01_000004 transitioned from RUNNING to EXITED_WITH_FAILURE
      [2017-09-29T15:47:15.221+08:00] [INFO] containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java 440) [AsyncDispatcher event handler] : Cleaning up container container_1506600355508_0023_01_000004
      

      I guess the process is done, but 2 seconds later (the variable is msecLeft) the *.pid.exitcode file still hasn't been created. I then changed the variable to 20000 ms, and the container succeeded when the nodemanager restarted.
      So I think 2 seconds is too short for the docker container to complete its work.
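
      The wait described above can be sketched roughly as follows. This is only an illustration of polling for an exit code file with a time limit, not the nodemanager source: the class name ExitCodeWait, the method waitForExitCode and its parameters are hypothetical. Only the 2000 ms and 20000 ms values and the error message come from this report.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Hadoop.
class ExitCodeWait {
    /**
     * Polls until exitCodeFile exists or timeoutMs has elapsed, then returns
     * the integer exit code stored in the file.
     */
    static int waitForExitCode(Path exitCodeFile, long timeoutMs, long sleepMs)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!Files.exists(exitCodeFile)) {
            if (System.nanoTime() - deadline >= 0) {
                // Same symptom as in the log above: the file never showed up in time.
                throw new IOException(
                    "Timeout while waiting for exit code from " + exitCodeFile);
            }
            Thread.sleep(sleepMs);
        }
        // Assumes the writer makes the file visible only once it is complete.
        byte[] raw = Files.readAllBytes(exitCodeFile);
        return Integer.parseInt(new String(raw, StandardCharsets.UTF_8).trim());
    }
}
```

      In this sketch, waitForExitCode(path, 2000, 100) corresponds to the 2 second limit, and passing 20000 corresponds to the workaround I tried.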

      In docker mode of LinuxContainer, nm monitors the real task, which is launched by the "docker run" command. Then the "docker wait" command waits for the exit code, and "docker rm" deletes the docker container. Lastly, container-executor writes the exit code. So if any of these docker commands is slow enough, the exit code file is written too late and nm gives up on the container. In fact, docker rm is always slow.

      I think the exit code of docker rm has nothing to do with the real task, so we could write "*.pid.exitcode" before running docker rm. Alternatively, nm could monitor the docker wait process instead of the real task.
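
      To show why the ordering matters, here is a toy timing model of the steps that run after the real task exits. It is a sketch under assumptions, not container-executor code: the class ExitCodeOrdering, the step names and every duration used with it are hypothetical. It only adds up step durations until the exit code is written and compares that delay with the timeout.

```java
import java.util.List;

// Hypothetical timing model, not part of Hadoop or container-executor.
class ExitCodeOrdering {
    static final String WRITE_EXIT_CODE = "write exitcode";

    /** One step that runs after the task exits, with an assumed duration. */
    static final class Step {
        final String name;
        final long millis;

        Step(String name, long millis) {
            this.name = name;
            this.millis = millis;
        }
    }

    /** Milliseconds between task exit and the exit code file being written. */
    static long delayUntilExitCodeWritten(List<Step> steps) {
        long elapsed = 0;
        for (Step step : steps) {
            elapsed += step.millis;
            if (step.name.equals(WRITE_EXIT_CODE)) {
                return elapsed;
            }
        }
        throw new IllegalArgumentException("no '" + WRITE_EXIT_CODE + "' step");
    }

    /** True if the exit code file appears within the recovery timeout. */
    static boolean recovers(List<Step> steps, long timeoutMs) {
        return delayUntilExitCodeWritten(steps) <= timeoutMs;
    }
}
```

      With made-up durations of 100 ms for docker wait to return, 5000 ms for docker rm and 10 ms for the write, the current order (wait, rm, write) gives a delay of 5110 ms and misses a 2000 ms timeout, while the proposed order (wait, write, rm) gives 110 ms and does not depend on how slow docker rm is.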

              People

              • Assignee: Unassigned
              • Reporter: zhengchenyu
              • Votes: 1
              • Watchers: 4

                  Time Tracking

                  • Estimated: 1m
                  • Remaining: 1m
                  • Logged: Not Specified