Hadoop YARN / YARN-3611 Support Docker Containers In LinuxContainerExecutor / YARN-8465

Dshell docker container gets marked as lost after NM restart

Details

    Description

      Scenario:
      1) Launch the dshell (distributed shell) application:

      yarn jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command "sleep 500" -num_containers 2 -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=xx/httpd:0.1 -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar

      2) Wait for the app to reach a stable state (container_e01_1529968198450_0001_01_000002 is running on host7 and container_e01_1529968198450_0001_01_000003 is running on host5).
      3) Restart the NM on host7.

      After the restart, the dshell application fails with the error below:

      18/06/25 23:35:30 INFO distributedshell.Client: Got application report from ASM for, appId=1, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host9/xxx, appQueue=default, appMasterRpcPort=-1, appStartTime=1529969211776, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=https://host4:8090/proxy/application_1529968198450_0001/, appUser=hbase
      18/06/25 23:35:31 INFO distributedshell.Client: Got application report from ASM for, appId=1, clientToAMToken=null, appDiagnostics=Application Failure: desired = 2, completed = 2, allocated = 2, failed = 1, diagnostics = [2018-06-25 23:35:28.000]Container exited with a non-zero exit code 154
      [2018-06-25 23:35:28.001]Container exited with a non-zero exit code 154
      , appMasterHost=host9/xxx, appQueue=default, appMasterRpcPort=-1, appStartTime=1529969211776, yarnAppState=FINISHED, distributedFinalState=FAILED, appTrackingUrl=https://host4:8090/proxy/application_1529968198450_0001/, appUser=hbase
      18/06/25 23:35:31 INFO distributedshell.Client: Application did finished unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring loop
      18/06/25 23:35:31 ERROR distributedshell.Client: Application failed to complete successfully
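
To make the diagnostics above easier to read, here is a minimal Python sketch that pulls the exit codes out of an appDiagnostics string and maps them to names. The helper names (`exit_codes`, `describe`) are hypothetical, and the code-to-name table is an assumption based on my understanding of the NodeManager's exit codes, in which 154 denotes a lost container; verify it against the Hadoop source before relying on it.

```python
import re
from typing import List

# Assumed subset of NodeManager container exit codes (not taken from this report).
EXIT_CODE_NAMES = {0: "SUCCESS", 137: "FORCE_KILLED", 143: "TERMINATED", 154: "LOST"}

# Matches the diagnostic message format seen in the client log.
EXIT_CODE_RE = re.compile(r"Container exited with a non-zero exit code (\d+)")


def exit_codes(diagnostics: str) -> List[int]:
    """Return every exit code mentioned in a diagnostics string, in order."""
    return [int(code) for code in EXIT_CODE_RE.findall(diagnostics)]


def describe(code: int) -> str:
    """Map an exit code to a name, or UNKNOWN if it is not in the table."""
    return EXIT_CODE_NAMES.get(code, "UNKNOWN")


# The two diagnostic lines from the client log in this report.
sample = (
    "[2018-06-25 23:35:28.000]Container exited with a non-zero exit code 154\n"
    "[2018-06-25 23:35:28.001]Container exited with a non-zero exit code 154\n"
)

for code in exit_codes(sample):
    print(code, describe(code))
```

Under that assumed table, both diagnostic lines resolve to LOST, which matches the NM-side observation that follows.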

      Here, the Docker container is marked as LOST after completion. The NM log shows:

      2018-06-25 23:35:27,970 WARN  runtime.DockerLinuxContainerRuntime (DockerLinuxContainerRuntime.java:signalContainer(1034)) - Signal docker container failed. Exception:
      org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Liveliness check failed for PID: 423695. Container may have already completed.
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.executeLivelinessCheck(DockerLinuxContainerRuntime.java:1208)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.signalContainer(DockerLinuxContainerRuntime.java:1026)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:159)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:755)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:905)
              at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:284)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:721)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      2018-06-25 23:35:27,975 WARN  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:signalContainer(762)) - Error in signalling container 423695 with NULL; exit = -1
      org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Signal docker container failed
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.signalContainer(DockerLinuxContainerRuntime.java:1036)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:159)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:755)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:905)
              at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:284)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:721)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
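
The call path in the traces above (reacquireContainer, then isContainerAlive, then signalContainer, then executeLivelinessCheck) can be modelled roughly as follows. This is a simplified Python sketch under stated assumptions, not the NodeManager's code: all function names are hypothetical stand-ins, the real executor polls repeatedly and reads an exit code file, and the value 154 for LOST is assumed. It only illustrates why a failed liveliness check for the recorded PID can end with the container being reported as lost.

```python
from typing import Callable, Optional

LOST = 154  # assumed exit code for a container whose status could not be recovered


class LivelinessCheckError(Exception):
    """Stand-in for the ContainerExecutionException raised by the liveliness check."""


def is_container_alive(pid: int, liveliness_check: Callable[[int], None]) -> bool:
    """Treat any failure of the liveliness check as 'container not alive'."""
    try:
        liveliness_check(pid)
        return True
    except LivelinessCheckError:
        return False


def reacquire_once(
    pid: int,
    liveliness_check: Callable[[int], None],
    read_exit_code: Callable[[], Optional[int]],
) -> Optional[int]:
    """One recovery pass: None while alive, else the recorded exit code or LOST."""
    if is_container_alive(pid, liveliness_check):
        return None  # still running; a real executor would sleep and poll again
    code = read_exit_code()
    return LOST if code is None else code


def failing_check(pid: int) -> None:
    """Simulates the failure logged for PID 423695 after the NM restart."""
    raise LivelinessCheckError(f"Liveliness check failed for PID: {pid}")


# No exit code was recorded, so the recovery pass reports the container as lost.
print(reacquire_once(423695, failing_check, lambda: None))
```

In this model the outcome depends only on whether the check for the recorded PID succeeds, so a check that fails for a Docker container after an NM restart yields LOST regardless of the container's actual state.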

          People

            Shane Kumpf (shanekumpf@gmail.com)
            Yesha Vora (yeshavora)
            Votes: 0
            Watchers: 6