  1. Hadoop YARN
  2. YARN-9691

Cancel-upgrade does not work when a container that failed to upgrade exists

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved

    Description

      If a container fails to upgrade during a YARN service upgrade, the container is released and the component instance transitions to the FAILED_UPGRADE state.
      After that, I expected to be able to roll back to the previous version using cancel-upgrade, but it did not work.
      At that time, the AM log was as follows:

      # failed to upgrade container_e62_1563179597798_0006_01_000008
      
      2019-07-16 18:21:55,152 [IPC Server handler 0 on 39483] INFO  service.ClientAMService - Upgrade container container_e62_1563179597798_0006_01_000008
      2019-07-16 18:21:55,153 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from NEEDS_UPGRADE -> UPGRADING
      2019-07-16 18:21:55,154 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] Transitioned from READY to UPGRADING on UPGRADE event
      2019-07-16 18:21:55,154 [pool-5-thread-4] INFO  registry.YarnRegistryViewForProviders - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008]: Deleting registry path /users/test/services/yarn-service/sleeptest/components/ctr-e62-1563179597798-0006-01-000008
      2019-07-16 18:21:55,156 [pool-6-thread-6] INFO  provider.ProviderUtils - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] version 1.0.1 : Creating dir on hdfs: hdfs://test1.com:8020/user/test/.yarn/services/sleeptest/components/1.0.1/sleep/sleep-0
      2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  containerlaunch.ContainerLaunchService - reInitializing container container_e62_1563179597798_0006_01_000008 with version 1.0.1
      2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  containerlaunch.AbstractLauncher - yarn docker env var has been set {LANGUAGE=en_US.UTF-8, HADOOP_USER_NAME=test, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME=sleep-0.sleeptest.test.EXAMPLE.COM, WORK_DIR=$PWD, LC_ALL=en_US.UTF-8, YARN_CONTAINER_RUNTIME_TYPE=docker, YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=registry.test.com/test/sleep1:latest, LANG=en_US.UTF-8, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=bridge, YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true, LOG_DIR=<LOG_DIR>}
      2019-07-16 18:21:55,158 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #7] INFO  impl.NMClientAsyncImpl - Processing Event EventType: REINITIALIZE_CONTAINER for Container container_e62_1563179597798_0006_01_000008
      2019-07-16 18:21:55,167 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from UPGRADING -> RUNNING_BUT_UNREADY
      2019-07-16 18:21:55,167 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] retrieve status after 30
      2019-07-16 18:21:55,167 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] Transitioned from UPGRADING to REINITIALIZED on START event
      2019-07-16 18:22:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:07 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet"
      2019-07-16 18:22:37,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:37 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet"
      2019-07-16 18:23:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:23:07 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet"
      2019-07-16 18:23:08,225 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from RUNNING_BUT_UNREADY -> FAILED_UPGRADE
      
      # cancel-upgrade requested
      
      2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_000004 true
      2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_000003 true
      2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_000008 true
      2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager - [SERVICE] spec state changed from UPGRADING -> CANCEL_UPGRADING
      2019-07-16 18:28:22,713 [Component  dispatcher] INFO  component.Component - [COMPONENT sleep]: need upgrade to 1.0.0
      2019-07-16 18:28:22,713 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from FAILED_UPGRADE -> NEEDS_UPGRADE
      2019-07-16 18:28:22,713 [Component  dispatcher] INFO  component.Component - [COMPONENT sleep] Transitioned from UPGRADING to CANCEL_UPGRADING on CANCEL_UPGRADE event.
      2019-07-16 18:28:22,713 [Component  dispatcher] INFO  component.Component - [COMPONENT sleep1]: need upgrade to 1.0.0
      2019-07-16 18:28:22,714 [Component  dispatcher] INFO  component.Component - [COMPONENT sleep1] Transitioned from UPGRADING to CANCEL_UPGRADING on CANCEL_UPGRADE event.
      2019-07-16 18:28:22,714 [Component  dispatcher] INFO  instance.ComponentInstance - container_e62_1563179597798_0006_01_000004 nothing to cancel
      2019-07-16 18:28:22,714 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE sleep-2 : container_e62_1563179597798_0006_01_000004] spec state state changed from NEEDS_UPGRADE -> READY
      2019-07-16 18:28:22,714 [Component  dispatcher] INFO  instance.ComponentInstance - container_e62_1563179597798_0006_01_000003 nothing to cancel
      2019-07-16 18:28:22,714 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE sleep-1 : container_e62_1563179597798_0006_01_000003] spec state state changed from NEEDS_UPGRADE -> READY
      2019-07-16 18:28:22,714 [Component  dispatcher] ERROR service.ServiceScheduler - No component instance exists for container_e62_1563179597798_0006_01_000008
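
The log above suggests one possible reading of the failure: the cancel-upgrade request is routed by container ID, but the container that failed to upgrade has already been released, so the lookup finds no instance. The following is a minimal toy model of that reading, not the Hadoop code. Every name in it (`CancelUpgradeSketch`, `register`, `failUpgrade`, `cancelUpgrade`, the shortened container IDs) is hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the suspected failure mode; NOT the Hadoop implementation. */
final class CancelUpgradeSketch {
    enum State { NEEDS_UPGRADE, FAILED_UPGRADE, READY }

    /** Hypothetical index of live containers: container ID -> instance name. */
    private final Map<String, String> liveInstances = new HashMap<>();
    /** Hypothetical per-instance state, keyed by instance name. */
    private final Map<String, State> states = new HashMap<>();

    /** Registers a running container whose instance is waiting to be upgraded. */
    void register(String containerId, String instanceName) {
        liveInstances.put(containerId, instanceName);
        states.put(instanceName, State.NEEDS_UPGRADE);
    }

    /** Models a failed upgrade: the container is released, the instance is marked failed. */
    void failUpgrade(String containerId) {
        String instance = liveInstances.remove(containerId);
        if (instance != null) {
            states.put(instance, State.FAILED_UPGRADE);
        }
    }

    /**
     * Models cancel-upgrade dispatched by container ID.
     * Returns the IDs for which no live instance could be found.
     */
    List<String> cancelUpgrade(List<String> containerIds) {
        List<String> unresolved = new ArrayList<>();
        for (String id : containerIds) {
            String instance = liveInstances.get(id);
            if (instance == null) {
                // Assumed counterpart of "No component instance exists for <container>".
                unresolved.add(id);
                continue;
            }
            states.put(instance, State.READY);
        }
        return unresolved;
    }

    State stateOf(String instanceName) {
        return states.get(instanceName);
    }

    public static void main(String[] args) {
        CancelUpgradeSketch sketch = new CancelUpgradeSketch();
        sketch.register("container_03", "sleep-1");
        sketch.register("container_04", "sleep-2");
        sketch.register("container_08", "sleep-0");

        sketch.failUpgrade("container_08");

        List<String> unresolved = sketch.cancelUpgrade(
                Arrays.asList("container_04", "container_03", "container_08"));
        System.out.println("Unresolved containers: " + unresolved);
        System.out.println("sleep-0 state: " + sketch.stateOf("sleep-0"));
    }
}
```

In this model the two healthy instances return to READY, while the released container's ID is the only one left unresolved and its instance is never rolled back. Whether the actual AM follows this path is an assumption drawn from the log, not something the description confirms.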
      
      

      Attachments

        1. YARN-9691.001.patch
          5 kB
          kyungwan nam
        2. YARN-9691.002.patch
          6 kB
          kyungwan nam

          People

            Assignee: kyungwan nam
            Reporter: kyungwan nam
            Votes: 0
            Watchers: 1
