  1. Hadoop YARN
  2. YARN-7054 Yarn Service Phase 2
  3. YARN-8243

Flex down should remove instance with largest component instance ID first


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.2.0, 3.1.1
    • Component/s: yarn-native-services
    • Labels: None

    Description

      This is easy to reproduce on a service with an anti-affinity component, which makes it simple to create pending container requests. Pending requests can also be produced by other means (no resources left in the cluster, etc.).

      Service yarnfile used to test this -

      {
        "name": "sleeper-service",
        "version": "1",
        "components" :
        [
          {
            "name": "ping",
            "number_of_containers": 2,
            "resource": {
              "cpus": 1,
              "memory": "256"
            },
            "launch_command": "sleep 9000",
            "placement_policy": {
              "constraints": [
                {
                  "type": "ANTI_AFFINITY",
                  "scope": "NODE",
                  "target_tags": [
                    "ping"
                  ]
                }
              ]
            }
          }
        ]
      }
      

      Launch a service with the above yarnfile as below -

      yarn app -launch simple-aa-1 simple_AA.json
      

      Assume the cluster has only 5 nodes. Now flex the service to one container more than the number of nodes (6 in my case).

      yarn app -flex simple-aa-1 -component ping 6
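
The effect of this flex can be modelled in a few lines of Python. This is only a toy model of a NODE-scoped anti-affinity constraint (at most one ping container per node). The function name and the node names are made up for illustration, and nothing here is YARN scheduler code.

```python
def place_with_anti_affinity(requested, nodes):
    """Place at most one container per node.

    Returns (running, pending): a dict of instance name -> node,
    and the number of requests that could not be placed.
    """
    running = {}
    for instance_id in range(requested):
        used = set(running.values())
        free = [node for node in nodes if node not in used]
        if not free:
            break  # every node already hosts one container
        running[f"ping-{instance_id}"] = free[0]
    pending = requested - len(running)
    return running, pending


# Hypothetical 5-node cluster, flexed to 6 containers.
nodes = [f"node-{i}" for i in range(5)]
running, pending = place_with_anti_affinity(6, nodes)
print(f"running={len(running)} pending={pending}")
```

With 5 nodes and 6 requested containers the model leaves one request pending, which is the situation the following steps rely on.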
      

      Because of the anti-affinity constraint, only 5 containers will be allocated and running for simple-aa-1, and the sixth request stays pending. At this point, flex it down to 5 containers -

      yarn app -flex simple-aa-1 -component ping 5
      

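      This is what the service AM log shows at this point. Note that the running instance ping-4 is destroyed, while the pending request is left outstanding and is later assigned to a new instance, ping-5 -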

      2018-05-03 20:17:38,469 [IPC Server handler 0 on 38124] INFO  service.ClientAMService - Flexing component ping to 5
      2018-05-03 20:17:38,469 [Component  dispatcher] INFO  component.Component - [FLEX DOWN COMPONENT ping]: scaling down from 6 to 5
      2018-05-03 20:17:38,470 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Flexed down by user, destroying.
      2018-05-03 20:17:38,473 [Component  dispatcher] INFO  component.Component - [COMPONENT ping] Transitioned from FLEXING to STABLE on FLEX event.
      2018-05-03 20:17:38,474 [pool-5-thread-8] INFO  registry.YarnRegistryViewForProviders - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleting registry path /users/root/services/yarn-service/simple-aa-1/components/ctr-1525297086734-0013-01-000006
      2018-05-03 20:17:38,476 [Component  dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
      org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
      	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
      	at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
      	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
      	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
      	at java.lang.Thread.run(Thread.java:745)
      2018-05-03 20:17:38,480 [Component  dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
      org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
      	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
      	at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
      	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
      	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
      	at java.lang.Thread.run(Thread.java:745)
      2018-05-03 20:17:38,578 [pool-5-thread-8] INFO  instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleted component instance dir: hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-4
      2018-05-03 20:17:39,268 [AMRM Callback Handler Thread] WARN  service.ServiceScheduler - Container container_1525297086734_0013_01_000006 Completed. No component instance exists. exitStatus=-100. diagnostics=Container released by application 
      2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO  service.ServiceScheduler - 1 containers allocated. 
      2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO  service.ServiceScheduler - [COMPONENT ping]: remove 0 outstanding container requests for allocateId 0
      2018-05-03 20:17:40,274 [Component  dispatcher] INFO  component.Component - [COMPONENT ping]: container_1525297086734_0013_01_000007 allocated, num pending component instances reduced to 0
      2018-05-03 20:17:40,274 [Component  dispatcher] INFO  component.Component - [COMPONENT ping]: Assigned container_1525297086734_0013_01_000007 to component instance ping-5 and launch on host ctr-e138-1518143905142-280820-01-000008.example.site:25454 
      2018-05-03 20:17:40,277 [pool-6-thread-6] INFO  provider.ProviderUtils - [COMPINSTANCE ping-5 : container_1525297086734_0013_01_000007]: Creating dir on hdfs: hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-5
      2018-05-03 20:17:40,316 [pool-6-thread-6] INFO  containerlaunch.ContainerLaunchService - launching container container_1525297086734_0013_01_000007
      2018-05-03 20:17:40,318 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #5] INFO  impl.NMClientAsyncImpl - Processing Event EventType: START_CONTAINER for Container container_1525297086734_0013_01_000007
      2018-05-03 20:17:40,338 [Component  dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CONTAINER_STARTED at STABLE
      org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_STARTED at STABLE
      	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
      	at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
      	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
      	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
      	at java.lang.Thread.run(Thread.java:745)
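
The ordering asked for in the title can be sketched in Python. This is a minimal illustration under stated assumptions, not the code from the attached patches: `Instance`, `pick_for_flex_down` and the container IDs are hypothetical, and a pending instance is modelled simply as one with no container assigned.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    name: str                 # component instance name, e.g. "ping-4"
    container: Optional[str]  # None while the container request is still pending

    @property
    def instance_id(self) -> int:
        # "ping-4" -> 4
        return int(self.name.rsplit("-", 1)[1])


def pick_for_flex_down(instances: List[Instance], delta: int) -> List[Instance]:
    """Choose the `delta` instances with the largest instance IDs."""
    ordered = sorted(instances, key=lambda inst: inst.instance_id, reverse=True)
    return ordered[:delta]


# Five running instances plus one pending instance (hypothetical container IDs).
instances = [Instance(f"ping-{i}", f"container_{i + 2:06d}") for i in range(5)]
instances.append(Instance("ping-5", None))

victims = pick_for_flex_down(instances, delta=1)
for inst in victims:
    if inst.container is None:
        print(f"{inst.name}: cancel pending container request")
    else:
        print(f"{inst.name}: destroy {inst.container}")
```

In this model the instance with the largest ID is the pending one, so flexing from 6 to 5 only cancels an outstanding request and leaves the five running containers untouched.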
      

      The status response shows that only 4 containers are running and the service has not reached the STABLE state (the ping component remains in FLEXING) -

      yarn app -status simple-aa-1
      

      output -

      {
          "components": [
              {
                  "configuration": {
                      "env": {},
                      "files": [],
                      "properties": {}
                  },
                  "containers": [
                      {
                          "bare_host": "ctr-e138-1518143905142-280820-01-000007.example.site",
                          "component_instance_name": "ping-1",
                          "hostname": "ctr-e138-1518143905142-280820-01-000007.example.site",
                          "id": "container_1525297086734_0013_01_000003",
                          "ip": "x.x.x.x",
                          "launch_time": 1525378141535,
                          "state": "READY"
                      },
                      {
                          "bare_host": "ctr-e138-1518143905142-280820-01-000006.example.site",
                          "component_instance_name": "ping-0",
                          "hostname": "ctr-e138-1518143905142-280820-01-000006.example.site",
                          "id": "container_1525297086734_0013_01_000002",
                          "ip": "x.x.x.x",
                          "launch_time": 1525378141513,
                          "state": "READY"
                      },
                      {
                          "bare_host": "ctr-e138-1518143905142-280820-01-000005.example.site",
                          "component_instance_name": "ping-3",
                          "hostname": "ctr-e138-1518143905142-280820-01-000005.example.site",
                          "id": "container_1525297086734_0013_01_000005",
                          "ip": "x.x.x.x",
                          "launch_time": 1525378303429,
                          "state": "READY"
                      },
                      {
                          "bare_host": "ctr-e138-1518143905142-280820-01-000004.example.site",
                          "component_instance_name": "ping-2",
                          "hostname": "ctr-e138-1518143905142-280820-01-000004.example.site",
                          "id": "container_1525297086734_0013_01_000004",
                          "ip": "x.x.x.x",
                          "launch_time": 1525378303425,
                          "state": "READY"
                      }
                  ],
                  "dependencies": [],
                  "launch_command": "sleep 9000",
                  "name": "ping",
                  "number_of_containers": 5,
                  "placement_policy": {
                      "constraints": [
                          {
                              "node_attributes": {},
                              "node_partitions": [],
                              "scope": "NODE",
                              "target_tags": [
                                  "ping"
                              ],
                              "type": "ANTI_AFFINITY"
                          }
                      ]
                  },
                  "quicklinks": [],
                  "resource": {
                      "additional": {},
                      "cpus": 1,
                      "memory": "256"
                  },
                  "run_privileged_container": false,
                  "state": "FLEXING"
              }
          ],
          "configuration": {
              "env": {},
              "files": [],
              "properties": {}
          },
          "id": "application_1525297086734_0013",
          "kerberos_principal": {},
          "lifetime": -1,
          "name": "simple-aa-1",
          "quicklinks": {},
          "state": "STARTED",
          "version": "1"
      }
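
A mismatch like this is easy to spot mechanically. The sketch below is a hypothetical helper, not part of the YARN CLI: it counts READY containers per component and compares the count with number_of_containers. The embedded JSON is an abbreviated copy of the output above that keeps only the fields the check reads.

```python
import json


def ready_vs_desired(status):
    """Return {component name: (READY containers, desired containers)}."""
    summary = {}
    for comp in status.get("components", []):
        ready = sum(
            1 for c in comp.get("containers", []) if c.get("state") == "READY"
        )
        summary[comp["name"]] = (ready, comp["number_of_containers"])
    return summary


# Abbreviated status output: only the fields used by the check are kept.
status_text = """
{
  "name": "simple-aa-1",
  "state": "STARTED",
  "components": [
    {
      "name": "ping",
      "number_of_containers": 5,
      "state": "FLEXING",
      "containers": [
        {"component_instance_name": "ping-1", "state": "READY"},
        {"component_instance_name": "ping-0", "state": "READY"},
        {"component_instance_name": "ping-3", "state": "READY"},
        {"component_instance_name": "ping-2", "state": "READY"}
      ]
    }
  ]
}
"""

summary = ready_vs_desired(json.loads(status_text))
for name, (ready, desired) in summary.items():
    flag = "OK" if ready == desired else "MISMATCH"
    print(f"{name}: {ready}/{desired} READY [{flag}]")
```

For the output above, the check reports 4 READY containers against a desired count of 5 for the ping component.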
      

      Attachments

        1. YARN-8243.01.patch
          14 kB
          Gour Saha
        2. YARN-8243.02.patch
          12 kB
          Gour Saha

          People

            Assignee: Gour Saha (gsaha)
            Reporter: Gour Saha (gsaha)