Details
Sub-task
Status: Resolved
Critical
Resolution: Fixed
3.1.0
None
Description
This is easy to reproduce on a service with an anti-affinity component, which makes it simple to simulate pending container requests. Pending requests can also be simulated by other means (for example, when no resources are left in the cluster).
Service yarnfile used to test this -
{
  "name": "sleeper-service",
  "version": "1",
  "components": [
    {
      "name": "ping",
      "number_of_containers": 2,
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "launch_command": "sleep 9000",
      "placement_policy": {
        "constraints": [
          {
            "type": "ANTI_AFFINITY",
            "scope": "NODE",
            "target_tags": [ "ping" ]
          }
        ]
      }
    }
  ]
}
Launch a service with the above yarnfile as below -
yarn app -launch simple-aa-1 simple_AA.json
Let's assume there are only 5 nodes in this cluster. Now flex the above service to one more container than the number of nodes (6 in my case).
yarn app -flex simple-aa-1 -component ping 6
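Only 5 containers will be allocated and running for simple-aa-1, since the anti-affinity constraint permits one ping container per node; the sixth request stays pending. At this point, flex it down to 5 containers -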
yarn app -flex simple-aa-1 -component ping 5
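The container counts in the steps above can be sketched as simple arithmetic. This is an illustrative model only, not YARN scheduler code: the function name `placeable` is hypothetical, and it assumes every node has enough free resources for one ping container.

```python
from typing import Tuple


def placeable(requested: int, nodes: int) -> Tuple[int, int]:
    """Return (running, pending) for a component whose NODE-scoped
    anti-affinity targets its own tag, so each node hosts at most one
    container of the component. Hypothetical helper for illustration."""
    running = min(requested, nodes)
    pending = requested - running
    return running, pending


# Flex up to 6 on a 5-node cluster: one request cannot be placed.
print(placeable(6, 5))
# Flex back down to 5: every request fits on its own node.
print(placeable(5, 5))
```

Under this model, a flex from 6 to 5 on 5 nodes would be expected to leave 5 running containers and nothing pending, which is the baseline to compare the observed behaviour against.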
This is what is seen in the service AM log at this point -
2018-05-03 20:17:38,469 [IPC Server handler 0 on 38124] INFO service.ClientAMService - Flexing component ping to 5
2018-05-03 20:17:38,469 [Component dispatcher] INFO component.Component - [FLEX DOWN COMPONENT ping]: scaling down from 6 to 5
2018-05-03 20:17:38,470 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Flexed down by user, destroying.
2018-05-03 20:17:38,473 [Component dispatcher] INFO component.Component - [COMPONENT ping] Transitioned from FLEXING to STABLE on FLEX event.
2018-05-03 20:17:38,474 [pool-5-thread-8] INFO registry.YarnRegistryViewForProviders - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleting registry path /users/root/services/yarn-service/simple-aa-1/components/ctr-1525297086734-0013-01-000006
2018-05-03 20:17:38,476 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
    at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
    at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:745)
2018-05-03 20:17:38,480 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
    at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
    at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:745)
2018-05-03 20:17:38,578 [pool-5-thread-8] INFO instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleted component instance dir: hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-4
2018-05-03 20:17:39,268 [AMRM Callback Handler Thread] WARN service.ServiceScheduler - Container container_1525297086734_0013_01_000006 Completed. No component instance exists. exitStatus=-100. diagnostics=Container released by application
2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO service.ServiceScheduler - 1 containers allocated.
2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO service.ServiceScheduler - [COMPONENT ping]: remove 0 outstanding container requests for allocateId 0
2018-05-03 20:17:40,274 [Component dispatcher] INFO component.Component - [COMPONENT ping]: container_1525297086734_0013_01_000007 allocated, num pending component instances reduced to 0
2018-05-03 20:17:40,274 [Component dispatcher] INFO component.Component - [COMPONENT ping]: Assigned container_1525297086734_0013_01_000007 to component instance ping-5 and launch on host ctr-e138-1518143905142-280820-01-000008.example.site:25454
2018-05-03 20:17:40,277 [pool-6-thread-6] INFO provider.ProviderUtils - [COMPINSTANCE ping-5 : container_1525297086734_0013_01_000007]: Creating dir on hdfs: hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-5
2018-05-03 20:17:40,316 [pool-6-thread-6] INFO containerlaunch.ContainerLaunchService - launching container container_1525297086734_0013_01_000007
2018-05-03 20:17:40,318 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #5] INFO impl.NMClientAsyncImpl - Processing Event EventType: START_CONTAINER for Container container_1525297086734_0013_01_000007
2018-05-03 20:17:40,338 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CONTAINER_STARTED at STABLE
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_STARTED at STABLE
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
    at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
    at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:745)
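The sequence of errors in this log can be replayed with a toy state machine. This is a rough sketch and not the Hadoop `StateMachineFactory` or the real `Component` transition table: the class names are hypothetical, the only arc modelled is the one the log reports (FLEXING to STABLE on FLEX), and both kinds of failure are simplified to "no arc for this event". It does not attempt to model how the component later reports FLEXING.

```python
class InvalidStateTransition(Exception):
    """Raised when the toy machine has no arc for an event in its current state."""


class ToyComponent:
    # Hypothetical table holding only the arc visible in the log:
    # FLEXING --FLEX--> STABLE. The real component has many more arcs.
    ARCS = {("FLEXING", "FLEX"): "STABLE"}

    def __init__(self, state="FLEXING"):
        self.state = state

    def handle(self, event):
        key = (self.state, event)
        if key not in self.ARCS:
            raise InvalidStateTransition(
                "Invalid event: {} at {}".format(event, self.state))
        self.state = self.ARCS[key]
        return self.state


def replay(events):
    """Feed events in order; collect error messages instead of stopping."""
    comp = ToyComponent()
    errors = []
    for event in events:
        try:
            comp.handle(event)
        except InvalidStateTransition as exc:
            errors.append(str(exc))
    return comp.state, errors


# Event order taken from the log: the flex-down, two stability checks,
# then the start of the newly allocated container.
state, errors = replay(["FLEX", "CHECK_STABLE", "CHECK_STABLE", "CONTAINER_STARTED"])
print(state)
for message in errors:
    print(message)
```

In this sketch the component is already STABLE when the later events arrive, so each of them is rejected, which mirrors the three ERROR entries in the log.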
The status response shows that only 4 containers are running. The ping component reports FLEXING and the service reports STARTED, so the service has not reached the STABLE state -
yarn app -status simple-aa-1
Output -
{
  "components": [
    {
      "configuration": {
        "env": {},
        "files": [],
        "properties": {}
      },
      "containers": [
        {
          "bare_host": "ctr-e138-1518143905142-280820-01-000007.example.site",
          "component_instance_name": "ping-1",
          "hostname": "ctr-e138-1518143905142-280820-01-000007.example.site",
          "id": "container_1525297086734_0013_01_000003",
          "ip": "x.x.x.x",
          "launch_time": 1525378141535,
          "state": "READY"
        },
        {
          "bare_host": "ctr-e138-1518143905142-280820-01-000006.example.site",
          "component_instance_name": "ping-0",
          "hostname": "ctr-e138-1518143905142-280820-01-000006.example.site",
          "id": "container_1525297086734_0013_01_000002",
          "ip": "x.x.x.x",
          "launch_time": 1525378141513,
          "state": "READY"
        },
        {
          "bare_host": "ctr-e138-1518143905142-280820-01-000005.example.site",
          "component_instance_name": "ping-3",
          "hostname": "ctr-e138-1518143905142-280820-01-000005.example.site",
          "id": "container_1525297086734_0013_01_000005",
          "ip": "x.x.x.x",
          "launch_time": 1525378303429,
          "state": "READY"
        },
        {
          "bare_host": "ctr-e138-1518143905142-280820-01-000004.example.site",
          "component_instance_name": "ping-2",
          "hostname": "ctr-e138-1518143905142-280820-01-000004.example.site",
          "id": "container_1525297086734_0013_01_000004",
          "ip": "x.x.x.x",
          "launch_time": 1525378303425,
          "state": "READY"
        }
      ],
      "dependencies": [],
      "launch_command": "sleep 9000",
      "name": "ping",
      "number_of_containers": 5,
      "placement_policy": {
        "constraints": [
          {
            "node_attributes": {},
            "node_partitions": [],
            "scope": "NODE",
            "target_tags": [ "ping" ],
            "type": "ANTI_AFFINITY"
          }
        ]
      },
      "quicklinks": [],
      "resource": {
        "additional": {},
        "cpus": 1,
        "memory": "256"
      },
      "run_privileged_container": false,
      "state": "FLEXING"
    }
  ],
  "configuration": {
    "env": {},
    "files": [],
    "properties": {}
  },
  "id": "application_1525297086734_0013",
  "kerberos_principal": {},
  "lifetime": -1,
  "name": "simple-aa-1",
  "quicklinks": {},
  "state": "STARTED",
  "version": "1"
}
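The mismatch in this output can be checked mechanically. The following is a minimal sketch, assuming the status JSON has already been loaded into a Python dict; the function name `summarize` is hypothetical, the field names are the ones visible in the output above, and the sample is trimmed to the fields the check reads.

```python
def summarize(status):
    """Return (component name, component state, READY containers, desired
    containers) for each component in a service status dict."""
    rows = []
    for comp in status.get("components", []):
        ready = sum(1 for c in comp.get("containers", [])
                    if c.get("state") == "READY")
        rows.append((comp["name"], comp["state"], ready,
                     comp["number_of_containers"]))
    return rows


# Trimmed copy of the status output above (only the fields used here).
sample = {
    "name": "simple-aa-1",
    "state": "STARTED",
    "components": [
        {
            "name": "ping",
            "state": "FLEXING",
            "number_of_containers": 5,
            "containers": [
                {"component_instance_name": "ping-1", "state": "READY"},
                {"component_instance_name": "ping-0", "state": "READY"},
                {"component_instance_name": "ping-3", "state": "READY"},
                {"component_instance_name": "ping-2", "state": "READY"},
            ],
        }
    ],
}

for name, state, ready, desired in summarize(sample):
    print("{}: state={} ready={} desired={}".format(name, state, ready, desired))
```

For the output above this reports 4 READY containers against a desired count of 5, with the component in FLEXING.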