Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
Description
Scenario:
- Set yarn.resourcemanager.am.max-attempts = 2
- Start a distributed shell (dshell) application:
yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar hadoop-yarn-applications-distributedshell-*.jar -attempt_failures_validity_interval 60000 -shell_command "sleep 150" -num_containers 16
- Kill the AM process (by pid)
- Print the container list for the 2nd attempt:
yarn container -list appattempt_1450825622869_0001_000002
INFO impl.TimelineClientImpl: Timeline service address: http://xxx:port/ws/v1/timeline/
INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:<port>
Total number of containers :2
Container-Id    Start Time    Finish Time    State    Host    Node Http Address    LOG-URL
container_e12_1450825622869_0001_02_000002    Tue Dec 22 23:07:35 +0000 2015    N/A    RUNNING    xxx:25454    http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000002/hrt_qa
container_e12_1450825622869_0001_02_000001    Tue Dec 22 23:07:34 +0000 2015    N/A    RUNNING    xxx:25454    http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000001/hrt_qa
- Look for the new AM pid
Here, the 2nd AM was supposed to be started on container_e12_1450825622869_0001_02_000001. But the AM was not launched on container_e12_1450825622869_0001_02_000001. It was in the ACQUIRED state.
On the other hand, container_e12_1450825622869_0001_02_000002 got the AM running.
Expected behavior: the RM should not start 2 containers for starting the AM.
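
When reproducing this, it can help to pull the container IDs out of the `yarn container -list` output and compare their sequence numbers per attempt. The following is only a minimal sketch. It assumes the ID layout seen in the output above (`container_e<epoch>_<clusterTimestamp>_<appId>_<attemptId>_<containerSeq>`, with the `e<epoch>_` part optional). The helper names `parse_container_id` and `containers_in_listing` are hypothetical and not part of any YARN API.

```python
import re
from typing import List, NamedTuple, Optional


class ContainerId(NamedTuple):
    epoch: Optional[int]      # RM epoch ("e12" above); None if absent
    cluster_timestamp: int    # RM start time, e.g. 1450825622869
    app_seq: int              # application sequence number
    attempt: int              # application attempt number
    container_seq: int        # container sequence number within the attempt


# Assumed layout, inferred from the IDs shown in this report.
_PATTERN = re.compile(r"container_(?:e(\d+)_)?(\d+)_(\d+)_(\d+)_(\d+)")


def parse_container_id(text: str) -> ContainerId:
    """Split a container ID string into its numeric parts."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a recognised container ID: {text!r}")
    epoch, ts, app, attempt, seq = match.groups()
    return ContainerId(
        int(epoch) if epoch is not None else None,
        int(ts), int(app), int(attempt), int(seq),
    )


def containers_in_listing(listing: str) -> List[ContainerId]:
    """Collect IDs from the first column of `yarn container -list` output."""
    ids = []
    for line in listing.splitlines():
        fields = line.split()
        # Only the first field is checked, so IDs embedded in LOG-URL are ignored.
        if fields and fields[0].startswith("container_"):
            ids.append(parse_container_id(fields[0]))
    return ids
```

For the listing above, this yields two containers for attempt 2 with sequence numbers 1 and 2. In this report the AM ended up running in sequence number 2 rather than 1.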