Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
Description
Scenario:
1) Run a long-running Spark application
2) Trigger RM and NN failovers at random
3) Validate the application state in YARN
The Spark application has finished. The YARN CLI returns the correct status for the application:
[hrt_qa@xxx hadoopqe]$ yarn application -status application_1503203977699_0014
17/08/21 16:56:10 INFO client.AHSProxy: Connecting to Application History server at host1 xxx.xx.xx.x:10200
17/08/21 16:56:10 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
17/08/21 16:56:10 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm1]
Application Report :
    Application-Id : application_1503203977699_0014
    Application-Name : org.apache.spark.sql.execution.datasources.hbase.examples.LRJobForDataSources
    Application-Type : SPARK
    User : hrt_qa
    Queue : default
    Application Priority : null
    Start-Time : 1503215983532
    Finish-Time : 1503250203806
    Progress : 0%
    State : FAILED
    Final-State : FAILED
    Tracking-URL : https://host1:8090/cluster/app/application_1503203977699_0014
    RPC Port : -1
    AM Host : N/A
    Aggregate Resource Allocation : 174722793 MB-seconds, 170603 vcore-seconds
    Log Aggregation Status : SUCCEEDED
    Diagnostics : Application application_1503203977699_0014 failed 20 times due to AM Container for appattempt_1503203977699_0014_000020 exited with exitCode: 1
For more detailed output, check the application tracking page: https://host1:8090/cluster/app/application_1503203977699_0014 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e04_1503203977699_0014_20_000001
Exit code: 1
Stack trace: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:109)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:89)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:392)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Shell output: main : command provided 1
main : run as user is hrt_qa
main : requested yarn user is hrt_qa
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /grid/0/hadoop/yarn/local/nmPrivate/application_1503203977699_0014/container_e04_1503203977699_0014_20_000001/container_e04_1503203977699_0014_20_000001.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
    Unmanaged Application : false
    Application Node Label Expression : <Not set>
    AM container Node Label Expression : <DEFAULT_PARTITION>
However, the RM UI "All Applications" page still shows the application in the RUNNING state:
https://host1:8090/cluster
Clicking the application ID (https://host1:8090/cluster/app/application_1503203977699_0014) opens the application page, which shows the correct application state, FAILED.
The application state is not being updated on the YARN "All Applications" page.
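Step 3 of the scenario can be automated by comparing the state reported by the CLI with the state read from the UI. The following is a minimal sketch, not part of the original report: the helper names `parse_app_report` and `states_disagree` are hypothetical, and it assumes the output of `yarn application -status` has been captured as text with one `Key : Value` field per line. How the UI state is obtained is left to the caller.

```python
import re

# Matches "Key : Value" report fields such as "State : FAILED".
# Keys must start with a letter, so timestamped log lines are skipped.
FIELD_RE = re.compile(r"^\s*([A-Za-z][A-Za-z \-]*?)\s+:\s+(.*?)\s*$")


def parse_app_report(report_text):
    """Return a dict of the fields found in a captured application report.

    Hypothetical helper: assumes one field per line in the captured text.
    """
    fields = {}
    for line in report_text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            key, value = match.groups()
            # Keep the first occurrence of each key.
            fields.setdefault(key, value)
    return fields


def states_disagree(report_text, ui_state):
    """True when the CLI report has a State field that differs from ui_state."""
    cli_state = parse_app_report(report_text).get("State")
    return cli_state is not None and cli_state != ui_state
```

With the report from this issue and a UI state of RUNNING, `states_disagree` would flag the mismatch, since the CLI report contains `State : FAILED`.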
Attachments
Issue Links
- duplicates YARN-7163 RMContext need not to be injected to webapp and other Always Running services. (Resolved)