Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- None
- Reviewed
Description
TestDistributedShell times out on trunk. I found that the application and its containers stay running in the background long after the unit test has failed.
This causes other test cases to fail and produces several false-positive failures because:
- Ports stay busy, so other test cases fail to launch.
- Unit tests fail because of memory restrictions.
Although the unit test is already broken on trunk, we do not want its failures to spread to other unit tests.
TestDistributedShell needs to be revisited to make sure that all YarnClients and YarnApplications are closed properly at the end of each unit test (including on exceptions and timeouts).
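One way to approach the cleanup requirement above is to register every client and application handle when it is created and to close them all in a finally block or teardown method. The following is only a minimal sketch under that assumption: TestResourceTracker is a hypothetical helper, not an existing Hadoop or JUnit class, and the lambdas stand in for real YarnClient and application handles.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper (not part of Hadoop): tracks resources a test opens. */
final class TestResourceTracker {
  private final Deque<AutoCloseable> opened = new ArrayDeque<>();

  /** Registers a resource and returns it so the call can wrap a constructor. */
  synchronized <T extends AutoCloseable> T track(T resource) {
    opened.push(resource);
    return resource;
  }

  /**
   * Closes everything in reverse order of creation. A failure to close one
   * resource does not prevent the remaining ones from being closed.
   */
  synchronized List<Exception> closeAll() {
    List<Exception> failures = new ArrayList<>();
    while (!opened.isEmpty()) {
      try {
        opened.pop().close();
      } catch (Exception e) {
        failures.add(e);
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    TestResourceTracker tracker = new TestResourceTracker();
    List<String> log = new ArrayList<>();
    // Stand-ins for a YarnClient and a running application (assumptions).
    AutoCloseable client = () -> log.add("client closed");
    AutoCloseable application = () -> log.add("application killed");
    try {
      tracker.track(client);
      tracker.track(application);
      throw new IllegalStateException("simulated test failure");
    } catch (IllegalStateException expected) {
      log.add("test failed");
    } finally {
      // Runs whether the test body passed or threw.
      tracker.closeAll();
    }
    System.out.println(log); // [test failed, application killed, client closed]
  }
}
```

In a real test class, closeAll() would be called from the teardown method, and any returned exceptions could be logged rather than rethrown so that one stuck resource does not hide the original test failure.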
Steps to reproduce:
mvn test -Dtest=TestDistributedShell#testDSShellWithOpportunisticContainers ## this will time out as
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 90.234 s <<< FAILURE! - in org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
[ERROR] testDSShellWithOpportunisticContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 90.018 s <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 90000 milliseconds
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:1117)
    at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:1089)
    at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers(TestDistributedShell.java:1438)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.lang.Thread.run(Thread.java:748)
[INFO]
[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR] TestDistributedShell.testDSShellWithOpportunisticContainers:1438 » TestTimedOut
[INFO]
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
Using the ps command, you can see that the YARN processes are still running in the background:
/bin/bash -c $JRE_HOME/bin/java -Xmx512m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_type OPPORTUNISTIC --container_memory 128 --container_vcores 1 --num_containers 2 --priority 0 --appname DistributedShell --homedir file:/Users/ahussein 1>$WORK_DIR8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/TestDistributedShell/TestDistributedShell-logDir-nm-0_0/application_1593554710896_0001/container_1593554710896_0001_01_000001/AppMaster.stdout 2>$WORK_DIR8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/TestDistributedShell/TestDistributedShell-logDir-nm-0_0/application_1593554710896_0001/container_1593554710896_0001_01_000001/AppMaster.stderr
$JRE_HOME/bin/java -Xmx512m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_type OPPORTUNISTIC --container_memory 128 --container_vcores 1 --num_containers 2 --priority 0 --appname DistributedShell --homedir file:/Users/ahussein
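The same check could be automated, for example to assert after a test run that no ApplicationMaster was left behind. The sketch below is an illustration only: LeftoverProcessCheck is a hypothetical class, it relies on the ProcessHandle API and so assumes Java 9 or later (the trace above appears to come from an older JDK), and on some platforms the command line of other processes may not be visible, in which case nothing is reported.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical check (not part of Hadoop): looks for leftover AM processes. */
final class LeftoverProcessCheck {
  // Main class taken from the ps output above.
  static final String AM_CLASS =
      "org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster";

  /** Returns true if the command line belongs to a distributed shell AM. */
  static boolean isLeftover(String commandLine) {
    return commandLine != null && commandLine.contains(AM_CLASS);
  }

  /** Lists the PIDs of visible processes whose command line matches. */
  static List<Long> findLeftoverPids() {
    return ProcessHandle.allProcesses()
        .filter(p -> isLeftover(p.info().commandLine().orElse(null)))
        .map(ProcessHandle::pid)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println("Leftover ApplicationMaster PIDs: " + findLeftoverPids());
  }
}
```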
Attachments
Issue Links
- depends upon
  - YARN-10536 Client in distributedShell swallows interrupt exceptions (Resolved)
- is related to
  - YARN-10040 DistributedShell test failure on X86 and ARM (Resolved)
  - YARN-10553 Refactor TestDistributedShell (Resolved)
- links to
Nice finding ahussein. It could potentially cause lots of intermittent issues in Hadoop's unit test runs.
I think revisiting this test may not be that easy, but I hope someone can afford some time to look at it.