  1. Hadoop YARN
  2. YARN-10334

TestDistributedShell leaks resources on timeout/failure

Details

    • Reviewed

    Description

      TestDistributedShell times out on trunk. The application and its containers keep running in the background long after the unit test has failed.
      This causes other test cases to fail and produces several false-positive failures because:

      • Ports stay busy, so other test cases fail to launch.
      • Unit tests fail because of memory restrictions.

      Although the unit test is already broken on trunk, we do not want its failures to cause other unit tests to fail.
      TestDistributedShell needs to be revisited to make sure that all YarnClients and YarnApplications are closed properly at the end of each unit test (including on exceptions and timeouts).
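
The cleanup requirement above can be illustrated with a minimal sketch in plain Java. It does not use the real Hadoop classes: `TrackedClient`, its `OPEN` registry, and `CleanupSketch` are hypothetical stand-ins, assumed here only to show that a resource opened by a test is released on both the success and the failure path.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a client that holds ports/memory until closed. */
class TrackedClient implements AutoCloseable {
  // Registry of clients that are still open; used only to observe leaks.
  static final List<TrackedClient> OPEN = new ArrayList<>();
  private final String name;

  TrackedClient(String name) {
    this.name = name;
    OPEN.add(this);
  }

  /** Simulates the client run; throws when asked to fail. */
  void run(boolean fail) {
    if (fail) {
      throw new IllegalStateException(name + " failed");
    }
  }

  @Override
  public void close() {
    OPEN.remove(this);
  }
}

class CleanupSketch {
  /** Runs the client and reports success; the client is closed either way. */
  static boolean runAndCleanUp(boolean fail) {
    // try-with-resources closes the client before the catch block runs.
    try (TrackedClient client = new TrackedClient("ds-client")) {
      client.run(fail);
      return true;
    } catch (IllegalStateException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // prints: success=false, open=0
    System.out.println("success=" + runAndCleanUp(true)
        + ", open=" + TrackedClient.OPEN.size());
  }
}
```

In the actual test class the equivalent cleanup would presumably live in a `finally` block or a teardown method; that placement is an assumption, not something this issue specifies.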

      Steps to reproduce:

      mvn test -Dtest=TestDistributedShell#testDSShellWithOpportunisticContainers
      
      ## this will time out as follows:
      [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 90.234 s <<< FAILURE! - in org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
      [ERROR] testDSShellWithOpportunisticContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)  Time elapsed: 90.018 s  <<< ERROR!
      org.junit.runners.model.TestTimedOutException: test timed out after 90000 milliseconds
              at java.lang.Thread.sleep(Native Method)
              at org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:1117)
              at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:1089)
              at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers(TestDistributedShell.java:1438)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:498)
              at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
              at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
              at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
              at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
              at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
              at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
              at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
              at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.lang.Thread.run(Thread.java:748)
      
      [INFO] 
      [INFO] Results:
      [INFO] 
      [ERROR] Errors: 
      [ERROR]   TestDistributedShell.testDSShellWithOpportunisticContainers:1438 » TestTimedOut
      [INFO] 
      [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0
      

      Using the ps command, you can see that the YARN processes are still running in the background:

      /bin/bash -c $JRE_HOME/bin/java -Xmx512m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_type OPPORTUNISTIC --container_memory 128 --container_vcores 1 --num_containers 2 --priority 0 --appname DistributedShell --homedir file:/Users/ahussein 1>$WORK_DIR8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/TestDistributedShell/TestDistributedShell-logDir-nm-0_0/application_1593554710896_0001/container_1593554710896_0001_01_000001/AppMaster.stdout 2>$WORK_DIR8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/target/TestDistributedShell/TestDistributedShell-logDir-nm-0_0/application_1593554710896_0001/container_1593554710896_0001_01_000001/AppMaster.stderr
      
      
      $JRE_HOME/bin/java -Xmx512m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_type OPPORTUNISTIC --container_memory 128 --container_vcores 1 --num_containers 2 --priority 0 --appname DistributedShell --homedir file:/Users/ahussein
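
The same check can be done programmatically. The following is a rough sketch, assuming Java 9 or later (for `ProcessHandle`) and assuming that the platform exposes process command lines, which is not guaranteed everywhere. `LeakCheckSketch` and its method names are hypothetical; only the ApplicationMaster class name is taken from the ps output above.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class LeakCheckSketch {
  // Class name copied from the ps output in this issue.
  static final String AM_CLASS =
      "org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster";

  /** True if the command line belongs to a DistributedShell ApplicationMaster. */
  static boolean isLeakedAm(Optional<String> commandLine) {
    return commandLine.map(c -> c.contains(AM_CLASS)).orElse(false);
  }

  /** Returns the PIDs of visible processes that look like leftover AMs. */
  static List<Long> findLeakedAmPids() {
    return ProcessHandle.allProcesses()
        .filter(p -> isLeakedAm(p.info().commandLine()))
        .map(ProcessHandle::pid)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println("Leftover ApplicationMaster PIDs: " + findLeakedAmPids());
  }
}
```

A non-empty list after the test has finished would indicate the leak described here; an empty list is inconclusive on platforms that hide command lines.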
      

          Activity

            adam.antal Adam Antal added a comment -

            Nice find, ahussein. It could potentially cause lots of intermittent issues in Hadoop's unit test runs.

            I think revisiting this test may not be that easy, but I hope someone can afford some time to look at it.

            ahussein Ahmed Hussein added a comment -

            These are the steps to fix the problem:

            • YARN-10536 is going to make the thread responsive in handling exceptions.
            • Pass a timeout argument to the DistributedShell.Client. This timeout has to be smaller than the TestDistributedShell.timeout rule.
            • Optional: Client and YarnClient have no interfaces to shut down/close. Adding such methods for the unit tests to call would be a good addition to clean up the code.
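
The first two steps can be sketched in plain Java as follows. This is not the code from the merged PR: `MonitorSketch`, `monitor`, and its parameters are hypothetical names, assumed only to show a polling loop that gives up at its own deadline (set shorter than the test's timeout) and that returns promptly when the thread is interrupted.

```java
import java.util.function.BooleanSupplier;

class MonitorSketch {
  /**
   * Polls until the application reports completion, the client-side
   * timeout expires, or the thread is interrupted.
   *
   * @return true only if completion was observed
   */
  static boolean monitor(BooleanSupplier finished, long timeoutMs, long pollMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!finished.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        // Client-side timeout: the caller should now kill the application.
        return false;
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        // Keep the interrupt flag set so the caller can react to it too.
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    long testTimeoutMs = 90_000L;   // matches the 90000 ms in the stack trace
    long clientTimeoutMs = 1_000L;  // assumed value, well below the test timeout
    boolean done = monitor(() -> false, clientTimeoutMs, 100L);
    System.out.println("finished=" + done + ", client timeout < test timeout: "
        + (clientTimeoutMs < testTimeoutMs));
  }
}
```

Because the loop returns before the test-level timeout fires, the test body still has control and can run its cleanup instead of being abandoned mid-sleep.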
            elgoiri Íñigo Goiri added a comment -

            Thanks ahussein for the improvement.
            Merged PR 2571.


            People

              ahussein Ahmed Hussein
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m