Hadoop YARN / YARN-10460

Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4.0
    • Component/s: nodemanager, test
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      In our downstream build environment, we're using JUnit 4.13. Recently, we discovered a truly weird test failure in TestNodeStatusUpdater.

      The problem is that timeout handling changed in JUnit 4.13. Compare the evaluate() method of FailOnTimeout in the two versions:

      4.12

          @Override
          public void evaluate() throws Throwable {
              CallableStatement callable = new CallableStatement();
              FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
              threadGroup = new ThreadGroup("FailOnTimeoutGroup");
              Thread thread = new Thread(threadGroup, task, "Time-limited test");
              thread.setDaemon(true);
              thread.start();
              callable.awaitStarted();
              Throwable throwable = getResult(task, thread);
              if (throwable != null) {
                  throw throwable;
              }
          }
      

       
      4.13

          @Override
          public void evaluate() throws Throwable {
              CallableStatement callable = new CallableStatement();
              FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
              ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
              Thread thread = new Thread(threadGroup, task, "Time-limited test");
              try {
                  thread.setDaemon(true);
                  thread.start();
                  callable.awaitStarted();
                  Throwable throwable = getResult(task, thread);
                  if (throwable != null) {
                      throw throwable;
                  }
              } finally {
                  try {
                      thread.join(1);
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                  }
                  try {
                      threadGroup.destroy();  // <---- this call is new in 4.13
                  } catch (IllegalThreadStateException e) {
                      // If a thread from the group is still alive, the ThreadGroup cannot be destroyed.
                      // Swallow the exception to keep the same behavior prior to this change.
                  }
              }
          }
      

      The change comes from https://github.com/junit-team/junit4/pull/1517.

      Unfortunately, destroying the thread group causes a problem because the IPC layer caches several kinds of objects. The resulting exception is:

      java.lang.IllegalThreadStateException
      	at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
      	at java.lang.Thread.init(Thread.java:402)
      	at java.lang.Thread.init(Thread.java:349)
      	at java.lang.Thread.<init>(Thread.java:675)
      	at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
      	at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
      	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
      	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
      	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
      	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1458)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1405)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
      	at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
      	at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
      	at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
      	at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
      

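      Both the clientExecutor in org.apache.hadoop.ipc.Client and the client object in ProtobufRpcEngine/ProtobufRpcEngine2 are cached and reused across test cases. As the stack trace shows, the executor creates its threads through Executors$DefaultThreadFactory, which places every new thread in the thread group of the thread that originally created the factory. Since JUnit 4.13 destroys that thread group once the earlier test finishes, a later test can no longer create threads through the cached executor, and the Thread constructor fails in ThreadGroup.addUnstarted.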
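
      The capturing behaviour can be reproduced without Hadoop or JUnit. The following is a minimal sketch, not the Hadoop code: the class and method names are invented for illustration. It relies only on the documented behaviour of Executors.defaultThreadFactory(), which, when no SecurityManager is installed, creates all of its threads in the group of the thread that invoked it. The sketch does not destroy the group; it only shows that a factory created inside a "FailOnTimeoutGroup" thread keeps using that group when called from elsewhere.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; none of these names come from Hadoop or JUnit.
class CapturedThreadGroupDemo {

    // Creates a default thread factory on a thread that belongs to a
    // dedicated group, the way a JUnit "Time-limited test" thread would.
    static ThreadFactory createFactoryInsideGroup(String groupName) {
        ThreadGroup group = new ThreadGroup(groupName);
        AtomicReference<ThreadFactory> holder = new AtomicReference<>();
        Thread testThread = new Thread(group,
            () -> holder.set(Executors.defaultThreadFactory()),
            "Time-limited test");
        testThread.start();
        try {
            testThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return holder.get();
    }

    // Returns the name of the group that the factory assigns to a new thread.
    static String groupOfNewThread(ThreadFactory factory) {
        Thread worker = factory.newThread(() -> { });
        return worker.getThreadGroup().getName();
    }

    public static void main(String[] args) {
        ThreadFactory cached = createFactoryInsideGroup("FailOnTimeoutGroup");
        // Called from the main thread, yet the new thread lands in the
        // group of the thread that created the factory.
        System.out.println(groupOfNewThread(cached));
    }
}
```

      If the captured group has been destroyed in the meantime, the factory.newThread(...) call is the point where the IllegalThreadStateException from the stack trace would surface.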

      A quick workaround is to stop the clients and completely clear the ClientCache in ProtobufRpcEngine before each test case. I tried this and it solves the problem, but it feels hacky. I'm not sure whether there is a better approach.
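
      The shape of that workaround can be sketched as follows. This is a simplified, self-contained model and not the attached patch: CachedClient, ClientCacheModel, getClient and stopAndClear are hypothetical stand-ins for the real Client and ClientCache classes, whose actual methods may differ. In a real test class the clearing step would presumably run in a JUnit @Before method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for an IPC client that owns a thread pool.
class CachedClient {
    final ExecutorService executor = Executors.newCachedThreadPool();

    void stop() {
        executor.shutdownNow();
    }
}

// Hypothetical stand-in for a process-wide client cache.
class ClientCacheModel {
    private final Map<String, CachedClient> clients = new HashMap<>();

    // Returns the cached client for the key, creating it on first use.
    synchronized CachedClient getClient(String key) {
        return clients.computeIfAbsent(key, k -> new CachedClient());
    }

    // Stops every cached client and forgets it, so that the next test
    // builds fresh clients (and fresh thread factories) in its own group.
    synchronized void stopAndClear() {
        for (CachedClient client : clients.values()) {
            client.stop();
        }
        clients.clear();
    }

    public static void main(String[] args) {
        ClientCacheModel cache = new ClientCacheModel();
        CachedClient first = cache.getClient("nm");
        cache.stopAndClear();                 // what a @Before method would do
        CachedClient second = cache.getClient("nm");
        System.out.println(first != second);             // true
        System.out.println(first.executor.isShutdown()); // true
        second.stop();
    }
}
```

      The trade-off is that every test case pays for rebuilding its clients, and the test has to reach into state that is otherwise an internal detail of the IPC layer.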

        Attachments

        1. YARN-10460-POC.patch
          3 kB
          Peter Bacsko
        2. YARN-10460-001.patch
          3 kB
          Peter Bacsko
        3. YARN-10460-002.patch
          3 kB
          Peter Bacsko

              People

              • Assignee: Peter Bacsko (pbacsko)
              • Reporter: Peter Bacsko (pbacsko)
              • Votes: 0
              • Watchers: 6
