Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-10672

All testcases in TestReservations are flaky

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0, 3.3.1, 3.2.3
    • 3.4.0, 3.3.1, 3.2.3
    • reservation system, test
    • None

    Description

      All testcases in TestReservations are flaky

      Running a particular test in TestReservations 100 times never passes all the time.
      For example, let's run testReservationNoContinueLook 100 times. For me, it produced 39 failed and 61 passed results.
      Sometimes just 1 out of 100 runs is failed.
      Screenshot is attached.

      Stacktrace:

      java.lang.AssertionError: 
      Expected :2048
      Actual   :0
      <Click to see difference>
      
      at org.junit.Assert.fail(Assert.java:89)
      at org.junit.Assert.failNotEquals(Assert.java:835)
      at org.junit.Assert.assertEquals(Assert.java:647)
      at org.junit.Assert.assertEquals(Assert.java:633)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
      

      The test fails here:

       // Start testing...
      // Only AM
      TestUtils.applyResourceCommitRequest(clusterResource,
          a.assignContainers(clusterResource, node_0,
              new ResourceLimits(clusterResource),
              SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
      assertEquals(2 * GB, a.getUsedResources().getMemorySize());
      

      With some debugging (patch attached), I realized that sometimes there are no registered nodes so the AM can't be allocated and test will fail:

      2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator (RegularContainerAllocator.java:canAssign(312)) - ******Can't assign container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
      

      In these cases, this is also printed from org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:

      2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real getNumClusterNodes
      

      Let's break this down:

      1. The mocking happens in org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration, boolean):

      cs.setRMContext(spyRMContext);
      cs.init(csConf);
      cs.start();
      
      when(cs.getNumClusterNodes()).thenReturn(3);
      

      Under no circumstances this could be allowed to return any other value than 3.
      However, as mentioned above, sometimes the real method of 'getNumClusterNodes' is called on CapacityScheduler.

      2. Sometimes, this gets printed to the console:

      org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
      Integer cannot be returned by isMultiNodePlacementEnabled()
      isMultiNodePlacementEnabled() should return boolean
      ***
      If you're unsure why you're getting above error read on.
      Due to the nature of the syntax above problem might occur because:
      1. This exception *might* occur in wrongly written multi-threaded tests.
         Please refer to Mockito FAQ on limitations of concurrency testing.
      2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - 
         - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
      
      
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
      	at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
      	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
      	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
      	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
      	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
      	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
      	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
      	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
      	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
      	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
      	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
      	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
      	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
      	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
      	at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:40)
      	at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
      	at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
      
      

      TestReservations.java:166 is exactly our suspicious stubbing call:

      when(cs.getNumClusterNodes()).thenReturn(3);
      

      But what the heck is the relation between this and the isMultiNodePlacementEnabled method?
      Let's dive into this more.

      3. The only caller of isMultiNodePlacementEnabled is org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager#dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded, that gets called from the anonymous Thread instance (assigned to field 'cleanupThread') in class ActivitiesManager
      The ActivitiesManager.serviceStart is invoked when the CapacityScheduler is started.

      Theory: When CS is started here, the ActivitiesManager starts running and eventually cs.isMultiNodePlacementEnabled() will be called by the thread here.
      As the main thread (JUnit) started to stub method 'getNumClusterNodes', there's an ongoing stubbing so when isMultiNodePlacementEnabled is called, the Answer will hold the value of 3, which is improper for a boolean method, so the stubbing will fail.
      This is a race-condition situation, it is presented with 2 screenshots.
      As the stubbing failed, subsequent calls to getNumClusterNodes will return 0 as the original implementation returns 0 for the tests. This is the root cause of the test failures.

      4. Some references to this mockito "issue":

      CONCLUSION & FIX
      As all tests in this class are affected by the setup method and this thread-safety issue, we need to fix it consistently.
      Actually, the fix is quite simple: All the stub calls of the CS instance should happen before the CS.init / CS.start calls, that triggers the ActivitiesManager thread.

      Attachments

        1. YARN-10672-debuglogs.patch
          24 kB
          Szilard Nemeth
        2. YARN-10672.branch-3.3.001.patch
          2 kB
          Szilard Nemeth
        3. YARN-10672.branch-3.2.001.patch
          2 kB
          Peter Bacsko
        4. YARN-10672.001.patch
          2 kB
          Szilard Nemeth
        5. Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png
          389 kB
          Szilard Nemeth
        6. Screenshot-mockitostubbing1-2021-03-04 at 22.34.01.png
          210 kB
          Szilard Nemeth
        7. Screenshot 2021-03-04 at 22.06.20.png
          306 kB
          Szilard Nemeth
        8. Screenshot 2021-03-04 at 21.34.18.png
          284 kB
          Szilard Nemeth

        Issue Links

          Activity

            People

              snemeth Szilard Nemeth
              snemeth Szilard Nemeth
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: