Details
Description
All testcases in TestReservations are flaky
Running a particular test in TestReservations 100 times never passes all the time.
For example, let's run testReservationNoContinueLook 100 times. For me, it produced 39 failed and 61 passed results.
Sometimes just 1 out of 100 runs is failed.
Screenshot is attached.
Stacktrace:
java.lang.AssertionError: Expected :2048 Actual :0 <Click to see difference> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:633) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
The test fails here:
// Start testing... // Only AM TestUtils.applyResourceCommitRequest(clusterResource, a.assignContainers(clusterResource, node_0, new ResourceLimits(clusterResource), SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps); assertEquals(2 * GB, a.getUsedResources().getMemorySize());
With some debugging (patch attached), I realized that sometimes there are no registered nodes so the AM can't be allocated and test will fail:
2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator (RegularContainerAllocator.java:canAssign(312)) - ******Can't assign container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
In these cases, this is also printed from org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real getNumClusterNodes
Let's break this down:
1. The mocking happens in org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration, boolean):
cs.setRMContext(spyRMContext); cs.init(csConf); cs.start(); when(cs.getNumClusterNodes()).thenReturn(3);
Under no circumstances this could be allowed to return any other value than 3.
However, as mentioned above, sometimes the real method of 'getNumClusterNodes' is called on CapacityScheduler.
2. Sometimes, this gets printed to the console:
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: Integer cannot be returned by isMultiNodePlacementEnabled() isMultiNodePlacementEnabled() should return boolean *** If you're unsure why you're getting above error read on. Due to the nature of the syntax above problem might occur because: 1. This exception *might* occur in wrongly written multi-threaded tests. Please refer to Mockito FAQ on limitations of concurrency testing. 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method. at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566) at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.junit.runner.JUnitCore.run(JUnitCore.java:137) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69) at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:40) at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220) at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
TestReservations.java:166 is exactly our suspicious stubbing call:
when(cs.getNumClusterNodes()).thenReturn(3);
But what the heck is the relation between this and the isMultiNodePlacementEnabled method?
Let's dive into this more.
3. The only caller of isMultiNodePlacementEnabled is org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager#dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded, that gets called from the anonymous Thread instance (assigned to field 'cleanupThread') in class ActivitiesManager
The ActivitiesManager.serviceStart is invoked when the CapacityScheduler is started.
Theory: When CS is started here, the ActivitiesManager starts running and eventually cs.isMultiNodePlacementEnabled() will be called by the thread here.
As the main thread (JUnit) started to stub method 'getNumClusterNodes', there's an ongoing stubbing so when isMultiNodePlacementEnabled is called, the Answer will hold the value of 3, which is improper for a boolean method, so the stubbing will fail.
This is a race-condition situation, it is presented with 2 screenshots.
As the stubbing failed, subsequent calls to getNumClusterNodes will return 0 as the original implementation returns 0 for the tests. This is the root cause of the test failures.
4. Some references to this mockito "issue":
- https://github.com/mockito/mockito/issues/1151
- https://github.com/mockito/mockito/issues/1151#issuecomment-336718714
- https://github.com/mockito/mockito/pull/1200#issuecomment-336717814
The last link points to the documentation that describes the thread-safety of mockito: https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe
CONCLUSION & FIX
As all tests in this class are affected by the setup method and this thread-safety issue, we need to fix it consistently.
Actually, the fix is quite simple: All the stub calls of the CS instance should happen before the CS.init / CS.start calls, that triggers the ActivitiesManager thread.
Attachments
Attachments
Issue Links
- is related to
-
YARN-10447 TestLeafQueue: ActivitiesManager thread might interfere with ongoing stubbing
- Resolved
-
YARN-10675 Consolidate YARN-10672 and YARN-10447
- Resolved