Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-1709

TestScmSafeNode is flaky

    XMLWordPrintableJSON

Details

    • Done

    Description

      org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode is failed at last night with the following error:

      java.lang.AssertionError at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.ozone.om.TestScmSafeMode.testSCMSafeMode(TestScmSafeMode.java:285) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)

      Locally it can be tested but it's very easy to reproduce by adding an additional sleep DataNodeSafeModeRule:

      +++ b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/DataNodeSafeModeRule.java
      @@ -63,7 +63,11 @@ protected boolean validate() {
       
         @Override
         protected void process(NodeRegistrationContainerReport reportsProto) {
      -
      +    try {
      +      Thread.sleep(3000);
      +    } catch (InterruptedException e) {
      +      e.printStackTrace();
      +    }

      This is a clear race condition:

      DatanodeSafeModeRule and ContainerSafeModeRule are processing the same events but it can be possible (in case of an accidental sleep) that the container safe mode rule is done, but DatanodeSafeModeRule didn't process the new event (yet).

      As a result the test execution will continue:

      GenericTestUtils
          .waitFor(() -> scm.getCurrentContainerThreshold() == 1.0, 100, 20000);
      

      (This line is waiting ONLY for the ContainerSafeModeRule).

      The fix is easy, let's wait for the processing of all the async events:

      EventQueue eventQueue =
          (EventQueue) cluster.getStorageContainerManager().getEventQueue();
      eventQueue.processAll(5000L);

      As we are sure that the events are already sent to the EventQueue (because we have the previous waitFor), it should be enough.

      Attachments

        Issue Links

          Activity

            People

              elek Marton Elek
              elek Marton Elek
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 50m
                  50m