Hadoop YARN / YARN-2666

TestFairScheduler.testContinuousScheduling fails Intermittently

    Details

    • Type: Test
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: scheduler
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The test fails on trunk.

      Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
      testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)  Time elapsed: 0.582 sec  <<< FAILURE!
      java.lang.AssertionError: expected:<2> but was:<1>
      	at org.junit.Assert.fail(Assert.java:88)
      	at org.junit.Assert.failNotEquals(Assert.java:743)
      	at org.junit.Assert.assertEquals(Assert.java:118)
      	at org.junit.Assert.assertEquals(Assert.java:555)
      	at org.junit.Assert.assertEquals(Assert.java:542)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372)
      

          Activity

          zxu zhihai xu added a comment -

          Hi Wei Yan, could you assign this JIRA to me?
          I think I know what causes this intermittent failure.
          The problem is that ContinuousSchedulingThread calls continuousSchedulingAttempt periodically,
          and continuousSchedulingAttempt doesn't hold the FairScheduler lock,
          so continuousSchedulingAttempt can run at any time:

              for (NodeId nodeId : nodeIdList) {
                FSSchedulerNode node = getFSSchedulerNode(nodeId);
                try {
                  if (node != null && Resources.fitsIn(minimumAllocation,
                      node.getAvailableResource())) {
                    attemptScheduling(node);
                  }
                } catch (Throwable ex) {
                  LOG.error("Error while attempting scheduling for node " + node +
                      ": " + ex.toString(), ex);
                }
              }
          

          The race occurs when testContinuousScheduling calls scheduler.allocate to make a container allocation request.
          It is possible for application.updateResourceRequests in scheduler.allocate to run right after attemptScheduling on the first node and before attemptScheduling on the second node. The second node, which has less available resource, will then allocate the container for this request.
          The result is the failure: both containers are allocated on the same node.
          The default ContinuousSchedulingSleepMs is 5 ms, which is very short. If we increase ContinuousSchedulingSleepMs, the test will fail much less often. We can make the test deterministic by stopping the ContinuousSchedulingThread before the second allocation request and manually calling continuousSchedulingAttempt after it.
          I uploaded a patch that stops the ContinuousSchedulingThread before the second allocation request and manually calls continuousSchedulingAttempt after the second allocation request.
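
          The interleaving described in this comment can be reproduced in a small toy model. The sketch below is illustrative only: Node, ToyScheduler, schedulingAttempt and the afterFirstNode hook are hypothetical names, not FairScheduler classes, and the hook simply stands in for a request that arrives between two node visits. It assumes two equally sized nodes and that a pass visits nodes in order of most available resource first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

/** Toy model only: none of these types are the real FairScheduler classes. */
public class ContinuousSchedulingRaceSketch {

    /** Hypothetical stand-in for a scheduler node. */
    static final class Node {
        int availableMb;
        int containers;
        Node(int availableMb) { this.availableMb = availableMb; }
    }

    /** Hypothetical scheduler: one pending-request queue, one pass over all nodes. */
    static final class ToyScheduler {
        final List<Node> nodes = new ArrayList<>();
        final Deque<Integer> pending = new ArrayDeque<>();

        void allocate(int requestMb) { pending.add(requestMb); }

        /**
         * One scheduling pass. Nodes are visited most-available first.
         * afterFirstNode (may be null) runs once the first node has been
         * visited, emulating a request that arrives in the middle of a pass.
         */
        void schedulingAttempt(Runnable afterFirstNode) {
            List<Node> sorted = new ArrayList<>(nodes);
            sorted.sort(Comparator.comparingInt((Node n) -> n.availableMb).reversed());
            boolean firstNode = true;
            for (Node node : sorted) {
                Integer request = pending.peek();
                if (request != null && node.availableMb >= request) {
                    pending.poll();
                    node.availableMb -= request;
                    node.containers++;
                }
                if (firstNode && afterFirstNode != null) {
                    afterFirstNode.run();
                }
                firstNode = false;
            }
        }
    }

    /** Returns how many distinct nodes host a container after two requests. */
    static int distinctNodesUsed(boolean requestArrivesMidPass) {
        ToyScheduler s = new ToyScheduler();
        s.nodes.add(new Node(8192));
        s.nodes.add(new Node(8192));

        // First request: placed by an ordinary pass.
        s.allocate(1024);
        s.schedulingAttempt(null);

        if (requestArrivesMidPass) {
            // The emptier node is visited before the request exists, so the
            // request is served by the node that already holds a container.
            s.schedulingAttempt(() -> s.allocate(1024));
        } else {
            // Request is fully submitted before the pass starts.
            s.allocate(1024);
            s.schedulingAttempt(null);
        }

        int used = 0;
        for (Node n : s.nodes) {
            if (n.containers > 0) used++;
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.println("mid-pass arrival: " + distinctNodesUsed(true));   // prints 1
        System.out.println("request before pass: " + distinctNodesUsed(false)); // prints 2
    }
}
```

          In this model the mid-pass case yields one distinct node, which matches the shape of the reported assertion (expected 2, was 1), while submitting the request before the pass yields two.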

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708403/YARN-2666.000.patch
          against trunk revision b5a22e9.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7172//console

          This message is automatically generated.

          zxu zhihai xu added a comment -

          Thanks Wei Yan for assigning this JIRA to me. I uploaded a patch, YARN-2666.000.patch, for review.
          The patch makes sure testContinuousScheduling doesn't depend on timing by assigning the second allocation request to the node with more available resource. It doesn't matter which node the first allocation request is assigned to. Before the second allocation request is made, the test stops the continuous scheduler thread. After the second allocation request is made, it starts the continuous scheduler thread. The scheduler sorts the nodes by available resource before assigning containers to them. In this case, the node that the first allocation request was assigned to has less available resource, so the second allocation request is assigned to the node with more available resource. The two containers therefore end up on different nodes.
          The patch didn't touch any code except TestFairScheduler, so the test failure (TestAMRMTokens) is not related to my patch.
          I hit this issue twice yesterday. This intermittent failure has existed for a long time, so it would be better to fix it.
          Hi Tsuyoshi Ozawa, could you review the patch? Many thanks.
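
          As a rough illustration of the test pattern discussed in this thread (quiesce the background thread, submit the request, then drive one scheduling pass by hand), here is a self-contained sketch. It is not the patch itself: ToyScheduler, start, stop, schedulingAttempt and run are hypothetical names, and the 5 ms sleep only mirrors the default interval mentioned earlier. The first request is left to the background thread, so it may land on either node; the second is placed only after the thread has been joined.

```java
import java.util.Arrays;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Toy model only: a hypothetical scheduler with a periodic background thread. */
public class StopThenDriveSketch {

    static final class ToyScheduler {
        private final int[] availableMb = {8192, 8192};
        private final int[] containers = new int[2];
        private final Queue<Integer> pending = new ConcurrentLinkedQueue<>();
        private Thread worker;

        /** Requests can arrive at any time, including in the middle of a pass. */
        void allocate(int requestMb) { pending.add(requestMb); }

        /** One pass over the nodes, most available resource first. */
        synchronized void schedulingAttempt() {
            Integer[] order = {0, 1};
            Arrays.sort(order, (a, b) -> Integer.compare(availableMb[b], availableMb[a]));
            for (int i : order) {
                Integer request = pending.peek();
                if (request != null && availableMb[i] >= request) {
                    pending.poll();
                    availableMb[i] -= request;
                    containers[i]++;
                }
            }
        }

        synchronized int totalContainers() { return containers[0] + containers[1]; }

        synchronized int nodesUsed() {
            return (containers[0] > 0 ? 1 : 0) + (containers[1] > 0 ? 1 : 0);
        }

        /** Starts the periodic background pass (5 ms interval, an assumption). */
        void start() {
            worker = new Thread(() -> {
                while (true) {
                    schedulingAttempt();
                    try {
                        Thread.sleep(5);
                    } catch (InterruptedException e) {
                        return; // stop() was called
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        /** Interrupts the background thread and waits until it has exited. */
        void stop() throws InterruptedException {
            worker.interrupt();
            worker.join();
        }
    }

    /** Returns the number of distinct nodes used after two requests. */
    static int run() {
        ToyScheduler s = new ToyScheduler();
        try {
            s.start();

            // First request: whichever node the background thread picks is fine.
            s.allocate(1024);
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (s.totalContainers() < 1) {
                if (System.nanoTime() > deadline) {
                    throw new IllegalStateException("first request was never placed");
                }
                Thread.sleep(1);
            }

            // Quiesce the background thread so no pass can overlap the request.
            s.stop();

            // Second request, followed by exactly one pass driven by the test.
            s.allocate(1024);
            s.schedulingAttempt();
            return s.nodesUsed();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("distinct nodes used: " + run());
    }
}
```

          Because the manual pass starts only after the second request is fully submitted, the node with more available resource is always visited while the request is pending, so the outcome no longer depends on thread timing.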

          ywskycn Wei Yan added a comment -

          Hey, zhihai xu, I'll take a look at your patch later today.

          zxu zhihai xu added a comment -

          thanks Wei Yan.

          ywskycn Wei Yan added a comment -

          zhihai xu, thanks for your patch. Good catch of the bug. The patch LGTM.

          zxu zhihai xu added a comment -

          Wei Yan, many thanks for the review.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12709083/YARN-2666.000.patch
          against trunk revision 6a6a59d.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7207//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7207//console

          This message is automatically generated.

          zxu zhihai xu added a comment -

          Hi Tsuyoshi Ozawa, I rebased the patch YARN-2666.000.patch on the latest code base and it passed the Jenkins test.
          Do you have time to review/commit the patch? Many thanks.

          ozawa Tsuyoshi Ozawa added a comment -

          OK, I'll check it.

          zxu zhihai xu added a comment -

          thanks Tsuyoshi Ozawa!

          ozawa Tsuyoshi Ozawa added a comment -

          +1. It's better to call scheduler.continuousSchedulingAttempt() instead of waiting for scheduling. Committing this shortly.

          ozawa Tsuyoshi Ozawa added a comment -

          Committed this to trunk and branch-2. Thanks zhihai xu for your contribution.

          zxu zhihai xu added a comment -

          thanks Wei Yan for the review and thanks Tsuyoshi Ozawa for reviewing and committing the patch!


            People

            • Assignee:
              zxu zhihai xu
              Reporter:
              ozawa Tsuyoshi Ozawa
            • Votes:
              0
              Watchers:
              6