Hadoop YARN / YARN-9436

Flaky test testApplicationLifetimeMonitor

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: scheduler, test
    • Labels: None

    Description

      In our test environment, we occasionally encounter this failure:

      2019-04-03 12:49:32 [INFO] Running org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
      2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.535 s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
      2019-04-03 12:53:08 [ERROR] testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor)  Time elapsed: 34.244 s  <<< FAILURE!
      2019-04-03 12:53:08 java.lang.AssertionError: Application killed before lifetime value
      2019-04-03 12:53:08 	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218)
      2019-04-03 12:53:08 
      

      The root cause is the condition here:

              Assert.assertTrue("Application killed before lifetime value",
                  totalTimeRun > maxLifetime);
      

      However, there are two problems with this condition:
      1. It is logically incorrect. The app should be killed after 30 seconds, so one would expect totalTimeRun = maxLifetime. Because of asynchronicity and rounding, totalTimeRun ends up being 31 most of the time, which is why the test usually passes.

      2. Sometimes the application is killed quickly enough that totalTimeRun is exactly 30. This is correct behavior, because setUpCSQueue sets the queue lifetime:

          csConf.setMaximumLifetimePerQueue(
              CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime);
          csConf.setDefaultLifetimePerQueue(
              CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime);
      

      A more appropriate condition is:

              Assert.assertTrue("Application killed before lifetime value",
                  totalTimeRun >= maxLifetime);
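
      The difference between the two conditions only shows up at the boundary. The following is a minimal standalone sketch of that boundary case, assuming maxLifetime is 30 seconds as described above. The class BoundaryCheck and its method names are hypothetical and are not part of TestApplicationLifetimeMonitor.

```java
// Illustrative only: BoundaryCheck and its methods are hypothetical names,
// not code from TestApplicationLifetimeMonitor.
final class BoundaryCheck {

    // Original condition: rejects a kill that lands exactly on the lifetime.
    static boolean strictlyAfter(long totalTimeRun, long maxLifetime) {
        return totalTimeRun > maxLifetime;
    }

    // Proposed condition: accepts a kill at or after the lifetime.
    static boolean atOrAfter(long totalTimeRun, long maxLifetime) {
        return totalTimeRun >= maxLifetime;
    }

    public static void main(String[] args) {
        long maxLifetime = 30L; // seconds, the queue lifetime from the report

        // Compare both conditions just below, at, and just above the lifetime.
        for (long totalTimeRun : new long[] {29L, 30L, 31L}) {
            System.out.println("totalTimeRun=" + totalTimeRun
                + " strict(>)=" + strictlyAfter(totalTimeRun, maxLifetime)
                + " relaxed(>=)=" + atOrAfter(totalTimeRun, maxLifetime));
        }
    }
}
```

      Only the value 30 is treated differently: the strict comparison rejects it, while the relaxed one accepts it. That is the case the flaky run hits.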
      

      The assertion message in the next line is also misleading:

              Assert.assertTrue(
                  "Application killed before lifetime value " + totalTimeRun,
                  totalTimeRun < maxLifetime + 10L);
      

      If this condition is false, the application was killed after 40 seconds or more, which is too late for both the app's lifetime (40s) and the queue's lifetime (30s). A more accurate message would be:

              Assert.assertTrue(
                  "Application killed after queue/app lifetime value: " + totalTimeRun,
                  totalTimeRun < maxLifetime + 10L);
      

      We can be even stricter, since we expect a kill almost immediately after 30 seconds:

              Assert.assertTrue(
                  "Application killed too late: " + totalTimeRun,
                  totalTimeRun < maxLifetime + 2L);
      

      where we allow a 2-second tolerance.
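
      Taken together, the two proposed assertions define a window in which the kill must happen. The sketch below combines them into one helper, assuming times are whole seconds. LifetimeWindow and assertKilledWithinWindow are hypothetical names, and plain AssertionError stands in for JUnit's Assert so that the example is self-contained.

```java
// Illustrative only: LifetimeWindow is a hypothetical helper, not part of
// the Hadoop test suite. It throws AssertionError directly instead of
// using JUnit's Assert so that it compiles on its own.
final class LifetimeWindow {

    // Passes only if maxLifetime <= totalTimeRun < maxLifetime + tolerance.
    static void assertKilledWithinWindow(long totalTimeRun, long maxLifetime,
                                         long tolerance) {
        if (totalTimeRun < maxLifetime) {
            throw new AssertionError(
                "Application killed before lifetime value: " + totalTimeRun);
        }
        if (totalTimeRun >= maxLifetime + tolerance) {
            throw new AssertionError(
                "Application killed too late: " + totalTimeRun);
        }
    }

    public static void main(String[] args) {
        long maxLifetime = 30L; // seconds, as in the report
        long tolerance = 2L;    // seconds, as proposed in the report

        // Both a kill at exactly 30s and the more common 31s are accepted.
        assertKilledWithinWindow(30L, maxLifetime, tolerance);
        assertKilledWithinWindow(31L, maxLifetime, tolerance);
        System.out.println("30s and 31s are within the window");
    }
}
```

      With a tolerance of 2, the accepted values are 30 and 31. A run of 29 fails as too early and a run of 32 fails as too late.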

          People

            Assignee: pbacsko Peter Bacsko
            Reporter: pbacsko Peter Bacsko