Hadoop Map/Reduce: MAPREDUCE-7048

Uber AM can crash due to unknown task in statusUpdate

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 3.0.1, 2.10.0, 2.9.1, 2.8.4, 2.7.6
    • Fix Version/s: 3.1.0, 3.0.1, 2.10.0, 2.9.1, 2.8.4, 2.7.6
    • Component/s: mr-am
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      The test case TestUberAM#testThreadDumpOnTaskTimeout was supposed to be fixed by MAPREDUCE-7020, but it still fails; see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/7325/testReport/junit/org.apache.hadoop.mapreduce.v2/TestMRJobs/testThreadDumpOnTaskTimeout/ (other tests failed in that run as well, but those look unrelated).

      When I tried to reproduce it locally, the test failed again. The error message differs slightly from the one in the build report above; it is in fact the same message as before:

      [INFO] -------------------------------------------------------
      [INFO]  T E S T S
      [INFO] -------------------------------------------------------
      [INFO] Running org.apache.hadoop.mapreduce.v2.TestUberAM
      [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 128.192 s <<< FAILURE! - in org.apache.hadoop.mapreduce.v2.TestUberAM
      [ERROR] testThreadDumpOnTaskTimeout(org.apache.hadoop.mapreduce.v2.TestUberAM)  Time elapsed: 79.539 s  <<< FAILURE!
      java.lang.AssertionError: No AppMaster log found! expected:<1> but was:<2>
      	at org.junit.Assert.fail(Assert.java:88)
      	at org.junit.Assert.failNotEquals(Assert.java:743)
      	at org.junit.Assert.assertEquals(Assert.java:118)
      	at org.junit.Assert.assertEquals(Assert.java:555)
      	at org.apache.hadoop.mapreduce.v2.TestMRJobs.testThreadDumpOnTaskTimeout(TestMRJobs.java:1228)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
      	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
      	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
      	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
      	at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
      

      Root cause: System.exit() is still invoked in Task.statusUpdate():

        public void statusUpdate(TaskUmbilicalProtocol umbilical) 
        throws IOException {
          int retries = MAX_RETRIES;
          while (true) {
            try {
              if (!umbilical.statusUpdate(getTaskID(), taskStatus).getTaskFound()) {
                LOG.warn("Parent died.  Exiting "+taskId);
                System.exit(66);
              }
              taskStatus.clearStatus();
              return;
              ...
      

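      At this point the AM no longer knows the task, so getTaskFound() on the value returned by umbilical.statusUpdate() is false and System.exit(66) is called. In uber mode the task runs inside the AM's own JVM, so that exit kills the AM as well. Checking whether we run in uber mode before exiting seems to solve the problem.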
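      The uber-mode check can be illustrated with a small self-contained sketch. It is not the attached patch: the class UberStatusUpdateSketch, the Umbilical interface, the uberized parameter and the injected exit callback are all hypothetical stand-ins, chosen so that the behaviour can be run without a Hadoop cluster and without actually terminating the JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

public class UberStatusUpdateSketch {

    /** Hypothetical stand-in for the umbilical: reports whether the AM still knows the task. */
    interface Umbilical {
        boolean taskFound(String taskId);
    }

    /**
     * Sends one status update.
     * Returns true if the AM accepted it, false if the task was unknown.
     * The exit callback is invoked only when the task is unknown AND we are
     * not uberized; in real code this would be a JVM exit.
     */
    static boolean statusUpdate(Umbilical umbilical, String taskId,
                                boolean uberized, IntConsumer exit) {
        if (umbilical.taskFound(taskId)) {
            return true;
        }
        if (uberized) {
            // The task shares the AM's JVM: exiting here would kill the AM.
            System.err.println("Task no longer available: " + taskId);
            return false;
        }
        // Separate task JVM: an unknown task means the parent is gone.
        System.err.println("Parent died.  Exiting " + taskId);
        exit.accept(66);
        return false;
    }

    public static void main(String[] args) {
        List<Integer> exitCodes = new ArrayList<>();
        Umbilical unknownTask = id -> false;

        boolean delivered = statusUpdate(unknownTask, "task-a", true,
                code -> exitCodes.add(code));
        System.out.println("uber: delivered=" + delivered
                + ", exit requested=" + !exitCodes.isEmpty());

        statusUpdate(unknownTask, "task-a", false, code -> exitCodes.add(code));
        System.out.println("non-uber: exit codes=" + exitCodes);
    }
}
```

      Passing the exit action in as a callback is only a testing convenience of this sketch; the point is that the uberized branch logs and returns instead of exiting.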

      Attachments

        1. MAPREDUCE-7048-001.patch (1 kB, Peter Bacsko)
        2. MAPREDUCE-7048-002.patch (5 kB, Peter Bacsko)
        3. MAPREDUCE-7048-003.patch (5 kB, Peter Bacsko)
        4. MAPREDUCE-7048-branch-2.01.patch (5 kB, Peter Bacsko)
        5. MAPREDUCE-7048-branch-2.8.01.patch (5 kB, Peter Bacsko)
        6. MAPREDUCE-7048-branch-2.7.01.patch (5 kB, Peter Bacsko)
        7. MAPREDUCE-7048-branch-2.9.01.patch (5 kB, Peter Bacsko)
        8. MAPREDUCE-7048-branch-2.7.01.patch (5 kB, Jason Darrell Lowe)

          People

            Assignee: Peter Bacsko (pbacsko)
            Reporter: Peter Bacsko (pbacsko)