Hadoop Map/Reduce / MAPREDUCE-1397

NullPointerException observed during task failures

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: tasktracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Fixed a race condition involving JvmRunner.kill() and KillTaskAction, which was leading to a NullPointerException that caused a transient inconsistent state in JvmManager and the failure of tasks.

      Description

      In an environment where many jobs are killed simultaneously, NPEs are observed in the TT/JT logs when a task fails. The situation is aggravated when the taskcontroller.cfg is not configured properly. Below is the exception obtained:

      INFO org.apache.hadoop.mapred.TaskInProgress: Error from <attempt_ID>:
      java.lang.Throwable: Child Error
              at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:529)
      Caused by: java.lang.NullPointerException
              at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.getDetails(JvmManager.java:329)
              at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.reapJvm(JvmManager.java:315)
              at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.access$000(JvmManager.java:146)
              at org.apache.hadoop.mapred.JvmManager.launchJvm(JvmManager.java:109)
              at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:502)
      
       
      1. patch-1397-3.txt
        13 kB
        Amareshwari Sriramadasu
      2. patch-1397-ydist.txt
        13 kB
        Amareshwari Sriramadasu
      3. ASF.LICENSE.NOT.GRANTED--patch-1397-2.txt
        11 kB
        Amareshwari Sriramadasu
      4. patch-1397-1.txt
        11 kB
        Amareshwari Sriramadasu
      5. patch-1397.txt
        10 kB
        Amareshwari Sriramadasu

          Activity

          Amareshwari Sriramadasu added a comment -

          After looking at the TaskTracker logs, we found the problem is as follows:
          One of the task attempts failed to launch its JVM. The finally block of JvmRunner.runChild() calls kill(), which calls terminateTask(), which also fails. kill() then sleeps for the configured duration (5 seconds by default), calls killTask(), and only then removes the jvmId mapping from the jvmIdToRunner map.
          Meanwhile, a KillTaskAction for the same attempt arrives from the TaskTracker. This call removes the jvmId mapping from jvmToRunningTask. It then sees that JvmRunner.kill() has already been called, so it goes ahead and releases the slot.
          Because there are now free slots, the TaskTracker tries to launch a task and finds the JvmManager in an inconsistent state, since the JVM has not yet been removed from the jvmIdToRunner map. When it looks up the details through getDetails(), it gets a NullPointerException because jvmToRunningTask no longer has an entry for that JVM.

          I think JvmRunner.kill() should not call back into JvmManager to remove the jvmId mapping from the jvmIdToRunner map. The removal should be done by the callers of kill(), i.e. killJvm(), stop() and reapJvm(). JvmRunner.runChild() already does this from UpdateOnJvmExit(), the next method call after kill().
          Thoughts?
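
The interleaving described in this comment can be modelled with a minimal, single-threaded sketch. It is not the Hadoop code: the class `JvmMapRaceSketch` and its methods `launch`, `killTaskAction`, `runnerKillFinished` and `reap` are hypothetical stand-ins, and only the two map names and `getDetails()` are taken from the comment. It assumes that the reap pass walks every entry of jvmIdToRunner and looks each one up in jvmToRunningTask.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, single-threaded model of the interleaving described above.
// The class and its methods are illustrative stand-ins, not Hadoop's JvmManager.
class JvmMapRaceSketch {
    // jvmId -> task attempt currently assigned to that JVM
    private final Map<String, String> jvmToRunningTask = new HashMap<>();
    // jvmId -> runner managing that JVM (a String here for brevity)
    private final Map<String, String> jvmIdToRunner = new HashMap<>();

    void launch(String jvmId, String attemptId) {
        jvmToRunningTask.put(jvmId, attemptId);
        jvmIdToRunner.put(jvmId, "runner-for-" + jvmId);
    }

    // Step taken by the kill-task path: only the task mapping is removed.
    void killTaskAction(String jvmId) {
        jvmToRunningTask.remove(jvmId);
    }

    // Step taken late by the runner's kill(): the runner mapping is removed.
    void runnerKillFinished(String jvmId) {
        jvmIdToRunner.remove(jvmId);
    }

    // Assumes every known runner still has a task; fails when it does not.
    String getDetails(String jvmId) {
        String attemptId = jvmToRunningTask.get(jvmId);
        return jvmId + " runs " + attemptId.toUpperCase(); // NPE if attemptId is null
    }

    // Walks all known runners, as a reap pass would before reusing a slot.
    void reap() {
        for (String jvmId : jvmIdToRunner.keySet()) {
            getDetails(jvmId);
        }
    }

    // Returns true if the bad interleaving produces a NullPointerException.
    static boolean reproduce() {
        JvmMapRaceSketch m = new JvmMapRaceSketch();
        m.launch("jvm_1", "attempt_1");
        m.killTaskAction("jvm_1");      // task mapping gone, slot released
        try {
            m.reap();                   // runner mapping is still present
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // Returns true if reaping succeeds once both mappings are removed.
    static boolean consistentAfterBothRemovals() {
        JvmMapRaceSketch m = new JvmMapRaceSketch();
        m.launch("jvm_1", "attempt_1");
        m.killTaskAction("jvm_1");
        m.runnerKillFinished("jvm_1");
        m.reap();                       // nothing left to inspect
        return true;
    }

    public static void main(String[] args) {
        System.out.println("NPE in inconsistent state: " + reproduce());
        System.out.println("Clean after both removals: " + consistentAfterBothRemovals());
    }
}
```

The point of the sketch is only that the two maps are updated by different code paths at different times, so any reader that joins them in between can see a runner without a task.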

          Amareshwari Sriramadasu added a comment -

          One of the task attempts failed to launch jvm. Finally block of JvmRunner.runChild() calls kill(), which calls terminateTask() which also fails.

          The failures to launch the JVM and in terminateTask() were caused by a misconfigured taskcontroller.cfg.

          Amareshwari Sriramadasu added a comment -

          Patch fixing the bug.
          Added a testcase which fails with NPE without the patch, and passes after the patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12438177/patch-1397.txt
          against trunk revision 920250.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/24/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/24/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/24/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/24/console

          This message is automatically generated.

          Amareshwari Sriramadasu added a comment -

          Patch fixing the findbugs warnings.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12438253/patch-1397-1.txt
          against trunk revision 920250.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/25/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/25/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/25/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/25/console

          This message is automatically generated.

          Vinod Kumar Vavilapalli added a comment -

          The patch still has other problems. With the attached patch the JvmManager state becomes consistent, but while JvmRunner.kill() is in progress, other threads trying to kill the same JVM will return immediately, which may cause other inconsistencies. Fundamentally, I think all threads trying to kill the same JVM should block until the kill finishes.

          Argh.. with this bug in view, I see so many design problems in JvmManager. Time for some refactoring!

          Amareshwari Sriramadasu added a comment -

          The patch makes the method JvmRunner.kill() synchronized, so that if a kill is in progress, the caller waits until it completes. Making JvmRunner.kill() synchronized is fine because all the external callers are already synchronized on JvmManager. With the patch, the call is now synchronized on JvmRunner as well.
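
The blocking behaviour described in this comment can be illustrated with a small sketch. It is a hypothetical model, not the patched Hadoop class: `SynchronizedKillSketch`, its fields, and the 50 ms sleep (standing in for the delay between terminating and killing a task) are assumptions made for the example.

```java
// Illustrative model of a synchronized kill(); not Hadoop's JvmRunner.
class SynchronizedKillSketch {
    private boolean killed = false;
    private int cleanups = 0;

    // Only one caller at a time may run the kill sequence. A second caller
    // blocks on the monitor until the first has finished, then sees
    // killed == true and returns without repeating the cleanup.
    synchronized void kill() {
        if (killed) {
            return;
        }
        try {
            Thread.sleep(50); // stands in for the wait between terminate and kill
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        cleanups++;
        killed = true;
    }

    synchronized boolean isKilled() {
        return killed;
    }

    synchronized int cleanupCount() {
        return cleanups;
    }

    // Runs two concurrent callers and reports whether both returned only
    // after the kill had completed, with the cleanup performed exactly once.
    static boolean demo() {
        SynchronizedKillSketch runner = new SynchronizedKillSketch();
        boolean[] sawFinished = new boolean[2];
        Thread[] callers = new Thread[2];
        for (int i = 0; i < callers.length; i++) {
            final int idx = i;
            callers[i] = new Thread(() -> {
                runner.kill();
                sawFinished[idx] = runner.isKilled();
            });
            callers[i].start();
        }
        try {
            for (Thread t : callers) {
                t.join(); // join also makes the writes to sawFinished visible
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return sawFinished[0] && sawFinished[1] && runner.cleanupCount() == 1;
    }

    public static void main(String[] args) {
        System.out.println("Both callers waited for the kill: " + demo());
    }
}
```

In this model the monitor on the runner object is what makes a late caller wait, rather than return early on an "already called" flag while the first kill is still in progress.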

          Amareshwari Sriramadasu added a comment -

          For some reason Hudson could not post the test result as a comment. Adding the test result output from the console:

          [exec]
          [exec]
          [exec] -1 overall. Here are the results of testing the latest attachment
          [exec] http://issues.apache.org/jira/secure/attachment/12441605/patch-1397-2.txt
          [exec] against trunk revision 933441.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]
          [exec] +1 core tests. The patch passed core unit tests.
          [exec]
          [exec] -1 contrib tests. The patch failed contrib unit tests.
          [exec]
          [exec] Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/101/testReport/
          [exec] Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/101/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          [exec] Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/101/artifact/trunk/build/test/checkstyle-errors.html
          [exec] Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/101/console

          Amareshwari Sriramadasu added a comment -

          -1 contrib tests.

          Test report does not show any failures. All contrib tests passed on my machine.

          Amareshwari Sriramadasu added a comment -

          Patch for the Yahoo! distribution, with a couple of bug fixes in the test case.

          Amareshwari Sriramadasu added a comment -

          Patch for trunk with changes in testcase.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12442658/patch-1397-3.txt
          against trunk revision 936166.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/130/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/130/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/130/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/130/console

          This message is automatically generated.

          Amareshwari Sriramadasu added a comment -

          Test failures seem unrelated, resubmitting for hudson.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12442658/patch-1397-3.txt
          against trunk revision 938023.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/140/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/140/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/140/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/140/console

          This message is automatically generated.

          Amareshwari Sriramadasu added a comment -

          -1 core tests.

          Test failure is because of MAPREDUCE-1727.

          Vinod Kumar Vavilapalli added a comment -

          Patch looks good. I reviewed the patch for ydist myself, and the fact that the same patch applies cleanly on trunk helps too!

          But as I pointed out before, the interactions of JvmManager with TaskRunner, TaskTracker and TaskController are really nasty and unmaintainable. I will open a JIRA for refactoring so as to align these interfaces better.

          We can't wait for the refactor, so I am going to commit this patch for now.

          Vinod Kumar Vavilapalli added a comment -

          I committed this. Thanks Amareshwar!


            People

            • Assignee:
              Amareshwari Sriramadasu
              Reporter:
              Ramya Sunil
            • Votes:
              0
              Watchers:
              3
