Hadoop Common
  1. Hadoop Common
  2. HADOOP-5374

NPE in JobTracker.getTasksToSave() method

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.19.2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    1. patch-5374-1.txt
      1 kB
      Amareshwari Sriramadasu
    2. patch-5374-2.txt
      1 kB
      Amareshwari Sriramadasu

      Activity

      Vinod Kumar Vavilapalli created issue -
      Hide
      Vinod Kumar Vavilapalli added a comment -

      This affects heartbeats. I see a lot of the following in the JT's logs.

      2009-02-28 08:12:10,332 INFO org.apache.hadoop.ipc.Server: IPC Server handler 34 on 52000, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@4b68032e, false, false, true, 5832) f
      rom <ip:port>: error: java.io.IOException: java.lang.NullPointerException
      java.io.IOException: java.lang.NullPointerException
              at org.apache.hadoop.mapred.TaskInProgress.shouldCommit(TaskInProgress.java:420)
              at org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2585)
              at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2334)
              at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:396)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
      

      An offline discussion with Amareshwari concluded that in getTasksToSave() method, the task's state on the JT should also be checked along with the state as reported by the TT.

      Show
      Vinod Kumar Vavilapalli added a comment - This affects heartbeats. I see a lot of the following in the JT's logs. 2009-02-28 08:12:10,332 INFO org.apache.hadoop.ipc.Server: IPC Server handler 34 on 52000, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@4b68032e, false , false , true , 5832) f rom <ip:port>: error: java.io.IOException: java.lang.NullPointerException java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.mapred.TaskInProgress.shouldCommit(TaskInProgress.java:420) at org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2585) at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2334) at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) An offline discussion with Amareshwari concluded that in getTasksToSave() method, the task's state on the JT should also be checked along with the state as reported by the TT.
      Hide
      Vinod Kumar Vavilapalli added a comment -

      Here's the exact scenario:

      • A TT gets a task but fails to report back immediately, making the JT mark the task as FAILED.
      • The TT comes back with the task in COMMIT_PENDING state.
      • JT doesn't update the task's state on itself as task cannot go from FAILED to COMMIT_PENDING state. (See state checks in TIP.updateStatus()). Thus TIP.doCommit() is not called.
      • In getTasksToSave(), JT just checks the task's status on TT and tries to save the task, thus hitting NPE in TIP.shouldCommit().

      Effect : The TT will keep getting NPE in heartbeat and become useless. The job will proceed as a duplicate task will be launched for this TIP.

      Show
      Vinod Kumar Vavilapalli added a comment - Here's the exact scenario: A TT gets a task but fails to report back immediately, making the JT mark the task as FAILED. The TT comes back with the task in COMMIT_PENDING state. JT doesn't update the task's state on itself as task cannot go from FAILED to COMMIT_PENDING state. (See state checks in TIP.updateStatus()). Thus TIP.doCommit() is not called. In getTasksToSave(), JT just checks the task's status on TT and tries to save the task, thus hitting NPE in TIP.shouldCommit(). Effect : The TT will keep getting NPE in heartbeat and become useless. The job will proceed as a duplicate task will be launched for this TIP.
      Amareshwari Sriramadasu made changes -
      Field Original Value New Value
      Assignee Amareshwari Sriramadasu [ amareshwari ]
      Hide
      Amareshwari Sriramadasu added a comment -

      There are three threads in JobTracker updating the TaskStatuses : ExpireTrackers, ExpireLaunchingTasks and heartbeat threads.
      1. If heartbeat runs first, there is no issue with ExpireTrackers and ExpireLaunchingTasks threads, since the entry will removed from them.
      2. If ExpireTrackers runs first, there is no issue with heartbeat(The tracker will be re-inited) and ExpireLaunchingTasks(he task-entry will removed from the thread).
      3. If ExpireLaunchingTasks runs first,
      a) If ExpireTrackers runs second, the task attempt will be Killed again. The task will have FAILED, followed by KILLED update.
      b) If heartbeat runs second, If the TaskStatus is UNASSIGNED, RUNNING, COMMIT_PENDING or SUCCEDED, the task will be added to taskToKill map and the update is ignored. So, the tasks will be sent KillTaskAction(fixed by HADOOP-5280). If the state is COMMIT_PENDING, it throws NPE as described in this jira.
      I propose this issue can fix both 3(a) and 3(b).

      Show
      Amareshwari Sriramadasu added a comment - There are three threads in JobTracker updating the TaskStatuses : ExpireTrackers, ExpireLaunchingTasks and heartbeat threads. 1. If heartbeat runs first, there is no issue with ExpireTrackers and ExpireLaunchingTasks threads, since the entry will removed from them. 2. If ExpireTrackers runs first, there is no issue with heartbeat(The tracker will be re-inited) and ExpireLaunchingTasks(he task-entry will removed from the thread). 3. If ExpireLaunchingTasks runs first, a) If ExpireTrackers runs second, the task attempt will be Killed again. The task will have FAILED, followed by KILLED update. b) If heartbeat runs second, If the TaskStatus is UNASSIGNED, RUNNING, COMMIT_PENDING or SUCCEDED, the task will be added to taskToKill map and the update is ignored. So, the tasks will be sent KillTaskAction(fixed by HADOOP-5280 ). If the state is COMMIT_PENDING, it throws NPE as described in this jira. I propose this issue can fix both 3(a) and 3(b).
      Hide
      Amareshwari Sriramadasu added a comment -

      Patch with the fix.

      Show
      Amareshwari Sriramadasu added a comment - Patch with the fix.
      Amareshwari Sriramadasu made changes -
      Attachment patch-5374-1.txt [ 12402354 ]
      Amareshwari Sriramadasu made changes -
      Status Open [ 1 ] Patch Available [ 10002 ]
      Affects Version/s 0.19.0 [ 12313211 ]
      Affects Version/s 0.20.0 [ 12313438 ]
      Fix Version/s 0.19.2 [ 12313650 ]
      Hide
      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12402354/patch-5374-1.txt
      against trunk revision 755057.

      +1 @author. The patch does not contain any @author tags.

      -1 tests included. The patch doesn't appear to include any new or modified tests.
      Please justify why no tests are needed for this patch.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      -1 core tests. The patch failed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/94/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/94/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/94/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/94/console

      This message is automatically generated.

      Show
      Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12402354/patch-5374-1.txt against trunk revision 755057. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/94/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/94/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/94/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/94/console This message is automatically generated.
      Hide
      Amareshwari Sriramadasu added a comment -

      Testfailure is not related to the patch. Raised HADOOP-5517 for the same.

      Show
      Amareshwari Sriramadasu added a comment - Testfailure is not related to the patch. Raised HADOOP-5517 for the same.
      Hide
      Amareshwari Sriramadasu added a comment -

      cancelling patch, current code accepts FAILED_UNCLEAN/KILLED_UNCLEAN states and launches cleanup attempt, though previous state is FAILED. This should not happen, because FAILED->FAILED_UNCLEAN->FAILED updates will fail the task twice, thus doubling the counters and punishing TT twice.
      I would say, It should not accept any state if the previous state is FAILED/KILLED.

      Show
      Amareshwari Sriramadasu added a comment - cancelling patch, current code accepts FAILED_UNCLEAN/KILLED_UNCLEAN states and launches cleanup attempt, though previous state is FAILED. This should not happen, because FAILED->FAILED_UNCLEAN->FAILED updates will fail the task twice, thus doubling the counters and punishing TT twice. I would say, It should not accept any state if the previous state is FAILED/KILLED.
      Amareshwari Sriramadasu made changes -
      Status Patch Available [ 10002 ] Open [ 1 ]
      Hide
      Amareshwari Sriramadasu added a comment -

      Attaching the patch with fix.

      Tested the patch with ExpireLaunchingTasks thread marking the task FAILED, and TT reporting COMMIT_PENDING, FAILED_UNCLEAN, KILLED_UNCLEAN, FAILED, KILLED and SUCCEEDED states for the task.

      Show
      Amareshwari Sriramadasu added a comment - Attaching the patch with fix. Tested the patch with ExpireLaunchingTasks thread marking the task FAILED, and TT reporting COMMIT_PENDING, FAILED_UNCLEAN, KILLED_UNCLEAN, FAILED, KILLED and SUCCEEDED states for the task.
      Amareshwari Sriramadasu made changes -
      Attachment patch-5374-2.txt [ 12402624 ]
      Amareshwari Sriramadasu made changes -
      Status Open [ 1 ] Patch Available [ 10002 ]
      Hide
      Amareshwari Sriramadasu added a comment -

      test-patch result:

           [exec]
           [exec] -1 overall.
           [exec]
           [exec]     +1 @author.  The patch does not contain any @author tags.
           [exec]
           [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
           [exec]                         Please justify why no tests are needed for this patch.
           [exec]
           [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
           [exec]
           [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
           [exec]
           [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
           [exec]
           [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
           [exec]
           [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
           [exec]
      

      It is difficult to write testcase for this corner case.
      All unit tests passed on my machine. Also Ran MRReliability with the patch.

      Show
      Amareshwari Sriramadasu added a comment - test-patch result: [exec] [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appear to include any new or modified tests. [exec] Please justify why no tests are needed for this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] It is difficult to write testcase for this corner case. All unit tests passed on my machine. Also Ran MRReliability with the patch.
      Hide
      Vinod Kumar Vavilapalli added a comment -

      Looked at the patch. Looks good to me. +1

      Show
      Vinod Kumar Vavilapalli added a comment - Looked at the patch. Looks good to me. +1
      Hide
      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12402624/patch-5374-2.txt
      against trunk revision 756752.

      +1 @author. The patch does not contain any @author tags.

      -1 tests included. The patch doesn't appear to include any new or modified tests.
      Please justify why no tests are needed for this patch.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      +1 core tests. The patch passed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/115/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/115/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/115/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/115/console

      This message is automatically generated.

      Show
      Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12402624/patch-5374-2.txt against trunk revision 756752. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/115/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/115/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/115/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/115/console This message is automatically generated.
      Hide
      Devaraj Das added a comment -

      I just committed this. Thanks, Amareshwari!

      Show
      Devaraj Das added a comment - I just committed this. Thanks, Amareshwari!
      Devaraj Das made changes -
      Status Patch Available [ 10002 ] Resolved [ 5 ]
      Hadoop Flags [Reviewed]
      Resolution Fixed [ 1 ]
      Hide
      Hudson added a comment -

      Integrated in Hadoop-trunk #790 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/790/)
      . Fixes a NPE problem in getTasksToSave method. Contributed by Amareshwari Sriramadasu.

      Show
      Hudson added a comment - Integrated in Hadoop-trunk #790 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/790/ ) . Fixes a NPE problem in getTasksToSave method. Contributed by Amareshwari Sriramadasu.
      Owen O'Malley made changes -
      Component/s mapred [ 12310690 ]
      Tom White made changes -
      Status Resolved [ 5 ] Closed [ 6 ]

        People

        • Assignee:
          Amareshwari Sriramadasu
          Reporter:
          Vinod Kumar Vavilapalli
        • Votes:
          0 Vote for this issue
          Watchers:
          1 Start watching this issue

          Dates

          • Created:
            Updated:
            Resolved:

            Development