Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1707

TaskRunner can get NPE in getting ugi from TaskTracker

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: tasktracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Fixed a bug that causes TaskRunner to get NPE in getting ugi from TaskTracker and subsequently crashes it resulting in a failing task after task-timeout period.

      Description

      The following code in TaskRunner can get NPE in the scenario described below.

            UserGroupInformation ugi = 
              tracker.getRunningJob(t.getJobID()).getUGI();
      

      The scenario:
      Tracker got a LaunchTaskAction; Task is localized and TaskRunner is started.
      Then Tracker got a KillJobAction; This would issue a kill for the task. But, kill will be a no-op because the task did not actually start; The job is removed from runningJobs.
      Then if TaskRunner calls tracker.getRunningJob(t.getJobID()), it will be null.

      Instead of TaskRunner doing a back call to tasktracker to get the ugi, tracker.getRunningJob(t.getJobID()).getUGI(), ugi should be passed a parameter in the constructor of TaskRunner.

      1. MAPREDUCE-1707-20100504-ydist.txt
        3 kB
        Vinod Kumar Vavilapalli
      2. MAPREDUCE-1707-20100430.txt
        3 kB
        Vinod Kumar Vavilapalli
      3. MAPREDUCE-1707-20100429.txt
        2 kB
        Vinod Kumar Vavilapalli

        Issue Links

          Activity

          Hide
          Vinod Kumar Vavilapalli added a comment -

          Instead of TaskRunner doing a back call to tasktracker to get the ugi, tracker.getRunningJob(t.getJobID()).getUGI(), ugi should be passed a parameter in the constructor of TaskRunner.

          Passing UGI all the way down to TaskRunner both looked wierd and resulted in ugly code changes. Instead, I am making the TaskRunner simply check if the the returned RunningJob null, i.e if the job is killed already, and skip the localization of dist-cache files.

          Long term, we need to cleanup the whole mess with the TaskTracker, TaskRunner and JvmManger interactions. The code is simply-not-maintainable in this form.

          Show
          Vinod Kumar Vavilapalli added a comment - Instead of TaskRunner doing a back call to tasktracker to get the ugi, tracker.getRunningJob(t.getJobID()).getUGI(), ugi should be passed a parameter in the constructor of TaskRunner. Passing UGI all the way down to TaskRunner both looked wierd and resulted in ugly code changes. Instead, I am making the TaskRunner simply check if the the returned RunningJob null, i.e if the job is killed already, and skip the localization of dist-cache files. Long term, we need to cleanup the whole mess with the TaskTracker, TaskRunner and JvmManger interactions. The code is simply-not-maintainable in this form.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Trunk patch for this issue and MAPREDUCE-1703.

          A race condition really. Cannot write useful test-cases without aggressive refactoring of TaskTracker/TaskRunner.

          Amareshwari, can you please look at this?

          Show
          Vinod Kumar Vavilapalli added a comment - Trunk patch for this issue and MAPREDUCE-1703 . A race condition really. Cannot write useful test-cases without aggressive refactoring of TaskTracker/TaskRunner. Amareshwari, can you please look at this?
          Hide
          Amareshwari Sriramadasu added a comment -

          Patch skips localizing distributed cache and continues further, if getRunningJob() returns null. I think it should not any more processing. It should just return from there.

          Passing UGI all the way down to TaskRunner both looked wierd and resulted in ugly code changes.

          I still feel TaskRunner should not do a back call to TaskTracker. Can we pass UGI as part of Task object to make things simpler?

          Show
          Amareshwari Sriramadasu added a comment - Patch skips localizing distributed cache and continues further, if getRunningJob() returns null. I think it should not any more processing. It should just return from there. Passing UGI all the way down to TaskRunner both looked wierd and resulted in ugly code changes. I still feel TaskRunner should not do a back call to TaskTracker. Can we pass UGI as part of Task object to make things simpler?
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Can we pass UGI as part of Task object to make things simpler?

          I give in, TaskTracker setin in TaskInProgress just before the TaskRunner is going to be launched. By this time, job localization happens already and so tip.getUGI() can't be null at all.

          Show
          Vinod Kumar Vavilapalli added a comment - Can we pass UGI as part of Task object to make things simpler? I give in, TaskTracker setin in TaskInProgress just before the TaskRunner is going to be launched. By this time, job localization happens already and so tip.getUGI() can't be null at all.
          Hide
          Amareshwari Sriramadasu added a comment -

          Changes look fine to me. Submitting for hudson.

          Show
          Amareshwari Sriramadasu added a comment - Changes look fine to me. Submitting for hudson.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12443273/MAPREDUCE-1707-20100430.txt
          against trunk revision 940364.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/161/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/161/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/161/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/161/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12443273/MAPREDUCE-1707-20100430.txt against trunk revision 940364. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/161/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/161/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/161/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/161/console This message is automatically generated.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          As noted before, the patch adds no new testcases as it fixes a rare race condition and cannot write useful test-cases without aggressive refactoring of TaskTracker/TaskRunner.

          The testcase failure for TestJobOutputCommitter should be because of MAPREDUCE-1275. It passed on my machine.

          Am going to commit this to trunk.

          Show
          Vinod Kumar Vavilapalli added a comment - As noted before, the patch adds no new testcases as it fixes a rare race condition and cannot write useful test-cases without aggressive refactoring of TaskTracker/TaskRunner. The testcase failure for TestJobOutputCommitter should be because of MAPREDUCE-1275 . It passed on my machine. Am going to commit this to trunk.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          I just committed this to trunk.

          Show
          Vinod Kumar Vavilapalli added a comment - I just committed this to trunk.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Patch for yahoo dist security branch over 20. Not for commit here.

          Show
          Vinod Kumar Vavilapalli added a comment - Patch for yahoo dist security branch over 20. Not for commit here.

            People

            • Assignee:
              Vinod Kumar Vavilapalli
              Reporter:
              Amareshwari Sriramadasu
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development