Hadoop Map/Reduce / MAPREDUCE-246

Job recovery should fail or kill a job that fails ACL checks upon restart, if the job was running previously

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Consider a scenario where a job is submitted to the M/R system and runs for a while. Suppose the JT is then restarted, and before the restart the ACLs for the user are changed so that the user can no longer submit jobs to that queue. Since the job could still be using resources allotted to that queue and be accounted against it, this might lead to accounting inconsistencies. A suggestion is for the jobtracker to fail or kill this job.
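
The suggestion can be illustrated with a small Python model. This is only a sketch under assumed names: `Job`, `QueueAcls`, `recover_jobs`, and the `KILLED` state are hypothetical and are not the JobTracker's actual recovery code or API.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    user: str
    queue: str
    state: str = "RUNNING"


@dataclass
class QueueAcls:
    # Hypothetical model: queue name -> set of users allowed to submit to it.
    submitters: dict = field(default_factory=dict)

    def can_submit(self, user, queue):
        return user in self.submitters.get(queue, set())


def recover_jobs(jobs, acls):
    """Re-check the submit ACL of each recovered job; kill jobs that no longer pass."""
    for job in jobs:
        if not acls.can_submit(job.user, job.queue):
            job.state = "KILLED"
    return jobs


# ACLs as they stand after the restart: alice has been removed from "prod".
acls_after_restart = QueueAcls({"prod": {"bob"}, "dev": {"alice", "bob"}})
recovered = recover_jobs(
    [Job("job_1", "alice", "prod"), Job("job_2", "bob", "prod")],
    acls_after_restart,
)
```

In this model, alice's job in `prod` is killed on recovery while bob's job keeps running, so the queue's accounting only covers users who are still permitted to use it.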

      1. HADOOP-5460-v1.5.patch (7 kB, Amar Kamat)
      2. HADOOP-5460-v1.4.patch (6 kB, Amar Kamat)
      3. HADOOP-5460-v1.0.patch (2 kB, Amar Kamat)
      4. HADOOP-5460-v1.0.patch (6 kB, Amar Kamat)

          Activity

          Nigel Daley added a comment -

          Clearly there are a number of important test cases here that need consideration:

          Upon JT restart, these changes are made to ACLs and queues:
          1) user removed from all queues where her jobs are running
          2) user removed from one queue where her jobs are running
          3) user moved to a different queue
          4) queue renamed
          5) queue removed
          6) queue maxRunningJobs is smaller than number of currently running jobs
          ...
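
Some of these cases can be expressed as a check of each recovered job against the post-restart queue configuration. The sketch below is a hypothetical model: the `QueueConfig` type, the `classify` helper, and the outcome labels are assumptions made for illustration. It covers only cases 1, 2, 5 and 6; in this model a renamed queue (case 4) is indistinguishable from a removed one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueConfig:
    users: frozenset        # users allowed to submit to the queue
    max_running_jobs: int   # stands in for maxRunningJobs in case 6


def classify(jobs, queues):
    """Label each (job_id, user, queue) tuple against the post-restart queues."""
    running = {}   # queue name -> recovered jobs admitted so far
    outcome = {}
    for job_id, user, queue in jobs:
        config = queues.get(queue)
        if config is None:
            outcome[job_id] = "queue-missing"
        elif user not in config.users:
            outcome[job_id] = "acl-denied"
        elif running.get(queue, 0) >= config.max_running_jobs:
            outcome[job_id] = "over-limit"
        else:
            running[queue] = running.get(queue, 0) + 1
            outcome[job_id] = "ok"
    return outcome


# After the restart only q1 exists, only bob may use it, and it allows one running job.
queues_after_restart = {"q1": QueueConfig(frozenset({"bob"}), max_running_jobs=1)}
jobs = [
    ("job_1", "alice", "q1"),
    ("job_2", "bob", "q1"),
    ("job_3", "bob", "q1"),
    ("job_4", "bob", "q2"),
]
result = classify(jobs, queues_after_restart)
```

Each label marks a point where the recovery code has to choose a policy (fail, kill, or let the job continue), which is what the test cases above would need to pin down.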

          Amar Kamat added a comment -

          Attaching a quick fix for this issue. TestRecoveryManager is modified to take care of this.

          Amar Kamat added a comment -

          Jobs that fail the ACL check because ACLs changed across a restart cannot easily be distinguished from jobs that failed the ACL check on the previous jobtracker but were not cleaned up. Staging will make this easier: jobs that were accepted will be moved to a different directory and can therefore be accepted blindly upon restart.

          Amar Kamat added a comment -

          I think a job should not be failed or killed because ACLs changed across a restart. Ideally, whatever jobs are recovered should be allowed to continue. Attaching a patch that allows a job to continue even if ACLs change across a restart. Also, jobs that fail in recovery are added to the system and then failed. Testing in progress.
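
The behaviour described in this comment can be modelled roughly as follows. This is a sketch only: `Tracker`, `RecoveryError`, `restore`, and the state names are hypothetical stand-ins, not the classes or methods touched by the attached patch.

```python
class RecoveryError(Exception):
    """Raised by the hypothetical restore step when a job cannot be recovered."""


class Tracker:
    def __init__(self):
        self.jobs = {}  # job_id -> state; every job the tracker knows about

    def recover(self, job_id, restore):
        # Register the job first so it stays visible even if recovery fails.
        self.jobs[job_id] = "PREP"
        try:
            restore(job_id)
        except RecoveryError:
            # The job was added to the system and is then marked as failed.
            self.jobs[job_id] = "FAILED"
        else:
            # No ACL check here: a recovered job simply continues.
            self.jobs[job_id] = "RUNNING"
        return self.jobs[job_id]


def restore(job_id):
    # Stand-in for reading a job's persisted state; job_2 is treated as unreadable.
    if job_id == "job_2":
        raise RecoveryError(job_id)


tracker = Tracker()
for job_id in ("job_1", "job_2"):
    tracker.recover(job_id, restore)
```

Registering the job before attempting to restore it means a failed recovery leaves a visible `FAILED` entry instead of a job that silently disappears.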

          Amar Kamat added a comment -

          Attaching a new patch that fixes a bug in the previous patch. Result of test-patch:

          [exec] +1 overall.
          [exec]
          [exec]     +1 @author.  The patch does not contain any @author tags.
          [exec]
          [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
          [exec]
          [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
          [exec]
          [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          [exec]
          [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

          Running ant test now.

          Amar Kamat added a comment -

          Attaching a new patch. This patch makes sure that if a job gets created, then it is known to the jobtracker. A failure in the recovery manager will result in job failure. Result of test-patch:

          [exec] +1 overall.
          [exec]
          [exec]     +1 @author.  The patch does not contain any @author tags.
          [exec]
          [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
          [exec]
          [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
          [exec]
          [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          [exec]
          [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

          Running ant test now.

          Allen Wittenauer added a comment -

          I'm going to close this as stale.


            People

            • Assignee: Unassigned
            • Reporter: Hemanth Yamijala
            • Votes: 0
            • Watchers: 1
