Hadoop Common
  1. Hadoop Common
  2. HADOOP-5327

Job files for a job failing because of ACLs are not clean from the system directory

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.20.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Jobs which failed because of ACLs gets added during JT restart recovery

      1. HADOOP-5327-v2.3.patch
        8 kB
        Amar Kamat
      2. HADOOP-5327-v2.6.patch
        10 kB
        Amar Kamat
      3. HADOOP-5327-v2.8.patch
        11 kB
        Amar Kamat

        Activity

        Hide
        Hudson added a comment -
        Show
        Hudson added a comment - Integrated in Hadoop-trunk #778 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/ )
        Hide
        Hemanth Yamijala added a comment -

        I just committed this to branch 0.20 and trunk. Thanks, Amar !

        Show
        Hemanth Yamijala added a comment - I just committed this to branch 0.20 and trunk. Thanks, Amar !
        Hide
        Amar Kamat added a comment -

        TestCLI failed on my box but also on trunk. Rest all tests passed.

        Show
        Amar Kamat added a comment - TestCLI failed on my box but also on trunk. Rest all tests passed.
        Hide
        Amar Kamat added a comment -

        Manually tested this patch on single node cluster

        1. invalid job submitted : system-dir cleaned up
        2. acls changed upon restart : files are retained but job is not accepted
        Show
        Amar Kamat added a comment - Manually tested this patch on single node cluster invalid job submitted : system-dir cleaned up acls changed upon restart : files are retained but job is not accepted
        Hide
        Hemanth Yamijala added a comment -

        Code changes look fine. +1.

        Show
        Hemanth Yamijala added a comment - Code changes look fine. +1.
        Hide
        Amar Kamat added a comment -

        Patch applies to 0.20 branch and relevant tests pass.

        Show
        Amar Kamat added a comment - Patch applies to 0.20 branch and relevant tests pass.
        Hide
        Amar Kamat added a comment -

        Attaching a patch incorporating Hemanth's comments. Result of test-patch

        [exec] +1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
             [exec] 
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
        
        Show
        Amar Kamat added a comment - Attaching a patch incorporating Hemanth's comments. Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        Hide
        Hemanth Yamijala added a comment -

        Hi, looking good. Few comments:

        • I think we should not do the cleanup from the system directory in the job tracker restart case. Like I wrote earlier on, I think this should follow the same path as a job being killed. In that case, the system directory may be cleaned up as part of the job completion, and may be needed until then.
        • Can we rename getJobDirectory to getSystemJobDir() or something similar ?
        • I think we need a test case in TestQueueManager which will check that the system directory is being cleaned up upon unsuccessful job submission due to ACLs. The check for the system directory being deleted in TestRecoveryManager will not be needed.
        Show
        Hemanth Yamijala added a comment - Hi, looking good. Few comments: I think we should not do the cleanup from the system directory in the job tracker restart case. Like I wrote earlier on, I think this should follow the same path as a job being killed. In that case, the system directory may be cleaned up as part of the job completion, and may be needed until then. Can we rename getJobDirectory to getSystemJobDir() or something similar ? I think we need a test case in TestQueueManager which will check that the system directory is being cleaned up upon unsuccessful job submission due to ACLs. The check for the system directory being deleted in TestRecoveryManager will not be needed.
        Hide
        Amar Kamat added a comment -

        ant test passed on my box.

        Show
        Amar Kamat added a comment - ant test passed on my box.
        Hide
        Amar Kamat added a comment -

        Attaching a patch incorporating Hemanth's comments. Result of test-patch
        [code}
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        
        
        Show
        Amar Kamat added a comment - Attaching a patch incorporating Hemanth's comments. Result of test-patch [code} [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        Hide
        Hemanth Yamijala added a comment -

        Amar, I suggested we cleanup the files here because:

        • The root cause of the issue is that job files are not deleted upon ACLs check failure.
        • Deleting it then will not even cause this problem in a very large percentage of cases.
        Show
        Hemanth Yamijala added a comment - Amar, I suggested we cleanup the files here because: The root cause of the issue is that job files are not deleted upon ACLs check failure. Deleting it then will not even cause this problem in a very large percentage of cases.
        Hide
        Amar Kamat added a comment -

        Job directory cleanup upon submission failure can be addressed in another jira.

        The job was running in the first place, and after restart it can no longer run because the ACLs were changed.

        HADOOP-4638 and HADOOP-5392 will take care of this. In such cases, the job will not be added and hence a KILL-JOB will be issued to all the trackers.

        Show
        Amar Kamat added a comment - Job directory cleanup upon submission failure can be addressed in another jira. The job was running in the first place, and after restart it can no longer run because the ACLs were changed. HADOOP-4638 and HADOOP-5392 will take care of this. In such cases, the job will not be added and hence a KILL-JOB will be issued to all the trackers.
        Hide
        Hemanth Yamijala added a comment -

        I think this patch should also add the system directory to the clean up thread in the code path where job submission fails due to ACLs. In a majority of the cases, this action alone will prevent the problem from happening in the first place. However, this is only in addition to the changes in the patch as they are still needed to take care of cases where the job tracker could be restarted before the clean up thread has had a chance to delete the system directory completely.

        Regarding cleanup, there seem to be two different cases here:

        • The job was never submitted in the first place
        • The job was running in the first place, and after restart it can no longer run because the ACLs were changed.

        I think the patch is cleanly handling the first case (with the comments incorporated). In the second case, ideally the job should be killed by the JobTracker so that all parts related to the job (system directory, running tasks, cleanup task, etc) are cleaned up properly. I am thinking handling the second case (which ideally should be rare) should be a separate jira. Thoughts ?

        Show
        Hemanth Yamijala added a comment - I think this patch should also add the system directory to the clean up thread in the code path where job submission fails due to ACLs. In a majority of the cases, this action alone will prevent the problem from happening in the first place. However, this is only in addition to the changes in the patch as they are still needed to take care of cases where the job tracker could be restarted before the clean up thread has had a chance to delete the system directory completely. Regarding cleanup, there seem to be two different cases here: The job was never submitted in the first place The job was running in the first place, and after restart it can no longer run because the ACLs were changed. I think the patch is cleanly handling the first case (with the comments incorporated). In the second case, ideally the job should be killed by the JobTracker so that all parts related to the job (system directory, running tasks, cleanup task, etc) are cleaned up properly. I am thinking handling the second case (which ideally should be rare) should be a separate jira. Thoughts ?
        Hide
        Hemanth Yamijala added a comment -

        Few comments:

        • There's an import of a LoginException in JobTracker, which is not required.
        • I feel it would be better to overload checkAccess to have an additional parameter UGI. This is because both modes (with and without UGI) are still valid and hence it would be more convenient and fewer changes to code if we can create the overloaded API.
        • Can you please correct the comment on checkAccess, to remove references to ownerAllowed, as it is no longer used. This is not introduced in this patch, but since you are modifying the comment, it may make sense to correct this as well.
        • In the test cases, why is dfs.permissions set to false ? Is it because the credentials of the new job (with invalid ACLs) is being submitted as a different user and hence there will be problems to write to dfs ? Does this happen, even though you are not using the MiniDFSCluster ?
        • A general question is about whether any cleanup is required for the job which fails ACLs. For instance, should we delete the job folder in the system directory, delete the history files etc ? I am not sure if it is required. Any thoughts ?
        Show
        Hemanth Yamijala added a comment - Few comments: There's an import of a LoginException in JobTracker, which is not required. I feel it would be better to overload checkAccess to have an additional parameter UGI. This is because both modes (with and without UGI) are still valid and hence it would be more convenient and fewer changes to code if we can create the overloaded API. Can you please correct the comment on checkAccess, to remove references to ownerAllowed, as it is no longer used. This is not introduced in this patch, but since you are modifying the comment, it may make sense to correct this as well. In the test cases, why is dfs.permissions set to false ? Is it because the credentials of the new job (with invalid ACLs) is being submitted as a different user and hence there will be problems to write to dfs ? Does this happen, even though you are not using the MiniDFSCluster ? A general question is about whether any cleanup is required for the job which fails ACLs. For instance, should we delete the job folder in the system directory, delete the history files etc ? I am not sure if it is required. Any thoughts ?
        Hide
        Amar Kamat added a comment -

        Attaching a patch that checks the acls in recovery before adding the job. Result of test-patch

        [exec] +1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
             [exec] 
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
        
        Show
        Amar Kamat added a comment - Attaching a patch that checks the acls in recovery before adding the job. Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        Hide
        Nigel Daley added a comment -

        More details on HADOOP-5400 which was closed as a dup of this bug.

        Show
        Nigel Daley added a comment - More details on HADOOP-5400 which was closed as a dup of this bug.

          People

          • Assignee:
            Amar Kamat
            Reporter:
            Karam Singh
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development