Hadoop Map/Reduce / MAPREDUCE-4463

JobTracker recovery fails with HDFS permission issue

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: mrv1
    • Labels: None

      Description

      Recovery fails when the job user is different from the JT owner (i.e. on anything bigger than a pseudo-distributed cluster).

      Attachments

      1. MAPREDUCE-4463.patch (8 kB, Arun C Murthy)
      2. MAPREDUCE-4463.patch (8 kB, Tom White)
      3. MAPREDUCE-4463.patch (7 kB, Tom White)

        Activity

        Tom White added a comment -

        The stack trace is as follows:

        2012-07-18 11:26:05,846 INFO org.apache.hadoop.mapred.JobTracker: Starting the recovery process for 1 jobs ...
        2012-07-18 11:26:05,846 INFO org.apache.hadoop.mapred.JobTracker: Submitting job job_201207181120_0001
        2012-07-18 11:26:05,878 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:wypoon (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=wypoon, access=EXECUTE, inode="/tmp/mapred/system":mapred:supergroup:drwx------
        2012-07-18 11:26:05,879 WARN org.apache.hadoop.mapred.JobTracker: Could not recover job job_201207181120_0001
        org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=wypoon, access=EXECUTE, inode="/tmp/mapred/system":mapred:supergroup:drwx------
                at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
                at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
                at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
                at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
                at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
                at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
                at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:914)
                at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:523)
                at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:820)
                at org.apache.hadoop.mapred.JobTracker$RecoveryManager$1.run(JobTracker.java:1523)
                at org.apache.hadoop.mapred.JobTracker$RecoveryManager$1.run(JobTracker.java:1519)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:396)
                at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
        
                at org.apache.hadoop.mapred.JobTracker$RecoveryManager.recover(JobTracker.java:1519)
                at org.apache.hadoop.mapred.JobTracker.offerService(JobTracker.java:2134)
                at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:4493)
        Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=wypoon, access=EXECUTE, inode="/tmp/mapred/system":mapred:supergroup:drwx------
                at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:203)
                at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:159)
                at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:125)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5253)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkTraverse(FSNamesystem.java:5232)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:2028)
                at org.apache.hadoop.hdfs.server.namenode.NameNode.getFileInfo(NameNode.java:897)
        ...
        

        This problem was found by Wing Yew Poon.

        Tom White added a comment -

        The problem is that the recovery code tries to read the job token as the user who submitted the job rather than as the jobtracker user, yet the MR system directory has 700 permissions, so the submitting user has no read access. The patch fixes this and adds a test case that runs a job as a different user from the jobtracker. I also changed the test to use a mini DFS cluster for all tests (it was using the local filesystem before).
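
        A minimal sketch of the idea (assuming the branch-1 security APIs; the helper method and variable names are illustrative, not the committed patch):

        import java.io.IOException;
        import java.security.PrivilegedExceptionAction;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.security.Credentials;
        import org.apache.hadoop.security.UserGroupInformation;

        // Read the job token file from the MR system directory as the
        // JobTracker's own user (the owner of the 700-permission directory),
        // not as the user who submitted the job.
        Credentials readJobTokens(final Path jobTokenFile, final Configuration conf)
            throws IOException, InterruptedException {
          UserGroupInformation jtUgi = UserGroupInformation.getLoginUser();
          return jtUgi.doAs(new PrivilegedExceptionAction<Credentials>() {
            public Credentials run() throws IOException {
              return Credentials.readTokenStorageFile(jobTokenFile, conf);
            }
          });
        }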

        Mayank Bansal added a comment -

        It's a good catch; I missed it in my original patch.
        Thanks, Tom, for fixing it.

        Looks good to me; just one change I'd suggest:

        In the teardown method, could we also check dfs.isClusterUp()?

        Other than that, the patch looks good.

        +1

        Thanks,
        Mayank

        Tom White added a comment -

        Thanks for reviewing the patch Mayank.

        I don't think checking for dfs.isClusterUp() in the test is needed, and in fact it will return false if the cluster is in safemode, in which case we would still want to call shutdown. Other tests just check that the instance variable is non-null.
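
        For illustration, the null-check pattern looks something like this (field names assumed, not the actual test code):

        import org.junit.After;

        // Fields assumed to be initialized elsewhere in the test:
        // private MiniDFSCluster dfs;
        // private MiniMRCluster mr;

        @After
        public void tearDown() {
          // A null check rather than dfs.isClusterUp(): isClusterUp() returns
          // false in safemode, but we would still want shutdown() to run.
          if (mr != null) {
            mr.shutdown();
          }
          if (dfs != null) {
            dfs.shutdown();
          }
        }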

        I've made a change to make the test more robust and also to guard against a race condition where the tasktracker reinitializes and old tasks that are still running interfere with new TIPs in recovered jobs.

        Mayank Bansal added a comment -

        That's correct, we would want to shut down the cluster even if it's in safe mode.

        And good catch on guarding the TIPs that belong to recovered jobs from being removed.

        +1

        Thanks,
        Mayank

        Alejandro Abdelnur added a comment -

        +1

        Arun C Murthy added a comment -

        Tom, I'm not sure we need the check in the TT. The TT should actually 'kill' all its tasks on a 'reinit', which it should get when the JT reboots.

        Tom White added a comment -

        Arun, I've seen an occasional race where the TT reinit actually causes removeTaskFromJob() to be called just after the equivalent recovered TIP has started, so the new TIP is erroneously removed from rjob.tasks. This code protects from that happening. Does that sound reasonable?
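
        Conceptually the guard amounts to something like the following (a hypothetical sketch; findTip() and the field names are illustrative, not the actual TaskTracker code):

        // Only drop the TIP if it is the same instance the reinit is cleaning
        // up; if a recovered job has already registered a fresh TIP under the
        // same task id, the stale removal must leave it alone.
        synchronized (rjob) {
          TaskInProgress current = rjob.findTip(tip.getTask().getTaskID()); // illustrative lookup
          if (current == tip) {
            rjob.tasks.remove(tip);
          }
        }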

        Arun C Murthy added a comment -

        Tom - sorry, lost track of this. Is this good to go? Tx

        Arun C Murthy added a comment -

        Marking this as a blocker for 1.2. I'll run tests again and commit.

        Arun C Murthy added a comment -

        Tom White This patch seems wildly off for branch-1. Will you have time to rebase this? What revision did you patch this against? Tx.

        Arun C Murthy added a comment -

        Rebased - looks like MAPREDUCE-4859 already introduced the TaskTracker changes.

        Also, I made some simplifying changes to the new test in TestRecoveryManager - this way we don't start multiple MiniMRClusters via the setup and clusterSetup methods in the other tests in the same file.
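
        For illustration, the shared-cluster shape is roughly this (a sketch under assumed branch-1 test APIs; the class and field names are illustrative):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hdfs.MiniDFSCluster;
        import org.apache.hadoop.mapred.MiniMRCluster;
        import org.junit.AfterClass;
        import org.junit.BeforeClass;

        public class TestRecoveryManagerSketch {
          private static MiniDFSCluster dfs;
          private static MiniMRCluster mr;

          // One DFS + MR mini cluster for the whole class, rather than a
          // fresh pair per test method.
          @BeforeClass
          public static void clusterSetup() throws IOException {
            Configuration conf = new Configuration();
            dfs = new MiniDFSCluster(conf, 1, true, null);
            mr = new MiniMRCluster(1, dfs.getFileSystem().getUri().toString(), 1);
          }

          @AfterClass
          public static void clusterTeardown() {
            if (mr != null) mr.shutdown();
            if (dfs != null) dfs.shutdown();
          }
        }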

        Arun C Murthy added a comment -

        Since I haven't changed the meat of the patch, which is straightforward and has already been reviewed, I'll commit this shortly. Thanks.

        Arun C Murthy added a comment -

        I just committed this. Thanks Tom!

        Matt Foley added a comment -

        Closed upon release of Hadoop 1.2.0.


          People

          • Assignee: Tom White
          • Reporter: Tom White
          • Votes: 0
          • Watchers: 6
