Hadoop Map/Reduce: MAPREDUCE-3728

ShuffleHandler can't access results when configured in a secure mode

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.3, 2.0.2-alpha
    • Component/s: mrv2, nodemanager
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      While running the simplest of jobs (Pi) on MR2 in a fully secure configuration, I noticed that the job was failing on the reduce side with the following messages littering the nodemanager logs:

      2012-01-19 08:35:32,544 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error
      org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/rvs/appcache/application_1326928483038_0001/output/attempt_1326928483038_0001_m_000003_0/file.out.index in any of the configured local directories
      

      While digging further I found out that the permissions on the files/dirs were prohibiting the nodemanager (running under the user yarn) from accessing these files:

      $ ls -l /data/3/yarn/usercache/testuser/appcache/application_1327102703969_0001/output/attempt_1327102703969_0001_m_000001_0
      -rw-r----- 1 testuser testuser 28 Jan 20 15:41 file.out
      -rw-r----- 1 testuser testuser 32 Jan 20 15:41 file.out.index
      

      Digging even further revealed that the group-sticky (setgid) bit that was faithfully put on all the subdirectories between testuser and application_1327102703969_0001 was gone from output and attempt_1327102703969_0001_m_000001_0.

      Looking into how these subdirectories are created in org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.initDirs():

            // $x/usercache/$user/appcache/$appId/filecache
            Path appFileCacheDir = new Path(appBase, FILECACHE);
            appsFileCacheDirs[i] = appFileCacheDir.toString();
            lfs.mkdir(appFileCacheDir, null, false);
            // $x/usercache/$user/appcache/$appId/output
            lfs.mkdir(new Path(appBase, OUTPUTDIR), null, false);
      

      reveals that lfs.mkdir ends up manipulating permissions and thus clears the sticky bit from output and filecache.
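
      A minimal sketch of the same effect at the FileContext level (illustrative only; the path and the pre-set setgid bit on the parent directory are assumptions, not part of the original report):

            import org.apache.hadoop.fs.FileContext;
            import org.apache.hadoop.fs.Path;

            public class SetgidMkdirDemo {
              public static void main(String[] args) throws Exception {
                // Assumes /tmp/sticky.dir already exists and was prepared with
                // `chgrp yarn /tmp/sticky.dir && chmod g+s /tmp/sticky.dir`.
                FileContext lfs = FileContext.getLocalFSFileContext();
                Path subdir = new Path("/tmp/sticky.dir/subdir");
                // A null permission does not leave the mode alone: FileContext
                // substitutes a default permission (filtered by the umask) and
                // applies it with an explicit chmod, which on POSIX clears the
                // setgid bit the new directory inherited from its parent.
                lfs.mkdir(subdir, null, false);
                // `ls -ld /tmp/sticky.dir/subdir` now shows no 's' in the group
                // triad, so files created below it get the creator's group.
              }
            }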

      At this point I'm at a loss about how this is supposed to work. My understanding was
      that the whole sequence of events here was predicated on a sticky bit set so
      that daemons running under the user yarn (default group yarn) can have access
      to the resulting files and subdirectories down at output and below. Please let
      me know if I'm missing something or whether this is just a bug that needs to be fixed.

      On a related note, when the shuffle side of the Pi job failed, the job itself didn't.
      It went into an endless loop and only exited when it exhausted all the local storage
      for the log files (at which point the nodemanager died and thus the job ended). Perhaps
      this is an even more serious side effect of this issue that needs to be investigated
      separately.

        Activity

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #310 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/310/)
        svn merge -c 1295245 FIXES: MAPREDUCE-3728. ShuffleHandler can't access results when configured in a secure mode (ahmed via tucu) (Revision 1359724)

        Result = UNSTABLE
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359724
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1006 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1006/)
        MAPREDUCE-3728. ShuffleHandler can't access results when configured in a secure mode (ahmed via tucu) (Revision 1295245)

        Result = SUCCESS
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #971 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/971/)
        MAPREDUCE-3728. ShuffleHandler can't access results when configured in a secure mode (ahmed via tucu) (Revision 1295245)

        Result = SUCCESS
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #1818 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1818/)
        MAPREDUCE-3728. ShuffleHandler can't access results when configured in a secure mode (ahmed via tucu) (Revision 1295245)

        Result = ABORTED
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Commit #620 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/620/)
        Merge -r 1295244:1295245 from trunk to branch. FIXES: MAPREDUCE-3728 (Revision 1295246)

        Result = ABORTED
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295246
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Commit #609 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/609/)
        Merge -r 1295244:1295245 from trunk to branch. FIXES: MAPREDUCE-3728 (Revision 1295246)

        Result = SUCCESS
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295246
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #1886 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1886/)
        MAPREDUCE-3728. ShuffleHandler can't access results when configured in a secure mode (ahmed via tucu) (Revision 1295245)

        Result = SUCCESS
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java
        Hudson added a comment -

        Integrated in Hadoop-Common-0.23-Commit #620 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/620/)
        Merge -r 1295244:1295245 from trunk to branch. FIXES: MAPREDUCE-3728 (Revision 1295246)

        Result = SUCCESS
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295246
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #1811 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1811/)
        MAPREDUCE-3728. ShuffleHandler can't access results when configured in a secure mode (ahmed via tucu) (Revision 1295245)

        Result = SUCCESS
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java
        Alejandro Abdelnur added a comment -

        Thanks Ahmed. Committed to trunk and branch-0.23.

        Alejandro Abdelnur added a comment -

        +1. I cannot think of any side effect of this change. Please open a JIRA against the Local impl of the LoginContext regarding the loss of the gid.

        Ahmed Radwan added a comment -

        As I mentioned earlier, it seems that the problem is mainly in how the output directory is created in ContainerLocalizer.java. The setgid bit is lost in the process. I have even tried setting the permissions explicitly, including the setgid bit, but that does not seem to solve the issue either.

        So I think the current patch, where the output directory creation is postponed, is appropriate, as the directory is correctly created afterwards by FileSystem.create(). We may also need to investigate why mkdir in AbstractFileSystem causes the loss of setgid in this case, and see whether this is a bug or intentional behavior.
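
        A sketch of the shape of that change (illustrative, not the verbatim committed patch): the explicit mkdir of the output directory is dropped from ContainerLocalizer.initDirs(), so the directory only comes into existence when the first map output is written and therefore keeps the setgid group of $appId:

              // $x/usercache/$user/appcache/$appId/filecache
              Path appFileCacheDir = new Path(appBase, FILECACHE);
              appsFileCacheDirs[i] = appFileCacheDir.toString();
              lfs.mkdir(appFileCacheDir, null, false);
              // The explicit creation of $x/usercache/$user/appcache/$appId/output
              // is removed here; FileSystem.create() makes it on demand later,
              // preserving the parent's setgid group.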

        Roman Shaposhnik added a comment -

        This patch seems to be working in my case. I'd recommend including it in trunk and the 0.23 branch.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12515738/MAPREDUCE-3728.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1914//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1914//console

        This message is automatically generated.

        Ahmed Radwan added a comment -

        Thanks Roman! I was able to reproduce this issue on a secure cluster.

        Here is more explanation of what is happening: for each application, the $x/usercache/$user/appcache/$appId directory has the setgid permission; the intention is for files and sub-directories created within it to inherit its group ID (i.e., hadoop) rather than the primary group ID of the user who created them. However, the explicit creation of $x/usercache/$user/appcache/$appId/output in ContainerLocalizer.java:

        lfs.mkdir(new Path(appBase, OUTPUTDIR), null, false);
        

        causes the "output" directory to lose the setguid, and accordingly the attempts' outputs created inside it has the user's group ID, and hence user "yarn" is now unable to access these map outputs. Note that, yarn has group ID "hadoop", so if setguid was preserved, then it would have been able to access the files.

        The attached patch addresses this issue by simply avoiding this way of creating the output directory; it is later created correctly when map outputs are written.

        I have tested it by running the example pi and wordcount jobs on a single-node secure cluster, and also on an insecure single-node cluster.
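
        For reference, the pi smoke test can be run along these lines (jar name and path vary by installation; the map and sample counts are illustrative):

          $ hadoop jar hadoop-mapreduce-examples-*.jar pi 4 1000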

        Roman Shaposhnik added a comment -

        Here's a more direct way to reproduce the problem.

        # sudo - yarn
        yarn$ mkdir -p /tmp/TEST/{logs,locs} /tmp/TEST/locs/usercache
        yarn$ cp /tmp/cont1.tokens /tmp/TEST/cont1.tokens
        yarn$ container-executor rvs 0 app1 /tmp/TEST/cont1.tokens /tmp/TEST/locs /tmp/TEST/logs /usr/java/jdk1.6.0_26/jre/bin/java -classpath /usr/lib/hadoop/lib/\*:/usr/lib/hadoop/\*:/etc/hadoop/conf/nm-config/log4j.properties:/etc/hadoop/conf -Djava.library.path=/usr/lib/hadoop/lib/native org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer rvs app1 cont1 0.0.0.0 4344 /tmp/TEST/locs 
        
        main : command provided 0
        main : user is rvs
        12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir;  Ignoring.
        12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir;  Ignoring.
        12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir;  Ignoring.
        12/02/09 11:54:40 INFO util.NativeCodeLoader: Loaded the native-hadoop library
        12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir;  Ignoring.
        12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir;  Ignoring.
        12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to override final parameter: mapreduce.cluster.local.dir;  Ignoring.
        =========== Using localizerTokenSecurityInfo
        12/02/09 11:54:41 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 0 time(s).
        12/02/09 11:54:42 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 1 time(s).
        12/02/09 11:54:43 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 2 time(s).
        12/02/09 11:54:44 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 3 time(s).
        12/02/09 11:54:45 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 4 time(s).
        12/02/09 11:54:46 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 5 time(s).
        12/02/09 11:54:47 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 6 time(s).
        12/02/09 11:54:48 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 7 time(s).
        12/02/09 11:54:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 8 time(s).
        12/02/09 11:54:50 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 9 time(s).
        java.lang.reflect.UndeclaredThrowableException
        	at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
        	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:221)
        	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
        	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:345)
        Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call From c0506.hal.cloudera.com/172.29.81.158 to 0.0.0.0:4344 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:148)
        	at $Proxy6.heartbeat(Unknown Source)
        	at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:54)
        	... 3 more
        Caused by: java.net.ConnectException: Call From c0506.hal.cloudera.com/172.29.81.158 to 0.0.0.0:4344 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:686)
        	at org.apache.hadoop.ipc.Client.call(Client.java:1141)
        	at org.apache.hadoop.ipc.Client.call(Client.java:1100)
        	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:145)
        	... 5 more
        Caused by: java.net.ConnectException: Connection refused
        	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:488)
        	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469)
        	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563)
        	at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211)
        	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
        	at org.apache.hadoop.ipc.Client.call(Client.java:1117)
        	... 7 more
        

        As you can see, the localization process got to the point where it was
        trying to fetch the files for localization, which means it had completed
        all the filesystem manipulations. The next step would have been
        launching a container under the user id of 'rvs'. So let's see where
        that container would have put its intermediate results:

          yarn$ ls -ld /tmp/TEST/locs/usercache/rvs/appcache/app1/output/
          drwxr-xr-x 2 rvs yarn 4096 Feb  9 11:54 /tmp/TEST/locs/usercache/rvs/appcache/app1/output/
        

        Quite naturally, given that the sticky bit is no longer present on the output dir
        AND the fact that user rvs has rvs as its default group, the resulting
        files are totally out of reach for anything running under the yarn account.
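
        (To inspect the bit directly, here is an illustrative one-liner assuming GNU coreutils; a healthy directory would show an 's' in the group triad, e.g. drwxr-s---:)

          yarn$ stat -c '%A %U %G' /tmp/TEST/locs/usercache/rvs/appcache/app1/output/
          drwxr-xr-x rvs yarn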

        Eli Collins added a comment -

        Bumping priority; it seems pretty critical that MR2 work with security. MR guys, is there something we're missing?

        Roman Shaposhnik added a comment -

        Maybe I am missing something, but I don't understand this. The output-dir is being created with null FilePermissions which becomes 0777.

        I suspect this is not about what permissions it is created with, but the sheer fact that permissions are set at all; when that happens, the sticky bit gets lost.
        Here's a trivial example: as you can see, when a subdir is created it at first retains the sticky bit, but when
        I set permissions explicitly it gets cleared (as it should):

        rvs@ahmed-laptop:/tmp$ cd /tmp
        rvs@ahmed-laptop:/tmp$ mkdir sticky.dir
        rvs@ahmed-laptop:/tmp$ sudo chgrp root sticky.dir 
        rvs@ahmed-laptop:/tmp$ sudo chmod g+s sticky.dir
        rvs@ahmed-laptop:/tmp$ mkdir sticky.dir/subdir
        rvs@ahmed-laptop:/tmp$ ls -l sticky.dir
        total 4
        drwxr-sr-x 2 rvs root 4096 2012-01-25 13:50 subdir
        rvs@ahmed-laptop:/tmp$ chmod 777 sticky.dir/subdir
        rvs@ahmed-laptop:/tmp$ ls -l sticky.dir
        total 4
        drwxrwxrwx 2 rvs root 4096 2012-01-25 13:50 subdir
        

        Instead I suspect that the umask for your testuser is strict. That may be the reason why the files are getting created with 640. Can you please check?

        The problem is not file permissions per se, but the fact that the subdirs (starting from output) have lost the group sticky bit, and thus everything created underneath them defaults to the user's group (and of course no longer carries the sticky group bit on the subdirectories). But anyway, I've checked, and the umask for testuser is 0022.

        Also, just to be sure, are you using DefaultContainerExecutor or LinuxContainerExecutor? It is most likely the latter, but just confirming.

        This is LinuxContainerExecutor since, to the best of my knowledge, one has to configure it for the fully secure cluster.

        Vinod Kumar Vavilapalli added a comment -

        reveals that lfs.mkdir ends up manipulating permissions and thus clears the sticky bit from output and filecache.

        Maybe I am missing something, but I don't understand this. The output-dir is being created with null FilePermissions which becomes 0777.

        Instead I suspect that the umask for your testuser is strict. That may be the reason why the files are getting created with 640. Can you please check?

        Also, just to be sure, are you using DefaultContainerExecutor or LinuxContainerExecutor? It is most likely the latter, but just confirming.

        My understanding was that the whole sequence of events here was predicated on a sticky bit set so that daemons running under the user yarn (default group yarn) can have access to the resulting files and subdirectories down at output and below.

        Yes, that is the case with YARN and even mrv1/1.0.

        On a related note, when the shuffle side of the Pi job failed the job itself didn't. It went into the endless loop [..]

        Yes, this is a known issue: MAPREDUCE-3418.


          People

          • Assignee:
            Ding Yuan
          • Reporter:
            Roman Shaposhnik
          • Votes:
            0
          • Watchers:
            14
