Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-5765

Revert CHMOD on the new dirs created-LinuxContainerExecutor creates appcache and its subdirectories with wrong group owner.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.8.0, 3.0.0-alpha1
    • Fix Version/s: 3.0.0-alpha2
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      This change reverts YARN-5287 from 3.0.0-alpha1. chmod clears the set-group-ID bit of a regular file hence folder was getting reset with the rights.
      Show
      This change reverts YARN-5287 from 3.0.0-alpha1. chmod clears the set-group-ID bit of a regular file hence folder was getting reset with the rights.

      Description

      Earlier fix for YARN-5287 has been reverted due to this issue : LinuxContainerExecutor creates usercache/{userId}/appcache/{appId} with wrong group owner, causing Log aggregation and ShuffleHandler to fail because node manager process does not have permission to read the files under the directory.
      This can be easily reproduced by enabling LCE and submitting a MR example job as a user that does not belong to the same group that NM process belongs to.

      1. YARN-5765.001.patch
        2 kB
        Naganarasimha G R

        Issue Links

          Activity

          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Thanks for identifying the issue Haibo Chen and reviews from Haibo Chen, Miklos Szegedi & Varun Vasudev.
          Will put the patch some of the commnts from here to the parent jira !

          Show
          Naganarasimha Naganarasimha G R added a comment - Thanks for identifying the issue Haibo Chen and reviews from Haibo Chen , Miklos Szegedi & Varun Vasudev . Will put the patch some of the commnts from here to the parent jira !
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Resolving the issue as fixed with proper update on release notes and description update

          Show
          Naganarasimha Naganarasimha G R added a comment - Resolving the issue as fixed with proper update on release notes and description update
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10836 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10836/)
          Reverted due to issue YARN-5765. Revert "YARN-5287. (naganarasimha_gr: rev 43aef303bf9b71293b00c7ed6e8807d15274ca95)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.h
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10836 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10836/ ) Reverted due to issue YARN-5765 . Revert " YARN-5287 . (naganarasimha_gr: rev 43aef303bf9b71293b00c7ed6e8807d15274ca95) (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.h (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          YARN-5287 has been, Reverted in branch-2.8, branch-2 & trunk. Will update the release notes and resolve this jira as fixed

          Show
          Naganarasimha Naganarasimha G R added a comment - YARN-5287 has been, Reverted in branch-2.8, branch-2 & trunk. Will update the release notes and resolve this jira as fixed
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Varun Vasudev, i think i need to resolve the issue as fixed in trunk so that we have proper release notes right (so that the changes have this information of the revert)?

          Show
          Naganarasimha Naganarasimha G R added a comment - Varun Vasudev , i think i need to resolve the issue as fixed in trunk so that we have proper release notes right (so that the changes have this information of the revert)?
          Hide
          miklos.szegedi@cloudera.com Miklos Szegedi added a comment -

          Thank you, Naganarasimha G R for the reply!
          Just to finish this discusssion before you resolve this bug.
          "What i want to understand is what impact would it have if we do it always ? As we never run the container-executor.c binary with root user refer (set_user -> check_user) and would it be sufficient to reset the umask after mkdir ?"
          I was more concerned about any future changes, who will just call mkdirs without knowing that it actually changes umask. Resetting the umask mitigates possible side effects. All in all, if the umask change is chosen, I think it is best to put it into create_container_directories.
          "One more question is if we reset the umask after mkdir then will the container logs created will be accessible to the NM because of restrictive rights ? would be ideal to set default ACL for the folders created and reset the umask so that files created by the user under these directories have the rightful permissions?"
          I think a more future proof approach to default acls and umask could be a feature like the current bin/container-executor --checksetup, when specifying the user and root dir tells the administrator, if the user/dir has the right settings to run containers. It gives an advice what needs to be set if it does not. This has a different philosophy, I admit.
          Does this answer your questions?

          Show
          miklos.szegedi@cloudera.com Miklos Szegedi added a comment - Thank you, Naganarasimha G R for the reply! Just to finish this discusssion before you resolve this bug. "What i want to understand is what impact would it have if we do it always ? As we never run the container-executor.c binary with root user refer (set_user -> check_user) and would it be sufficient to reset the umask after mkdir ?" I was more concerned about any future changes, who will just call mkdirs without knowing that it actually changes umask. Resetting the umask mitigates possible side effects. All in all, if the umask change is chosen, I think it is best to put it into create_container_directories. "One more question is if we reset the umask after mkdir then will the container logs created will be accessible to the NM because of restrictive rights ? would be ideal to set default ACL for the folders created and reset the umask so that files created by the user under these directories have the rightful permissions?" I think a more future proof approach to default acls and umask could be a feature like the current bin/container-executor --checksetup, when specifying the user and root dir tells the administrator, if the user/dir has the right settings to run containers. It gives an advice what needs to be set if it does not. This has a different philosophy, I admit. Does this answer your questions?
          Hide
          vvasudev Varun Vasudev added a comment -

          Let's re-open YARN-5287 and find the right fix there. We can just close this issue once the reverts are done.

          Show
          vvasudev Varun Vasudev added a comment - Let's re-open YARN-5287 and find the right fix there. We can just close this issue once the reverts are done.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Varun Vasudev,
          Ok no problem in reverting in other branches too was just considering that there was no immediate releases for 3.0 and 2.9, But as you said there was no point in divering the code. In that case would it be good to reopen YARN-5287 and close this jira or how to handle YARN-5287 jira status ?

          Show
          Naganarasimha Naganarasimha G R added a comment - Varun Vasudev , Ok no problem in reverting in other branches too was just considering that there was no immediate releases for 3.0 and 2.9, But as you said there was no point in divering the code. In that case would it be good to reopen YARN-5287 and close this jira or how to handle YARN-5287 jira status ?
          Hide
          vvasudev Varun Vasudev added a comment -

          Naganarasimha Garla - can you please revert YARN-5287 on trunk and branch-2 as well? There's no reason to diverge 2.8, plus it's currently blocking the 3.0 alpha 2 release and it's a pain for developers who are developing on branch-2.

          Show
          vvasudev Varun Vasudev added a comment - Naganarasimha Garla - can you please revert YARN-5287 on trunk and branch-2 as well? There's no reason to diverge 2.8, plus it's currently blocking the 3.0 alpha 2 release and it's a pain for developers who are developing on branch-2.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          As we had not reached any conclusion, have reverted the YARN-5287 patch in branch-2.8.
          Hope to conclude faster and push it before any further releases are done on trunk or branch-2 !

          Show
          Naganarasimha Naganarasimha G R added a comment - As we had not reached any conclusion, have reverted the YARN-5287 patch in branch-2.8. Hope to conclude faster and push it before any further releases are done on trunk or branch-2 !
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          @Thanks Haibo Chen & Miklos Szegedi for some insightful comments
          There are 2 other places apart from launch_container_as_user where in mkdirs are getting used.

          main
          RUN_AS_USER_INITIALIZE_CONTAINER
          	mount_cgroup
          	    mkdirs
          		create_validate_dir
          MOUNT_CGROUPS
          	initialize_app
          	    mkdirs
          		create_validate_dir
          

          IIUC only setting umask before change_effective_user would not be ideal as it would be required in other places too.
          What i want to understand is what impact would it have if we do it always ? As we never run the container-executor.c binary with root user refer (set_user -> check_user) and would it be sufficient to reset the umask after mkdir ?

          This means that by removing chmod this change does not apply to cases anymore, when the default ACL is too restrictive. Could this be an issue, or do we rely on the admin to set the default ACL correctly?

          Good query ... something to be thought about ! not sure we will be able to handle it. One more question is if we reset the umask after mkdir then will the container logs created will be accessible to the NM because of restrictive rights ? would be ideal to set default ACL for the folders created and reset the umask so that files created by the user under these directories have the rightful permissions?

          Show
          Naganarasimha Naganarasimha G R added a comment - @Thanks Haibo Chen & Miklos Szegedi for some insightful comments There are 2 other places apart from launch_container_as_user where in mkdirs are getting used. main RUN_AS_USER_INITIALIZE_CONTAINER mount_cgroup mkdirs create_validate_dir MOUNT_CGROUPS initialize_app mkdirs create_validate_dir IIUC only setting umask before change_effective_user would not be ideal as it would be required in other places too. What i want to understand is what impact would it have if we do it always ? As we never run the container-executor.c binary with root user refer (set_user -> check_user) and would it be sufficient to reset the umask after mkdir ? This means that by removing chmod this change does not apply to cases anymore, when the default ACL is too restrictive. Could this be an issue, or do we rely on the admin to set the default ACL correctly? Good query ... something to be thought about ! not sure we will be able to handle it. One more question is if we reset the umask after mkdir then will the container logs created will be accessible to the NM because of restrictive rights ? would be ideal to set default ACL for the folders created and reset the umask so that files created by the user under these directories have the rightful permissions?
          Hide
          miklos.szegedi@cloudera.com Miklos Szegedi added a comment -

          Thank you, Naganarasimha G R for the patch and Haibo Chen for the review. If I understand it correctly, this is the flow of calls.

          launch_container_as_user
            fork
              create_local_dirs
                create_log_dirs
                  mkdir
                change_effective_user
                create_container_directories
                  mkdirs
                    create_validate_dir
          

          We cannot change umask before change_effective_user() I think and changing it in mkdirs() or create_validate_dir() may add side effects to other callers of mkdirs() in the future as Haibo Chen mentioned. What I would do is to set the umask at the beginning of create_container_directories right at the comment below

          // create dirs as 0750
          umask(0027);
          

          I would also reset it to the previous value, before it returns.
          Just a side note: This is what the Linux man page says about mkdir(): "in the absence of a default ACL, the mode of the created directory is (mode & ~umask & 0777)"
          This means that by removing chmod this change does not apply to cases anymore, when the default ACL is too restrictive. Could this be an issue, or do we rely on the admin to set the default ACL correctly?

          Show
          miklos.szegedi@cloudera.com Miklos Szegedi added a comment - Thank you, Naganarasimha G R for the patch and Haibo Chen for the review. If I understand it correctly, this is the flow of calls. launch_container_as_user fork create_local_dirs create_log_dirs mkdir change_effective_user create_container_directories mkdirs create_validate_dir We cannot change umask before change_effective_user() I think and changing it in mkdirs() or create_validate_dir() may add side effects to other callers of mkdirs() in the future as Haibo Chen mentioned. What I would do is to set the umask at the beginning of create_container_directories right at the comment below // create dirs as 0750 umask(0027); I would also reset it to the previous value, before it returns. Just a side note: This is what the Linux man page says about mkdir(): "in the absence of a default ACL, the mode of the created directory is (mode & ~umask & 0777)" This means that by removing chmod this change does not apply to cases anymore, when the default ACL is too restrictive. Could this be an issue, or do we rely on the admin to set the default ACL correctly?
          Hide
          haibochen Haibo Chen added a comment -

          As expected, your fix worked in my cluster test.

          Show
          haibochen Haibo Chen added a comment - As expected, your fix worked in my cluster test.
          Hide
          haibochen Haibo Chen added a comment -

          Thanks Naganarasimha G R for the patch! I'd expect this to work since I have tested a similar change. While I am testing it in a cluster, we can discuss what is the best place to put the umask() system call. IMHO, create_validate_dir may not be the best place to put it. The reason being processes that execute mkdirs() may not notice their umasks have been modified to 0027 in create_validate_dir(). If a process, say root (I am not sure though whether umask(0027) is unwanted by root), wants to use a different umask other than 0027, it will need to remember to call umask again after mkdirs(). Instead, I think we can call umask() right after fork in launch_container_as_user() or in create_container_directories(). This way, only umask of the child process that runs as the user will be changed to 0027.

          Show
          haibochen Haibo Chen added a comment - Thanks Naganarasimha G R for the patch! I'd expect this to work since I have tested a similar change. While I am testing it in a cluster, we can discuss what is the best place to put the umask() system call. IMHO, create_validate_dir may not be the best place to put it. The reason being processes that execute mkdirs() may not notice their umasks have been modified to 0027 in create_validate_dir(). If a process, say root (I am not sure though whether umask(0027) is unwanted by root), wants to use a different umask other than 0027, it will need to remember to call umask again after mkdirs(). Instead, I think we can call umask() right after fork in launch_container_as_user() or in create_container_directories(). This way, only umask of the child process that runs as the user will be changed to 0027.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 4m 20s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 6m 56s trunk passed
          +1 compile 0m 26s trunk passed
          +1 mvnsite 0m 28s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 mvninstall 0m 22s the patch passed
          +1 compile 0m 23s the patch passed
          +1 cc 0m 23s the patch passed
          +1 javac 0m 23s the patch passed
          +1 mvnsite 0m 24s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          -1 unit 15m 44s hadoop-yarn-server-nodemanager in the patch failed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          30m 4s



          Reason Tests
          Failed junit tests hadoop.yarn.server.nodemanager.containermanager.queuing.TestQueuingContainerManager



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:e809691
          JIRA Issue YARN-5765
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12837669/YARN-5765.001.patch
          Optional Tests asflicense compile cc mvnsite javac unit
          uname Linux 462b8b244822 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 59bc84a
          Default Java 1.8.0_101
          unit https://builds.apache.org/job/PreCommit-YARN-Build/13799/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13799/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/13799/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 4m 20s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 6m 56s trunk passed +1 compile 0m 26s trunk passed +1 mvnsite 0m 28s trunk passed +1 mvneclipse 0m 13s trunk passed +1 mvninstall 0m 22s the patch passed +1 compile 0m 23s the patch passed +1 cc 0m 23s the patch passed +1 javac 0m 23s the patch passed +1 mvnsite 0m 24s the patch passed +1 mvneclipse 0m 11s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. -1 unit 15m 44s hadoop-yarn-server-nodemanager in the patch failed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 30m 4s Reason Tests Failed junit tests hadoop.yarn.server.nodemanager.containermanager.queuing.TestQueuingContainerManager Subsystem Report/Notes Docker Image:yetus/hadoop:e809691 JIRA Issue YARN-5765 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12837669/YARN-5765.001.patch Optional Tests asflicense compile cc mvnsite javac unit uname Linux 462b8b244822 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 59bc84a Default Java 1.8.0_101 unit https://builds.apache.org/job/PreCommit-YARN-Build/13799/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13799/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/13799/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Thanks Haibo Chen,
          Sorry for the delay and never mind to give comments at any point of time...
          what i meant was not just add "umask(0027)" but kind of revert the solution of YARN-5287 and to add "umask(0027)" to solve the original issue for which chmod was introduced.
          Attached the patch and it would be helpful if you can test it.
          Have not added test case as YARN-5287 have already added it

          Show
          Naganarasimha Naganarasimha G R added a comment - Thanks Haibo Chen , Sorry for the delay and never mind to give comments at any point of time... what i meant was not just add "umask(0027)" but kind of revert the solution of YARN-5287 and to add "umask(0027)" to solve the original issue for which chmod was introduced. Attached the patch and it would be helpful if you can test it. Have not added test case as YARN-5287 have already added it
          Hide
          haibochen Haibo Chen added a comment - - edited

          Hey, Naganarasimha G R Sorry for getting in the way again while you are working to post a patch. For some reason that I am not aware of, umask(0027) did not work for me. And here is my theory why it does not work.
          Please correct me if I am wrong.

          In a nutshell, in my case, the appcache directory is first created with ownership being {user: systest, group: hadoop} and gid set. Then because systest does not belong to group hadoop in my cluster setup,
          chmod(appcache, perm) that runs as user systest will clear the gid. Consequentially, all the subdirectories and files created under appcache by code that runs as systest will not be owned by group hadoop any more.
          NM that runs as a user that only belongs to hadoop will have no permission to these files/directories.

          In my experiment with umask, it seems to me that chmod clears the gid as long as systest does not belong to hadoop, regardless of the umask. If umask approach has been working for you, I must have misunderstood
          your approach and will verify your patch once it's posted. One alternative I can think of, is to revert YARN-5287 patch and follow the other approach as you pointed out (explicitly set umask) to fix YARN-5287. Thought?

          Show
          haibochen Haibo Chen added a comment - - edited Hey, Naganarasimha G R Sorry for getting in the way again while you are working to post a patch. For some reason that I am not aware of, umask(0027) did not work for me. And here is my theory why it does not work. Please correct me if I am wrong. In a nutshell, in my case, the appcache directory is first created with ownership being {user: systest, group: hadoop} and gid set. Then because systest does not belong to group hadoop in my cluster setup, chmod(appcache, perm) that runs as user systest will clear the gid. Consequentially, all the subdirectories and files created under appcache by code that runs as systest will not be owned by group hadoop any more. NM that runs as a user that only belongs to hadoop will have no permission to these files/directories. In my experiment with umask, it seems to me that chmod clears the gid as long as systest does not belong to hadoop, regardless of the umask. If umask approach has been working for you, I must have misunderstood your approach and will verify your patch once it's posted. One alternative I can think of, is to revert YARN-5287 patch and follow the other approach as you pointed out (explicitly set umask) to fix YARN-5287 . Thought?
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Well thanks for assigning , but was more interested whether you could validate umask fits in your case too.., i will update a patch hope you could help me in validating the same !

          Show
          Naganarasimha Naganarasimha G R added a comment - Well thanks for assigning , but was more interested whether you could validate umask fits in your case too.., i will update a patch hope you could help me in validating the same !
          Hide
          haibochen Haibo Chen added a comment -

          Sorry Naganarasimha G R, did not mean to dismiss your solution. Assigning this to you since you have got a working solution. Let's resolve this blocker. Ideally, I still prefer to make changes to mkdirs() which all directory creations have to call.

          Show
          haibochen Haibo Chen added a comment - Sorry Naganarasimha G R , did not mean to dismiss your solution. Assigning this to you since you have got a working solution. Let's resolve this blocker. Ideally, I still prefer to make changes to mkdirs() which all directory creations have to call.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          I am assuming you are doing again stat(npath, &sb) != 0 after if (mkdir(npath, perm) != 0) in the else block before the code block which you have shared...

          We have been using the first approach umask(0027) in our code base for a while and i am pretty sure it should work fine, why not try that once ?

          As its a blocker i plan to revert the YARN-5287 patch in 2.8, hope we can conclude on something earlier.

          Show
          Naganarasimha Naganarasimha G R added a comment - I am assuming you are doing again stat(npath, &sb) != 0 after if (mkdir(npath, perm) != 0) in the else block before the code block which you have shared... We have been using the first approach umask(0027) in our code base for a while and i am pretty sure it should work fine, why not try that once ? As its a blocker i plan to revert the YARN-5287 patch in 2.8, hope we can conclude on something earlier.
          Hide
          haibochen Haibo Chen added a comment - - edited

          This is what I am doing.

           if ((sb.st_mode & S_ISUID) != 0) {
                  perm = perm | S_ISUID;
                }
          if ((sb.st_mode & S_ISGID) != 0) {
                  perm = perm | S_ISGID;
          }
          

          But from the test that I have done, it seems like I am still missing something, this change alone does not fix everything. I am chasing it down.

          Show
          haibochen Haibo Chen added a comment - - edited This is what I am doing. if ((sb.st_mode & S_ISUID) != 0) { perm = perm | S_ISUID; } if ((sb.st_mode & S_ISGID) != 0) { perm = perm | S_ISGID; } But from the test that I have done, it seems like I am still missing something, this change alone does not fix everything. I am chasing it down.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Not sure how we can conditionally(on what conditions) set the setGid before chmod. May be can can get your approach once you share the patch.

          Show
          Naganarasimha Naganarasimha G R added a comment - Not sure how we can conditionally(on what conditions) set the setGid before chmod. May be can can get your approach once you share the patch.
          Hide
          haibochen Haibo Chen added a comment -

          Sorry for the delayed response. I am not sure whether some subdirectories assume setGid being unset or not. One solution that I can think of to minimize the changes is to conditionally set setGid and setUid right before chmod in create_validate_dir. I am concurrently testing that. Will upload a patch for preliminary reviews.

          Show
          haibochen Haibo Chen added a comment - Sorry for the delayed response. I am not sure whether some subdirectories assume setGid being unset or not. One solution that I can think of to minimize the changes is to conditionally set setGid and setUid right before chmod in create_validate_dir. I am concurrently testing that. Will upload a patch for preliminary reviews.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          I think doing "perm = perm | S_ISGID" will set the setGID bit regardless of the original permission,

          Agree less riskier option is to go for the option 1 which is umask(0027); before calling mkdir, but having said that doubt what i have is setGid is required for only the user's appcache folder or recursively for all its children ? if recursively done is there any folder which we create do not require this ?

          In between, do we need to raise a new jira for 3.0.0-alpha2 so that release notes gets added in specific which is not required for 2.8 and 2.9 ?

          Show
          Naganarasimha Naganarasimha G R added a comment - I think doing "perm = perm | S_ISGID" will set the setGID bit regardless of the original permission, Agree less riskier option is to go for the option 1 which is umask(0027); before calling mkdir , but having said that doubt what i have is setGid is required for only the user's appcache folder or recursively for all its children ? if recursively done is there any folder which we create do not require this ? In between, do we need to raise a new jira for 3.0.0-alpha2 so that release notes gets added in specific which is not required for 2.8 and 2.9 ?
          Hide
          haibochen Haibo Chen added a comment -

          I think doing "perm = perm | S_ISGID" will set the setGID bit regardless of the original permission, which may not be something we want.

          Show
          haibochen Haibo Chen added a comment - I think doing "perm = perm | S_ISGID" will set the setGID bit regardless of the original permission, which may not be something we want.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Thanks Haibo Chen, for giving some insights on the chmod command, was not aware of it, So at first glance As initially suggested it would be be ideal to set umask as {{ umask(0027);}} would ideal than chmod in create_validate_dir but before making the system call mkdir .This will ensure that if the cluster has been configured with a restrictive umask "e.g.: umask 077" still the directories are set with proper rights. or another approach i can think of is setting in this way but not sure whether it works ...

          } else {
           // Explicitly set permission after creating the directory in case
           // umask has been set to a restrictive value, i.e., 0077.
           perm = perm | S_ISGID;
           if (chmod(npath, perm) != 0) {
            int permInt = perm & (S_IRWXU | S_IRWXG | S_IRWXO);
            fprintf(LOGFILE, "Can't chmod %s to the required permission %o - %s\n",
          	                npath, permInt, strerror(errno));
             return -1;
          }
          

          Thoughts ?

          Show
          Naganarasimha Naganarasimha G R added a comment - Thanks Haibo Chen , for giving some insights on the chmod command, was not aware of it, So at first glance As initially suggested it would be be ideal to set umask as {{ umask(0027);}} would ideal than chmod in create_validate_dir but before making the system call mkdir .This will ensure that if the cluster has been configured with a restrictive umask "e.g.: umask 077" still the directories are set with proper rights. or another approach i can think of is setting in this way but not sure whether it works ... } else { // Explicitly set permission after creating the directory in case // umask has been set to a restrictive value, i.e., 0077. perm = perm | S_ISGID; if (chmod(npath, perm) != 0) { int permInt = perm & (S_IRWXU | S_IRWXG | S_IRWXO); fprintf(LOGFILE, "Can't chmod %s to the required permission %o - %s\n" , npath, permInt, strerror(errno)); return -1; } Thoughts ?
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          My Bad, did not mark the release versions after committing the patch to 2.8 and 2.9.

          Show
          Naganarasimha Naganarasimha G R added a comment - My Bad, did not mark the release versions after committing the patch to 2.8 and 2.9.
          Hide
          vvasudev Varun Vasudev added a comment -

          Definitely a blocker for 2.8 as well. cc Naganarasimha Garla and Ying Zhang who worked on YARN-5287.

          Show
          vvasudev Varun Vasudev added a comment - Definitely a blocker for 2.8 as well. cc Naganarasimha Garla and Ying Zhang who worked on YARN-5287 .
          Hide
          jlowe Jason Lowe added a comment -

          Linking the two tickets. YARN-5287 went into 2.8, but this is marked as only affecting 3.0.0-alpha1. Should this be a blocker for 2.8 as well?

          Show
          jlowe Jason Lowe added a comment - Linking the two tickets. YARN-5287 went into 2.8, but this is marked as only affecting 3.0.0-alpha1. Should this be a blocker for 2.8 as well?
          Hide
          haibochen Haibo Chen added a comment -

          I believe this is broken by YARN-5287.

          "chmod clears the set-group-ID bit of a regular file if the file's group ID does not match the user's effective group ID or one of the user's supplementary group IDs, unless the user has appropriate privileges. " According to linux man page. This is inline with the reproduction setup I had.

          Walking through the container-executor.c code,

          {nm_root}/usercache/{userName} is created with correct permission with the group owner being that of the nm process and Setgid set. However, in create_validate_dir(), "mkdir(npath, perm) != 0" returns false on directory {nm_root}

          /usercache/

          {userName}

          /appcache, so chmod(npath, perm) is executed on the directory, clearing the Setgid Bits. Consequentially, all directories/files created under the appcache directory have the wrong group owner.

          The container working directory is also created with the same code, therefore, having wrong group owner as well.

          Show
          haibochen Haibo Chen added a comment - I believe this is broken by YARN-5287 . "chmod clears the set-group-ID bit of a regular file if the file's group ID does not match the user's effective group ID or one of the user's supplementary group IDs, unless the user has appropriate privileges. " According to linux man page. This is inline with the reproduction setup I had. Walking through the container-executor.c code, {nm_root}/usercache/{userName} is created with correct permission with the group owner being that of the nm process and Setgid set. However, in create_validate_dir(), "mkdir(npath, perm) != 0" returns false on directory {nm_root} /usercache/ {userName} /appcache, so chmod(npath, perm) is executed on the directory, clearing the Setgid Bits. Consequentially, all directories/files created under the appcache directory have the wrong group owner. The container working directory is also created with the same code, therefore, having wrong group owner as well.

            People

            • Assignee:
              Naganarasimha Naganarasimha G R
              Reporter:
              haibochen Haibo Chen
            • Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development