Hadoop Common / HADOOP-3698

Implement access control for submitting jobs to queues in the JobTracker

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      HADOOP-3445 implements multiple queues in the JobTracker as part of the new resource manager for Hadoop (HADOOP-3421). There needs to be a mechanism to control who can submit jobs to a specified queue. This JIRA is for tracking the requirements, approach and implementation for the same.

      1. HADOOP-3698.patch
        13 kB
        Hemanth Yamijala
      2. HADOOP-3698.patch
        32 kB
        Hemanth Yamijala
      3. HADOOP-3698.patch
        40 kB
        Hemanth Yamijala
      4. HADOOP-3698.patch
        43 kB
        Hemanth Yamijala
      5. HADOOP-3698.patch
        43 kB
        Hemanth Yamijala

          Activity

          Owen O'Malley made changes -
          Component/s mapred [ 12310690 ]
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hudson added a comment -

          Integrated in Hadoop-trunk #595 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/595/)
          Hemanth Yamijala added a comment -

          Thanks Owen.

          Owen O'Malley made changes -
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Owen O'Malley added a comment -

          I just committed this. Thanks, Hemanth!

          Hemanth Yamijala added a comment -

          Output of ant test-patch:

          [exec] +1 overall.

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec] +1 tests included. The patch appears to include 9 new or modified tests.

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          I also ran ant test and ant contrib-test. There was one failure for TestKosmosFileSystem. But that was because of HADOOP-4069.

          Hemanth Yamijala made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hemanth Yamijala added a comment -

          Hudson did not pick this patch up. Adding again.

          Hemanth Yamijala made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hemanth Yamijala made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hemanth Yamijala added a comment -

          Running through Hudson again.

          Hemanth Yamijala made changes -
          Attachment HADOOP-3698.patch [ 12389417 ]
          Hemanth Yamijala added a comment -

          Thanks for the verification, Owen. The attached patch fixes the test failures. One thing I realized is that the name AccessControlException clashed in some test cases with the class of the same name in the org.apache.hadoop.fs.permission package, and it seemed dangerous to introduce a second class with the same name. So I've renamed it to AccessControlIOException, which seems fair since it extends IOException.

          Hemanth Yamijala made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389299/HADOOP-3698.patch
          against trunk revision 691306.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3163/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3163/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3163/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3163/console

          This message is automatically generated.

          Owen O'Malley made changes -
          Hadoop Flags [Reviewed]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Owen O'Malley added a comment -

          This is looking good. Pushing through HadoopQA.

          Hemanth Yamijala made changes -
          Attachment HADOOP-3698.patch [ 12389299 ]
          Hemanth Yamijala added a comment -

          Minor changes to earlier patch - mainly added 2 additional tests to test submitting jobs to non default queues, and also included the queue name property in hadoop-default.xml.

          Hemanth Yamijala added a comment -

          I had an offline discussion with Owen on his review comments. Here's the summary, and explanation of the changes in the patch with respect to these.

          The JobClient should use UserGroupInformation.login(conf) rather than UnixUserGroupInformation.login(conf).

          The JobClient code was already using UnixUserGroupInformation.login(conf, saveToConf). There was no API equivalent to this one in UserGroupInformation. Discussing options, we decided it's best to have a new JIRA to address this issue in UserGroupInformation. I will open the JIRA and link it off here. Until then, because this issue is holding up other JIRAs, we decided to retain the current code as is.

          I think we should have an exception class to catch for login failures. So we either need to add LoginException to submitJob in JobClient or we should create a new IOException that is for login failures and put it into org.apache.hadoop.security. I'd lean toward changing the signature of run/submitJob.

          Because having run/submitJob throw a new exception would break backwards compatibility, we decided to go with the other option, that is, create a subclass of IOException in org.apache.hadoop.security. While doing this, I noticed there is already an AccessControlException in org.apache.hadoop.fs.permission. Since both classes are identical, I think we should keep one (possibly the one being defined here) and deprecate and ultimately replace the other. This could be part of the cleanup JIRA being opened above. Thoughts?
          Also, please note that UserGroupInformation still throws only LoginException. This seems fair because the calls that generate it are not remote calls, and hence don't need to throw IOException or a subclass of it.
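          As a concrete illustration, the proposed exception could look like the following minimal sketch. Only the name AccessControlIOException and the fact that it extends IOException come from this discussion; the constructor shown is an assumption.

```java
import java.io.IOException;

// Minimal sketch of the proposed access-control exception.
// Only the name and the IOException parent come from the discussion;
// the constructor shown here is illustrative.
class AccessControlIOException extends IOException {
  public AccessControlIOException(String message) {
    super(message);
  }
}
```

          Because it subclasses IOException, existing callers of run/submitJob that already handle IOException keep compiling unchanged.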

          I'd suggest handling mapred.acls.enabled inside of the QueueManager instead of the JobTracker. The same for giving the user access to their own jobs. That should be done in QueueManager, so that all of the rules & exceptions for allowing actions are contained inside the one class.

          Done. I've defined a new API in QueueManager to handle this:

          boolean hasAccess(String queueName, JobInProgress job, UserGroupInformation user, QueueOperation action) { ... }
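          To illustrate how these rules might compose inside QueueManager, here is a self-contained toy sketch. The rule ordering (ACLs disabled allows everything; a job's owner can always administer their own job; otherwise the queue ACL decides) follows the comments above, but the class and field names besides hasAccess and QueueOperation are hypothetical; the real patch works with Hadoop's UserGroupInformation and JobInProgress types.

```java
import java.util.*;

// Toy sketch of the access rules described above: if ACLs are disabled
// everything is allowed; a job's owner can always administer their own
// job; otherwise the queue's ACL for the operation decides.
class ToyQueueManager {
  enum QueueOperation { SUBMIT_JOB, ADMINISTER_JOBS }

  private final boolean aclsEnabled;            // stands in for mapred.acls.enabled
  private final Map<String, Set<String>> acls;  // "queue#op" -> allowed users

  ToyQueueManager(boolean aclsEnabled, Map<String, Set<String>> acls) {
    this.aclsEnabled = aclsEnabled;
    this.acls = acls;
  }

  boolean hasAccess(String queue, String jobOwner, String user, QueueOperation op) {
    if (!aclsEnabled) {
      return true;                              // ACL checks turned off globally
    }
    if (op == QueueOperation.ADMINISTER_JOBS && user.equals(jobOwner)) {
      return true;                              // users control their own jobs
    }
    Set<String> allowed = acls.get(queue + "#" + op);
    return allowed != null && allowed.contains(user);
  }
}
```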
          

          The JobTracker should use the standard method for getting the queue of the job instead of using the attribute name "queue.name".

          This was being done in HADOOP-3445. Still, I've defined these changes in the JobConf and JobProfile classes to get the queue name of a job. It defaults to "default", which is the default queue defined in the queue manager. This change might require this JIRA to be committed before HADOOP-3445.
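          A minimal sketch of such an accessor, using a plain Properties object as a stand-in for JobConf. The key "queue.name" is the attribute name mentioned above, and the "default" fallback mirrors the default queue; the wrapper class is purely illustrative.

```java
import java.util.Properties;

// Sketch of the queue-name accessor described above; Properties stands
// in for JobConf. Falling back to "default" matches the default queue
// defined in the queue manager.
class QueueNameConf {
  private final Properties props;

  QueueNameConf(Properties props) {
    this.props = props;
  }

  String getQueueName() {
    return props.getProperty("queue.name", "default");
  }
}
```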

          Why does the TaskScheduler need the QueueManager? That seems really unexpected.

          This was done as discussed in HADOOP-3930. However, Owen suggested that this be defined as part of the TaskTrackerManager interface, with a getQueueManager method implemented in the JobTracker. Implemented that change.

          Hemanth Yamijala made changes -
          Attachment HADOOP-3698.patch [ 12389288 ]
          Hemanth Yamijala added a comment -

          New patch incorporating review comments.

          Owen O'Malley added a comment -

          The JobClient should use UserGroupInformation.login(conf) rather than UnixUserGroupInformation.login(conf).

          I think we should have an exception class to catch for login failures. So we either need to add LoginException to submitJob in JobClient or we should create a new IOException that is for login failures and put it into org.apache.hadoop.security. I'd lean toward changing the signature of run/submitJob.

          I'd suggest handling mapred.acls.enabled inside of the QueueManager instead of the JobTracker. The same for giving the user access to their own jobs. That should be done in QueueManager, so that all of the rules & exceptions for allowing actions are contained inside the one class.

          The JobTracker should use the standard method for getting the queue of the job instead of using the attribute name "queue.name".

          Why does the TaskScheduler need the QueueManager? That seems really unexpected.

          Hemanth Yamijala added a comment -

          A more complete patch. Has unit tests and javadocs. Request to provide comments on the implementation.

          Since the QueueManager is the class being used by the framework, some changes are required to ResourceManagerConf (used only by HADOOP-3445). This includes moving it to a contrib package for schedulers, removing generic queue information from it, etc. I did not make these changes in this patch to keep the merging of patches simple. I will open a new JIRA for making those changes.

          Hemanth Yamijala made changes -
          Attachment HADOOP-3698.patch [ 12388976 ]
          Hemanth Yamijala added a comment -

          I am attaching a patch for early review. It isn't complete, and is more to illustrate how the implementation is going to look. Please share your comments.

          The patch includes:

          • The QueueManager class that exposes the APIs as discussed above.
          • Changes to the APIs submitJob, killJob and killTask in the JobTracker to verify ACLs and throw an IOException if the verification fails.
          • Changes in JobClient to pass the UnixUserGroupInformation to the RPC Proxy, so that it gets set to the JobTracker.
          • Included the default queue and ACLs for them in hadoop-default.xml. The format used is the same as what was discussed in HADOOP-3479. I'm calling them mapred.queue.<queue-name>.<acl-operation-name>. The value has the format "<comma-separated-user-list> <comma-separated-group-list>". Open to suggestions on better format for this.
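          To make the value format concrete, here is a small self-contained parser sketch for a value like "alice,bob engineering,qa" (user list, then group list, separated by whitespace). The helper class and method names are hypothetical and not from the patch; only the value format itself comes from the comment above.

```java
import java.util.*;

// Sketch of parsing an ACL value of the form
// "<comma-separated-user-list> <comma-separated-group-list>",
// e.g. "alice,bob engineering,qa". Helper names are illustrative.
class AclValue {
  final Set<String> users = new HashSet<>();
  final Set<String> groups = new HashSet<>();

  AclValue(String value) {
    String[] parts = value.trim().split("\\s+", 2);
    if (parts.length > 0 && !parts[0].isEmpty()) {
      users.addAll(Arrays.asList(parts[0].split(",")));
    }
    if (parts.length > 1) {
      groups.addAll(Arrays.asList(parts[1].split(",")));
    }
  }

  // A user is allowed if listed directly or via any of their groups.
  boolean allows(String user, Collection<String> userGroups) {
    if (users.contains(user)) {
      return true;
    }
    for (String g : userGroups) {
      if (groups.contains(g)) {
        return true;
      }
    }
    return false;
  }
}
```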

          Tested this patch manually for job submission, and killing of job and tasks and it is working fine.

          Things to do:

          • Remove ResourceManagerConf and migrate HADOOP-3445 scheduler methods into a separate class.
          • Documentation, unit tests

          Things to do, possibly outside this patch:

          • The web UI provides a way to change job priorities while the CLI does not. For ACLs to work properly, should this be inverted?
          Hemanth Yamijala made changes -
          Attachment HADOOP-3698.patch [ 12388682 ]
          Hemanth Yamijala added a comment -

          Just a minor comment here: while queues are considered part of the framework, they are still related to the scheduler that is configured. For example, the default scheduler JobQueueTaskScheduler does not support more than one queue. So, if someone configures two queue names but leaves the scheduler as the default one, things might not work as expected. The documentation of the queue names' configuration parameter should probably mention this relationship. Thanks to Sreekanth for pointing this out.

          Hemanth Yamijala added a comment -

          +1 for renaming ResourceManagerConf to QueueManager. The earlier name had a connotation that it was related to the Configuration or JobConf classes, while it was not.

          I am not too keen on where the generic properties are defined. So, it's a 1:1 vote for now; I'll wait to see if others have a specific preference.

          We also need:

          void refresh(Configuration conf);
          

          added to QueueManager.

          Owen O'Malley added a comment -

          I'd prefer just using hadoop-site.xml for the generic properties.

          I'd suggest renaming the ResourceManagerConf to something like QueueManager and merging in QueueAccessVerifier. It would look something like:

          class QueueManager {
            QueueManager(Configuration conf) { ... }
            List<String> getQueueNames() { ... }
            static enum QueueOperation { SUBMIT_JOB, ADMINISTER_JOBS }
            boolean hasAccess(String queueName, UserGroupInformation user, QueueOperation action) { ... }
          }
          
          Vivek Ratan added a comment -

          My earlier comments were not quite right. Hemanth's point is valid: there is some queue configuration that is generic across all queues. This includes queue names and ACLs, for now. Then there is queue configuration that is different for different schedulers. We should put these two in separate files, though they're tied together by queue names, which need to be common across both.

          Should the generic queue config be in resource-manager-conf or in hadoop-default/hadoop-site? Part of the reason for keeping queue-specific stuff in a separate file was to isolate changes to hadoop-default while we figure out a more generic way to do queue config (for which HADOOP-3579 was opened). That reasoning still sounds valid, so the generic queue config should stay in resource-manager-conf till we resolve 3579.

          Hemanth Yamijala added a comment -

          Assuming that queues will become part of the framework, I would like to propose the following mechanism to handle queue configuration. The requirements are:

          • Generic queue configuration, presently only queue names and ACLs, will be in a single place. I suggest we use the resource-manager-conf.xml to store this. But if people prefer hadoop-site that's fine too.
          • There should be APIs to access this generic queue configuration.
          • Scheduler specific queue configuration must not be in the generic queue configuration, as different schedulers would likely have different parameters to configure and they'd do them the way they want. Also, if we move to having schedulers in different packages or contrib, as being discussed in HADOOP-3746, this split would greatly help.

          This does have a disadvantage that queue names and their scheduler configuration are at different places, but that seems like a tradeoff required to keep some parts of the queue administration in the core framework itself.

          Does this sound right ?

          If yes, I propose to make the following changes:

          • Modify ResourceManagerConf to be the way to access information about queue names and ACLs, and remove all configuration specific to HADOOP-3445 from it.
          • Also, resource-manager-conf.xml will have only queue names and ACLs.
          • The configuration required for HADOOP-3445 will be moved out to a different file, and there will be a different package private class to access it.

          Comments, please.

          Hemanth Yamijala added a comment -

          What do you mean by sysadmins? Do they have special privileges beyond the ACL? I think it makes sense for the superuser and the supergroup (of HDFS) to have full permissions to job queues without the need for their username/groupname to appear in the ACL. Is that what you meant by sysadmins?

          Sanjay, yes. This is what I mean by sysadmins. They have complete control over the system, without needing special access control to be set up for them.

          I suggest we add the "changeACL" operation right now; its default list should be null (i.e. only the superuser/supergroup can modify it).

          Is this because you believe that this operation can be delegated to other users/groups? Also, for this version of the resource manager, we don't have a UI for administering the queues. In other words, even if we declare the operation right now, there is no place where we need to check it. I suggest we add this as a TODO comment to the Enum I've declared above, along the lines of what Owen proposed for listjobs. Makes sense?

          Vivek Ratan added a comment -

          Considering these, I think we have two options regarding where to specify queues, and scheduler specific queue configuration:

          I'd favor the first option. Placing all the queue configuration in one file (resource-manager-conf.xml) makes it easier for someone to configure queues. They shouldn't care about what configuration is scheduler-specific and what is not. The individual components (JobTracker, Scheduler) can read whatever they need from this one file, or rather, from the one class that represents this file. You don't want this split (each component reading what it wants) to necessarily show up in the config file. If you move code around later, you'd end up moving config values between various files, which is not good.

          Sanjay Radia added a comment -

          Hemanth says:
          >In the same discussions, it seemed like sysadmins would like to control this themselves as well. Again, there did not seem a need to set up access control.
          What do you mean by sysadmins? Do they have special privileges beyond the ACL? I think it makes sense
          for the superuser and the supergroup (of HDFS) to have full permissions to job queues without the need for their username/groupname to appear in the ACL. Is that what you meant by sysadmins?

          >That said, it is really simple to add new operations in the design above when we feel we need it. That's one advantage with this approach.
          I suggest we add the "changeACL operation right now; its default list should be null (ie only the superuser/supergroup can modify it).

          Hemanth Yamijala added a comment -

          There seems to be a consensus that managing ACLs should be with the JobTracker. This is Option 1 which I've mentioned above. I'm OK taking this route.

          Like I mentioned in that option, this would mean that queue configuration would be split across two places:

          • the generic part (queue names and ACLs, for now)
          • scheduler specific part

          The simplest way of mapping the generic configuration with scheduler specific one is to link them using the queue name in some manner.

          In HADOOP-3479, we introduced a resource-manager-conf.xml and a ResourceManagerConf class where queues and their properties can be specified. For example, queue properties specific to the scheduler in HADOOP-3445 are defined in this file, and read by ResourceManagerConf.

          Considering these, I think we have two options regarding where to specify queues, and scheduler specific queue configuration:

          • Continue to use resource-manager-conf.xml and the ResourceManagerConf class. All implementations of schedulers can use this to get list of queues and any other common properties that are applicable to queues. The HADOOP-3445 scheduler can continue to use resource-manager-conf.xml to store additional configuration it needs.
          • Move the configuration of the list of queues, and ACLs, to hadoop-default or hadoop-site (using the same format as defined in HADOOP-3479), and have resource-manager-conf.xml store configuration specific to HADOOP-3445.

          To reflect the generic nature of queues, it seems better to take the second option. But I am open to comments.
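          For illustration, here is a rough sketch of how the second option might look from the JobTracker's side, with the queue list and per-queue ACLs read from flat configuration keys. It uses java.util.Properties in place of Hadoop's Configuration to stay self-contained, and the property names (mapred.queue.names, mapred.queue.&lt;queue&gt;.acl-submit-job) are assumptions for the sketch, not names fixed by this discussion:

```java
// Hypothetical sketch: reading generic queue configuration (queue names plus
// submit ACLs) from flat key/value properties, as hadoop-default/hadoop-site
// would hold them. All key names here are illustrative assumptions.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

class QueueConfSketch {
  static Map<String, String> readSubmitAcls(Properties conf) {
    Map<String, String> acls = new LinkedHashMap<>();
    // comma-separated list of configured queues; fall back to "default"
    String names = conf.getProperty("mapred.queue.names", "default");
    for (String queue : names.split(",")) {
      queue = queue.trim();
      // one ACL property per (queue, operation) pair; "*" = everyone
      String key = "mapred.queue." + queue + ".acl-submit-job";
      acls.put(queue, conf.getProperty(key, "*"));
    }
    return acls;
  }
}
```

          A scheduler-specific file (resource-manager-conf.xml in HADOOP-3445) would then carry only the remaining, scheduler-private per-queue settings, keyed by the same queue names.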

          Vivek Ratan added a comment -

          Thinking a bit more about this, I am wondering where the responsibility of managing ACLs should be. There seem to be two choices: ...

          I think it should be with the JobTracker. Queues are first-class citizens across the system. JT's submitJob() should look at access control - the job shouldn't even reach the scheduler unless it passes access control. Agree with Owen that we don't need Schedulers to deal with security.

          It's OK for both the JT and Schedulers to read queue-based information. Both can make use of the ResourceManagerConf class.

          Owen O'Malley added a comment -

          I think that we should have the security in the framework rather than the scheduler. I think that will provide a better basis for security and prevent schedulers from having to deal with security, which would be problematic and error-prone.

          Hemanth Yamijala added a comment -

          We could fold changing queue properties into the admin operation you listed above.

          Yes, this is what I meant by 'configuring a queue'. It means changing queue properties. In internal discussions, the consensus was that this operation will be controlled only by administrators of the system itself - and not delegated to others by setting up access control.

          The changeACL is a new operation - the users who are permitted this are effectively the queue owners.

          In the same discussions, it seemed like sysadmins would like to control this themselves as well. Again, there did not seem a need to set up access control.

          That said, it is really simple to add new operations in the design above when we feel we need it. That's one advantage with this approach.

          It seems to me that job operations and file system operations are similar ... Is the analogue right? If yes, we might want to have a similar access control model as the one in file system.

          This was indeed the option we considered at first. There were a couple of reasons we chose the model I proposed over the file system ACLs model:

          • While it seems easy to map operations like you've said, the problem was more with the roles: the current ACL model of the file system uses Posix like ACLs, which have three roles - owner, group and others. For queue administration, there are the following roles: users, job-administrators, queue-administrators. Each of these roles could be an individual user or users in a group. In particular, the user-group who can submit jobs is different from the job or queue admin groups. Since in the Posix model, there is only one group possible, it was not flexible enough to set up the ACLs as needed.
          • While there is a very good similarity between the file system operations and queue operations, it seems more logical to model queue administration based on how it is typically done in other scheduling / resource-management systems. All of these seem to follow the model I've proposed above.
          • As is seen from this comment, it is easy (more flexible) to add more operations in this model.

          I think ADMINISTER_JOBS sounds too general. We might want to divide it into more specific operations, say DELETE_JOB, SET_JOB_PRIORITY, etc.

          Hmm... it seems to me that a person who needs to do one of these, will in a different situation need to do the other as well. For instance, a job-admin might be asked to bump up the priority of a job, as it is realized that it needs to complete before others. This same admin might be asked to delete a job because it is stuck in front of the queue without progress and impeding the jobs after it. Can you explain your use-case about why you think these should be different ?

          Tsz Wo Nicholas Sze added a comment -
          • It seems to me that job operations and file system operations are similar:
            job                   file
            -------------------   ------------------------
            submit/delete job     create/delete file
            set job attributes    set file attributes
            queue                 directory
            list queue            list directory
            create/delete queue   create/delete directory

            Is the analogue right? If yes, we might want to have a similar access control model as the one in file system.

          • I think ADMINISTER_JOBS sounds too general. We might want to divide it into more specific operations, say DELETE_JOB, SET_JOB_PRIORITY, etc.
          Sanjay Radia added a comment -

          Hemanth wrote:
          > * An ACL can be specified as an (operation, userlist or grouplist) tuple for a given queue.
          > * The operations that seem relevant are:
          > o submit job
          > o administer jobs in a queue - including deleting any job or modifying any job's properties such as its priority
          > o listing jobs

          We need to add permissions:

          • to change queue properties
          • to change the acl list.

          The changeACL is a new operation - the users who are permitted this are effectively the queue owners.
          We could fold changing queue properties into the admin operation you listed above.

          Hemanth Yamijala added a comment -

          Thinking a bit more about this, I am wondering where the responsibility of managing ACLs should be. There seem to be two choices:

          • In the JobTracker: In this approach, the jobtracker manages queues which will have common configuration like names, ACLs (possibly the only two common ones for now). In addition, different schedulers could have specific properties for queues - for e.g. HADOOP-3445 has capacities, user limits etc. There should be a way to associate the queues defined in the job tracker with their properties in a scheduler.
          • In the scheduler: In this approach, we treat queues as being related to schedulers. The jobtracker would delegate the verification of ACLs to the scheduler instance it is configured with. This approach means that something like the verifyAccess API, along with the QueueOperation enum would be defined in the TaskScheduler interface. Since different schedulers may want to benefit from the same verification mechanism, we could build the verification logic as a separate utility class.

          I am thinking that the second option is better. Logically, queues do seem more closely related to schedulers. Comments?

          Owen O'Malley added a comment - edited

          I'd suggest keeping the implementation concrete until we have a more complete story for security. I'd suggest:

          • Make the class concrete rather than abstract & pluggable, until we have a better handle on what we'd need from a plugin.
          • Make it Configured, rather than recoding it.
          • Drop initialize. It can be done in a configure method, if necessary.
          • I think that verifyAccess should take a UGI, since it is a security check.
          • I assume this is only being used in the JobTracker, rather than in the client.
          class QueueAccessVerifier extends Configured {
          
            enum QueueOperation {
              SUBMIT_JOB,
              ADMINISTER_JOBS
              // TODO Add LIST_JOBS when we have a http security story
            }
          
            // verify if the specified operation is allowed on the specified queue
            public boolean verifyAccess(QueueOperation operation,
                                        String queue,
                                        UserGroupInformation user) throws IOException {..}
          }
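
          To make the flow concrete, here is a self-contained sketch of how submitJob might gate on such a verifier. The verifier and UserGroupInformation below are simplified stand-ins with the same shape as Owen's outline, not the actual patch code, and the superuser/supergroup bypass follows Sanjay's suggestion elsewhere in this thread:

```java
// Hypothetical sketch only: simplified stand-ins for the verifier and UGI,
// showing a job being rejected before it ever reaches the scheduler.
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class AccessControlException extends IOException {
  AccessControlException(String msg) { super(msg); }
}

// stand-in for Hadoop's UserGroupInformation
class UserGroupInformation {
  final String user;
  final Set<String> groups;
  UserGroupInformation(String user, Set<String> groups) {
    this.user = user;
    this.groups = groups;
  }
}

class QueueAccessVerifier {
  enum QueueOperation { SUBMIT_JOB, ADMINISTER_JOBS }

  private final String superUser;
  private final String superGroup;
  // (queue, operation) -> users/groups allowed
  private final Map<String, Set<String>> acls = new HashMap<>();

  QueueAccessVerifier(String superUser, String superGroup) {
    this.superUser = superUser;
    this.superGroup = superGroup;
  }

  void allow(String queue, QueueOperation op, String userOrGroup) {
    acls.computeIfAbsent(queue + ":" + op, k -> new HashSet<>()).add(userOrGroup);
  }

  boolean verifyAccess(QueueOperation op, String queue, UserGroupInformation ugi) {
    // sysadmins bypass the ACL entirely
    if (ugi.user.equals(superUser) || ugi.groups.contains(superGroup)) {
      return true;
    }
    Set<String> allowed = acls.getOrDefault(queue + ":" + op,
                                            Collections.emptySet());
    if (allowed.contains(ugi.user)) {
      return true;
    }
    for (String g : ugi.groups) {
      if (allowed.contains(g)) return true;
    }
    return false;
  }
}

class JobTrackerSketch {
  private final QueueAccessVerifier verifier;
  JobTrackerSketch(QueueAccessVerifier v) { this.verifier = v; }

  // submitJob rejects the job before it ever reaches the scheduler
  void submitJob(String queue, UserGroupInformation ugi) throws IOException {
    if (!verifier.verifyAccess(QueueAccessVerifier.QueueOperation.SUBMIT_JOB,
                               queue, ugi)) {
      throw new AccessControlException(
          "User " + ugi.user + " cannot submit to queue " + queue);
    }
    // ... hand the job to the scheduler ...
  }
}
```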
          
          Hemanth Yamijala added a comment -
          • We could have an implementation along the following lines:
          abstract class QueueAccessVerifier {
          
            enum QueueOperation {
              SUBMIT_JOB,
              LIST_JOBS,
              ADMINISTER_JOBS
              // ..
            }
          
            private Configuration conf;
          
            public void setConfiguration(Configuration conf) {
              this.conf = conf;
            }
          
            public Configuration getConfiguration() {
              return conf;
            }
          
            // basic initialization, for e.g. read up a conf file which has the list
            // of configured groups and users, and so on.
            public abstract void initialize();
          
            // verify if the specified operation is allowed on the specified queue
            public abstract boolean verifyAccess(QueueOperation operation, 
                                                  String queue);
          }
          
          • The class name of a concrete implementation of QueueAccessVerifier can be specified in Hadoop's configuration using a new variable, say hadoop.rm.accessverifier, and initialized by the JobTracker.
          • Following this, implementations of methods defined in JobSubmissionProtocol, such as submitJob and killJob, can use the instance of QueueAccessVerifier to check if the operation is allowed or not.
          • In order to get the currently logged in user and his/her groups, we can probably re-use the UserGroupInformation class that HDFS uses for permission checking. If we decide to follow that route, JobClient could set the UGI_PROPERTY_NAME just like DFSClient does, by doing a login.

          Please let me know if this approach seems fine.

          Hemanth Yamijala added a comment -

          Initial proposal:

          Firstly, as mentioned in http://issues.apache.org/jira/browse/HADOOP-3479?focusedCommentId=12602243#action_12602243, it possibly makes sense to make the access control mechanism something pluggable at a class design level.

          One concrete and simple mechanism is as follows:

          • An ACL can be specified as an (operation, userlist or grouplist) tuple for a given queue.
          • The operations that seem relevant are:
            • submit job
            • administer jobs in a queue - including deleting any job or modifying any job's properties such as its priority
            • listing jobs

          Currently, administering the queue itself (that is - operations such as adding / deleting or configuring a queue) is an admin only operation which is controlled by means of access to the queue configuration file defined in HADOOP-3479.

          Thoughts ?
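
          As an illustration of the (operation, userlist or grouplist) tuple, an ACL value could be encoded as comma-separated users, then a space, then comma-separated groups, with "*" meaning everyone. This is only a sketch; the encoding and the AclEntry name are assumptions, not part of the proposal:

```java
// Illustrative parser for an ACL value of the form "user1,user2 group1,group2".
// A single "*" means everyone. The format is an assumption for this sketch,
// not something specified in the discussion above.
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

class AclEntry {
  final Set<String> users = new HashSet<>();
  final Set<String> groups = new HashSet<>();
  final boolean allowAll;

  AclEntry(String aclString) {
    String s = aclString.trim();
    allowAll = s.equals("*");
    if (!allowAll && !s.isEmpty()) {
      // users come before the first whitespace, groups after it
      String[] parts = s.split("\\s+", 2);
      if (!parts[0].isEmpty()) {
        users.addAll(Arrays.asList(parts[0].split(",")));
      }
      if (parts.length > 1) {
        groups.addAll(Arrays.asList(parts[1].split(",")));
      }
    }
  }

  // allowed if the user is listed, or any of the user's groups is listed
  boolean isAllowed(String user, Collection<String> userGroups) {
    if (allowAll || users.contains(user)) {
      return true;
    }
    for (String g : userGroups) {
      if (groups.contains(g)) return true;
    }
    return false;
  }
}
```

          One such entry per (queue, operation) pair gives exactly the tuple model above, without the single-group restriction of the Posix-style file system ACLs.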

          Hemanth Yamijala made changes -
          Field Original Value New Value
          Link This issue is part of HADOOP-3444 [ HADOOP-3444 ]
          Hemanth Yamijala created issue -

            People

            • Assignee:
              Hemanth Yamijala
              Reporter:
              Hemanth Yamijala
            • Votes:
              0
              Watchers:
              9
