Hadoop YARN / YARN-5734

OrgQueue for easy CapacityScheduler queue configuration management


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0
    • Component/s: None
    • Labels: None
    • Release Note:
      The OrgQueue extension to the CapacityScheduler provides a REST API that users can call to modify queue configurations programmatically. This lets administrators listed in a queue's `administer_queue` ACL automate the management of that queue's configuration.
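      A minimal sketch of what a client call might look like, in Python. It assumes the `sched-conf` XML body format and the `PUT /ws/v1/cluster/scheduler-conf` endpoint described in the Hadoop CapacityScheduler documentation for API-based queue configuration; verify both against your Hadoop version. The helper names and the ResourceManager address are hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_update_queue_body(queue_path: str, params: dict) -> bytes:
    """Serialize one <update-queue> mutation as a sched-conf XML document."""
    root = ET.Element("sched-conf")
    update = ET.SubElement(root, "update-queue")
    ET.SubElement(update, "queue-name").text = queue_path
    params_el = ET.SubElement(update, "params")
    for key, value in params.items():
        # Each property becomes <entry><key>...</key><value>...</value></entry>.
        entry = ET.SubElement(params_el, "entry")
        ET.SubElement(entry, "key").text = key
        ET.SubElement(entry, "value").text = str(value)
    return ET.tostring(root, encoding="utf-8")


def build_scheduler_conf_request(rm_address: str, body: bytes) -> urllib.request.Request:
    """Prepare (but do not send) a PUT to the RM's scheduler-conf endpoint."""
    return urllib.request.Request(
        f"http://{rm_address}/ws/v1/cluster/scheduler-conf",
        data=body,
        headers={"Content-Type": "application/xml"},
        method="PUT",
    )


# Hypothetical RM host; 8088 is the default ResourceManager web port.
body = build_update_queue_body("root.default", {"capacity": 50, "maximum-capacity": 80})
request = build_scheduler_conf_request("rm.example.com:8088", body)
# urllib.request.urlopen(request) would send it; authentication is not shown.
```

      On a secured cluster the caller must also be authenticated, and the RM rejects the change unless that user passes the queue's ACL check.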

    Description

      The current XML-based configuration mechanism in CapacityScheduler makes it very inconvenient to apply changes to queue configurations. We saw two main drawbacks in the file-based mechanism:

      1. It is very inconvenient to automate queue configuration updates. For example, in our cluster setup we use the queue mapping feature from YARN-2411 to route users to their dedicated organization queues. Continually updating the config file to track the highly dynamic mapping between users and organizations is extremely cumbersome.
      2. Even if a user has admin permission on a specific queue, that user cannot make any configuration changes to it, such as resizing its subqueues, changing queue ACLs, or creating new queues. All of these operations must be performed centrally by the cluster administrators.
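      To illustrate the first drawback, the sketch below renders a user-to-organization table into a value for the `yarn.scheduler.capacity.queue-mappings` property, which takes comma-separated `u:<user>:<queue>` entries. Under the file-based mechanism, every change to that table means regenerating this value, editing `capacity-scheduler.xml`, and refreshing the queues (for example with `yarn rmadmin -refreshQueues`). The helper names, users, and queue names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

MAPPINGS_KEY = "yarn.scheduler.capacity.queue-mappings"


def build_queue_mappings(user_to_queue: Dict[str, str]) -> str:
    """Render user -> queue pairs as comma-separated u:<user>:<queue> entries."""
    for user, queue in user_to_queue.items():
        # ':' and ',' are separators in the mapping syntax, so reject them in names.
        if any(sep in s for s in (user, queue) for sep in ":,"):
            raise ValueError(f"invalid mapping {user!r} -> {queue!r}")
    # Sort by user so regenerated files diff cleanly between runs.
    return ",".join(f"u:{u}:{q}" for u, q in sorted(user_to_queue.items()))


def render_property(name: str, value: str) -> str:
    """Render a <property> element as it would appear in capacity-scheduler.xml."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Hypothetical users and organization queues.
orgs = {"bob": "org_ads", "alice": "org_search"}
print(render_property(MAPPINGS_KEY, build_queue_mappings(orgs)))
# prints <property><name>yarn.scheduler.capacity.queue-mappings</name><value>u:alice:org_search,u:bob:org_ads</value></property>
```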

      Given these limitations, we saw the need for a more flexible configuration mechanism that allows queue configurations to be stored and managed dynamically. We developed the feature internally at LinkedIn; it introduces the concept of a MutableConfigurationProvider, which exposes a set of configuration mutation APIs so that queue configurations can be updated externally through REST APIs. Queue ACLs are honored when configuration changes are applied, so only a queue's administrators can change its configuration. MutableConfigurationProvider is a pluggable interface, and our implementation of it is backed by an embedded Derby database.
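      A hedged sketch of the shape described above: a pluggable store interface behind a provider that checks the caller against the queue's administrators before persisting a change. All class names are hypothetical; the in-memory store stands in for the Derby-backed one, and treating an administrator of a parent queue as an administrator of its descendants is an assumption modeled on how CapacityScheduler queue ACLs are inherited. Keys follow the `yarn.scheduler.capacity.<queue-path>.<property>` convention.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Set

PREFIX = "yarn.scheduler.capacity."


class ConfigurationStore(ABC):
    """Hypothetical pluggable backing store (the original implementation used Derby)."""

    @abstractmethod
    def read(self) -> Dict[str, str]: ...

    @abstractmethod
    def write(self, updates: Dict[str, str]) -> None: ...


class InMemoryStore(ConfigurationStore):
    def __init__(self, initial: Optional[Dict[str, str]] = None) -> None:
        self._conf = dict(initial or {})

    def read(self) -> Dict[str, str]:
        return dict(self._conf)  # return a copy so callers cannot mutate the store

    def write(self, updates: Dict[str, str]) -> None:
        self._conf.update(updates)


class MutableConfProviderSketch:
    """Applies queue updates only for admins of that queue or one of its ancestors."""

    def __init__(self, store: ConfigurationStore, admins: Dict[str, Set[str]]) -> None:
        self._store = store
        self._admins = admins  # queue path -> users in its administer_queue ACL

    def _is_admin(self, user: str, queue_path: str) -> bool:
        parts = queue_path.split(".")
        # Check the queue itself, then each ancestor up to root.
        return any(user in self._admins.get(".".join(parts[:i]), set())
                   for i in range(len(parts), 0, -1))

    def update_queue(self, user: str, queue_path: str, params: Dict[str, object]) -> None:
        if not self._is_admin(user, queue_path):
            raise PermissionError(f"{user} may not administer {queue_path}")
        self._store.write({f"{PREFIX}{queue_path}.{k}": str(v) for k, v in params.items()})


store = InMemoryStore()
provider = MutableConfProviderSketch(store, {"root.org1": {"alice"}})
provider.update_queue("alice", "root.org1.team_a", {"capacity": 30})
```

      Keeping the store behind an interface is what lets the provider swap backends without touching the ACL logic.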

      This feature has been deployed on LinkedIn's Hadoop cluster for a year and has gone through several iterations of gathering user feedback and improving accordingly. With it, cluster administrators can automate many queue configuration management tasks, such as setting queue capacities to shift cluster resources between queues based on established resource consumption patterns, or updating the user-to-queue mappings. We have attached our design document to this ticket and would like feedback from the community on how best to integrate it with the latest version of YARN.

    Sub-Tasks

        1. Create YarnConfigurationStore interface and InMemoryConfigurationStore class (Sub-task, Resolved, Jonathan Hung)
        2. Create LeveldbConfigurationStore class using Leveldb as backing store (Sub-task, Resolved, Jonathan Hung)
        3. Implement MutableConfigurationManager for handling storage into configuration store (Sub-task, Resolved, Jonathan Hung)
        4. Add pluggable configuration ACL policy interface and implementation (Sub-task, Resolved, Jonathan Hung)
        5. Create StoreConfigurationProvider to construct a Configuration from the backing store (Sub-task, Resolved, Unassigned)
        6. Changes to allow CapacityScheduler to use configuration store (Sub-task, Resolved, Jonathan Hung)
        7. Create REST API for changing YARN scheduler configurations (Sub-task, Resolved, Jonathan Hung)
        8. Create CLI for changing YARN configurations (Sub-task, Resolved, Jonathan Hung)
        9. Implement a CapacityScheduler policy for configuration changes (Sub-task, Resolved, Jonathan Hung)
        10. Protocol for scheduler configuration changes between client and RM (Sub-task, Resolved, Jonathan Hung)
        11. Disable queue refresh when configuration mutation is enabled (Sub-task, Resolved, Jonathan Hung)
        12. Support global configuration mutation in MutableConfProvider (Sub-task, Resolved, Jonathan Hung)
        13. Support for adding and removing queue mappings (Sub-task, Patch Available, Jonathan Hung)
        14. Add ability to export scheduler configuration XML (Sub-task, Resolved, Jonathan Hung)
        15. Implement zookeeper based store for scheduler configuration updates (Sub-task, Resolved, Jonathan Hung)
        16. Fix issues on recovery in LevelDB store (Sub-task, Resolved, Jonathan Hung)
        17. Add closing logic to configuration store (Sub-task, Resolved, Jonathan Hung)
        18. Documentation for API based scheduler configuration management (Sub-task, Resolved, Jonathan Hung)
        19. Merge YARN-5734 to trunk/branch-2 (Sub-task, Resolved, Jonathan Hung)
        20. Misc changes to YARN-5734 (Sub-task, Resolved, Jonathan Hung)
        21. Removing queue then failing over results in exception (Sub-task, Resolved, Jonathan Hung)
        22. Merge YARN-5734 branch to trunk branch (Sub-task, Resolved, Xuan Gong)
        23. Merge YARN-5734 branch to branch-3.0 (Sub-task, Resolved, Xuan Gong)
        24. Merge YARN-5734 branch to branch-2 (Sub-task, Resolved, Xuan Gong)
        25. Add file system based scheduler configuration store (Sub-task, Resolved, Jiandan Yang)
        26. Queue Management API - no errors thrown for wrong properties (Sub-task, Open, Prabhu Joseph)
        27. Queue Management API - rephrase error messages (Sub-task, Resolved, Prabhu Joseph)
        28. Queue Management API - not returning JSON or XML response data when passing Accept header (Sub-task, Resolved, Akhil PB)
        29. Queue Management API - GET /scheduler-conf API returns 500 Internal Server Error (Sub-task, Open, Unassigned)
        30. RMWebServices /scheduler-conf GET returns all hadoop configurations for ZKConfigurationStore (Sub-task, Resolved, Prabhu Joseph)
        31. SchedulerConf Mutation API does not Allow Stop and Remove Queue in a single call (Sub-task, Resolved, Prabhu Joseph)
        32. SchedConfCli to get current stored scheduler configuration (Sub-task, Resolved, Prabhu Joseph)
        33. Queue Management API does not support parallel updates (Sub-task, Resolved, Prabhu Joseph)
        34. Disable Option for Write Ahead Logs of LogMutation (Sub-task, Resolved, Prabhu Joseph)
        35. Queue Mutation API does not allow to remove a config (Sub-task, Resolved, Prabhu Joseph)
        36. Document examples of SchedulerConf with Node Labels (Sub-task, Resolved, Prabhu Joseph)
        37. SchedConfCli does not work with https mode (Sub-task, Resolved, Prabhu Joseph)
        38. Format CS Configuration present in Configuration Store (Sub-task, Resolved, Prabhu Joseph)
        39. Mutation API Config Change need to update Version Number (Sub-task, Resolved, Prabhu Joseph)
        40. Remove unnecessary LevelDb write call in LeveldbConfigurationStore#confirmMutation (Sub-task, Resolved, Ashutosh Gupta; time spent: 40m)
        41. FSSchedulerConfigurationStore fails to update with hdfs path (Sub-task, Resolved, Prabhu Joseph; time spent: 50m)
        42. Revert to previous state when Invalid Config is applied and Refresh Support in SchedulerConfig Format (Sub-task, Closed, Prabhu Joseph)
        43. Offline format of YarnConfigurationStore (Sub-task, Resolved, Prabhu Joseph)
        44. Unset Ordering Policy of Leaf/Parent queue converted from Parent/Leaf queue respectively (Sub-task, Resolved, Prabhu Joseph)
        45. Allow stop and convert from leaf to parent queue in a single Mutation API call (Sub-task, Resolved, Prabhu Joseph)
        46. Periodically sync backend scheduler configuration changes to capacity-scheduler.xml (Sub-task, Open, Unassigned)
        47. Create RM Rest API to validate a CapacityScheduler Configuration (Sub-task, Resolved, Kinga Marton)
        48. ValidateAndGetSchedulerConfiguration API fails when cluster max allocation > default 8GB (Sub-task, Resolved, Prabhu Joseph)
        49. YARN RMWebServices /scheduler-conf/validate leaks ZK Connections (Sub-task, Resolved, Prabhu Joseph)
        50. Update scheduler-conf corrupts the CS configuration when removing queue which is referred in queue mapping (Sub-task, Resolved, Ashutosh Gupta; time spent: 50m)


          People

            Assignee: Min Shen
            Reporter: Min Shen
            Votes: 2
            Watchers: 43


          Time Tracking

            Estimated: Not Specified
            Remaining: 0h
            Logged: 2h 20m
