Hadoop YARN / YARN-149

ResourceManager (RM) High-Availability (HA)

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
    • Target Version/s:

      Description

      This JIRA tracks the work needed to support failover from one RM instance to another, so that the RM can be made highly available. The work includes leader election, transfer of control to the new leader, and client redirection to the new leader.
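
      The client-redirection part of the description can be illustrated with a minimal, self-contained sketch. It is not taken from the attached design documents or from the YARN code base: the class `RmFailoverSketch`, the method `callWithFailover`, the exception `StandbyRmException` (a stand-in for a "this RM is standby" error), and the RM ids `rm1`/`rm2` are all hypothetical names chosen for illustration. The idea shown is only that a client tries each configured RM in turn and moves on when it hits a standby instance.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class RmFailoverSketch {

    /** Hypothetical stand-in for the error a standby RM would return to a client. */
    public static class StandbyRmException extends RuntimeException {
        public StandbyRmException(String rmId) {
            super(rmId + " is in standby state");
        }
    }

    /**
     * Tries the call against each RM id in order and returns the first
     * successful result. A StandbyRmException makes the client move on to
     * the next RM id; any other exception propagates unchanged.
     */
    public static <T> T callWithFailover(List<String> rmIds, Function<String, T> call) {
        StandbyRmException last = null;
        for (String rmId : rmIds) {
            try {
                return call.apply(rmId);
            } catch (StandbyRmException e) {
                last = e; // this RM is standby, try the next one
            }
        }
        throw new IllegalStateException("no active RM among " + rmIds, last);
    }

    public static void main(String[] args) {
        // Assumption for the demo: rm2 currently holds the active role.
        final String active = "rm2";
        String result = callWithFailover(Arrays.asList("rm1", "rm2"), rmId -> {
            if (!rmId.equals(active)) {
                throw new StandbyRmException(rmId);
            }
            return "submitted via " + rmId;
        });
        System.out.println(result);
    }
}
```

      In this toy run the call against `rm1` is rejected as standby and the client succeeds against `rm2`. A real client would also need retry limits, back-off, and handling of connection failures, which the sketch leaves out.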

      1. rm-ha-phase1-approach-draft1.pdf
        165 kB
        Karthik Kambatla
      2. rm-ha-phase1-draft2.pdf
        170 kB
        Karthik Kambatla
      3. YARN ResourceManager Automatic Failover-rev-07-21-13.pdf
        207 kB
        Bikas Saha
      4. YARN ResourceManager Automatic Failover-rev-08-04-13.pdf
        207 kB
        Bikas Saha

        Issue Links

        1. Separate out RM services into "Always On" and "Active" | Sub-task | Closed | Karthik Kambatla
        2. Implement RMHAProtocolService | Sub-task | Closed | Karthik Kambatla
        3. Test and verify ACL-based ZKRMStateStore fencing for RM State Store | Sub-task | Resolved | Karthik Kambatla
        4. Add FailoverProxyProvider-like capability to RMProxy | Sub-task | Closed | Karthik Kambatla
        5. Allow embedding leader election into the RM | Sub-task | Closed | Karthik Kambatla
        6. Expose RM active/standby state to Web UI and REST API | Sub-task | Closed | Karthik Kambatla
        7. Add admin support for HA operations | Sub-task | Closed | Karthik Kambatla
        8. Revisit exception handling in ZKRMStateStore post RM HA | Sub-task | Resolved | Unassigned
        9. Add shutdown support to non-service RM components | Sub-task | Open | Xuan Gong
        10. Support automatic failover using ZKFC | Sub-task | Open | Karthik Kambatla
        11. Add end-to-end tests for HA | Sub-task | Open | Xuan Gong
        12. Move init() of activeServices to ResourceManager#serviceStart() | Sub-task | Resolved | Karthik Kambatla
        13. Augment MiniYARNCluster to support HA mode | Sub-task | Closed | Karthik Kambatla
        14. Update HAServiceState to STOPPING on RM#stop() | Sub-task | Resolved | Karthik Kambatla
        15. ResourceManager.clusterTimeStamp should be reset when RM transitions to active | Sub-task | Resolved | Unassigned
        16. Verify RM HA works in secure clusters | Sub-task | Resolved | Unassigned
        17. Make improvements in ZKRMStateStore for fencing | Sub-task | Closed | Karthik Kambatla
        18. RM DT token service should have service addresses of both RMs | Sub-task | Closed | Karthik Kambatla
        19. Configuration to support multiple RMs | Sub-task | Closed | Karthik Kambatla
        20. RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException | Sub-task | Closed | Tsuyoshi OZAWA
        21. Promote AdminService to an Always-On service and merge in RMHAProtocolService | Sub-task | Closed | Karthik Kambatla
        22. Set HTTPS webapp address along with other RPC addresses in HAUtil | Sub-task | Closed | Karthik Kambatla
        23. Enabling HA should check that the Configuration contains multiple RMs | Sub-task | Closed | Xuan Gong
        24. RM should log using RMStore at startup time | Sub-task | Open | Tsuyoshi OZAWA | Progress: 0% | Original Estimate: 3h | Remaining Estimate: 3h
        25. Handle RM failover after getApplicationID() and before submitApplication(). | Sub-task | Closed | Xuan Gong | Progress: 0% | Original Estimate: 48h | Remaining Estimate: 48h
        26. HA config shouldn't affect NodeManager RPC addresses | Sub-task | Closed | Karthik Kambatla
        27. RM services should depend on ConfigurationProvider during startup too | Sub-task | Closed | Xuan Gong
        28. Move internal services logic from AdminService to ResourceManager | Sub-task | Closed | Vinod Kumar Vavilapalli
        29. WebApplicationProxy should be always-on w.r.t HA even if it is embedded in the RM | Sub-task | Closed | Xuan Gong
        30. Enabling HA should verify that the RM service address configurations have been set for every RM ID defined in RM_HA_IDs | Sub-task | Closed | Xuan Gong
        31. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA | Sub-task | Patch Available | Tsuyoshi OZAWA
        32. Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation | Sub-task | Closed | Xuan Gong
        33. Use StandbyException instead of RMNotYetReadyException | Sub-task | Closed | Karthik Kambatla
        34. Web UI should redirect to active RM when HA is enabled. | Sub-task | Closed | Cindy Li
        35. Add an option to yarn rmadmin to clear the znode used by embedded elector | Sub-task | Open | Karthik Kambatla
        36. Add an easy way to turn on HA | Sub-task | Resolved | Karthik Kambatla
        37. Race between ServerRMProxy and ClientRMProxy setting RMProxy#INSTANCE | Sub-task | Closed | Karthik Kambatla
        38. ZK store should use a private password for root-node-acls | Sub-task | Closed | Karthik Kambatla
        39. RMDispatcher should be reset on transition to standby | Sub-task | Closed | Xuan Gong
        40. ActiveRMInfoProto fields should be optional | Sub-task | Closed | Karthik Kambatla
        41. Support explicit failover when automatic failover is enabled | Sub-task | Open | Karthik Kambatla
        42. HA-related rmadmin commands don't work on a secure cluster | Sub-task | Closed | Karthik Kambatla
        43. Make admin refresh of capacity scheduler configuration work across RM failover | Sub-task | Closed | Xuan Gong
        44. YARN RM HA requires different configs on different RM hosts | Sub-task | Closed | Xuan Gong
        45. Manual Failover does not work in secure clusters | Sub-task | Closed | Xuan Gong
        46. ZK store should attempt a write periodically to ensure it is still Active | Sub-task | Closed | Karthik Kambatla
        47. RMDTRenewer#getRMClient should use ClientRMProxy | Sub-task | Closed | Karthik Kambatla
        48. Webservice should redirect to active RM when HA is enabled. | Sub-task | Closed | Cindy Li
        49. Add the ability to set yarn.resourcemanager.hostname.rm-id instead of setting all the various host:port properties for RM | Sub-task | Closed | Xuan Gong
        50. Set better defaults for HA configs for automatic failover | Sub-task | Closed | Xuan Gong
        51. Make admin refreshNodes work across RM failover | Sub-task | Closed | Xuan Gong
        52. Make admin refreshSuperUserGroupsConfiguration work across RM failover | Sub-task | Closed | Xuan Gong
        53. Make admin refreshAdminAcls work across RM failover | Sub-task | Closed | Xuan Gong
        54. Make admin refreshServiceAcls work across RM failover | Sub-task | Closed | Xuan Gong
        55. Make admin refreshUserToGroupsMappings of configuration work across RM failover | Sub-task | Closed | Xuan Gong
        56. Make admin refresh of Fair scheduler configuration work across RM failover | Sub-task | Open | Xuan Gong
        57. Clean up YARN HAUtil class | Sub-task | Open | Vinod Kumar Vavilapalli
        58. Document RM HA | Sub-task | Closed | Tsuyoshi OZAWA
        59. Reset cluster-metrics on transition to standby | Sub-task | Resolved | Rohith
        60. RM should get the updated Configurations when it transitions from Standby to Active | Sub-task | Closed | Xuan Gong
        61. RMAdminCLI should check whether HA is enabled before executing transitionToActive/transitionToStandby | Sub-task | Closed | Xuan Gong
        62. Handle RM failovers during the submitApplication call. | Sub-task | Resolved | Xuan Gong
        63. Handle RM failovers after the submitApplication call. | Sub-task | Closed | Xuan Gong
        64. Write test cases to verify that killApplication API works in RM HA | Sub-task | Closed | Xuan Gong
        65. When the RM initializes, it should use the loaded Configuration instead of the bootstrap configuration. | Sub-task | Closed | Xuan Gong
        66. renewDelegationToken should survive RM failover | Sub-task | Closed | Zhijie Shen
        67. Handle AMRMTokens across RM failover | Sub-task | Open | Unassigned
        68. RM HA: AM link broken if the AM is on nodes other than RM | Sub-task | Closed | Robert Kanter
        69. Add retry cache support in ResourceManager | Sub-task | Open | Tsuyoshi OZAWA
        70. Persist ClusterMetrics across RM HA transitions | Sub-task | Open | Unassigned
        71. cancelDelegationToken should survive RM failover | Sub-task | Open | Zhijie Shen
        72. Both RMs stuck in standby mode when automatic failover is enabled | Sub-task | Closed | Karthik Kambatla
        73. Document yarn.resourcemanager.zk-auth and its scope | Sub-task | Resolved | Robert Kanter
        74. Implement and verify that Scheduler#moveApplication() is idempotent for CapacityScheduler/FairScheduler | Sub-task | Open | Xuan Gong
        75. Make ApplicationMasterProtocol#allocate AtMostOnce | Sub-task | Closed | Xuan Gong
        76. Add testcases to test AMRMToken on HA | Sub-task | Resolved | Xuan Gong
        77. Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are redirecting to Active RM | Sub-task | Closed | Xuan Gong
        78. Yarn standby RM taking long to transition to active | Sub-task | Patch Available | Xuan Gong
        79. Aggregation of MR job logs failing when ResourceManager switches | Sub-task | Resolved | Wangda Tan
        80. NM-Local dir cleanup failing when ResourceManager switches | Sub-task | Open | Unassigned
         

          Activity

          No work has yet been logged on this issue.

            People

            • Assignee: Unassigned
            • Reporter: Harsh J
            • Votes: 3
            • Watchers: 78

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated: 51h
                Remaining: 51h
                Logged: Not Specified
