Hadoop YARN / YARN-149

[Umbrella] ResourceManager (RM) Fail-over

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
    • Target Version/s:

      Description

      This JIRA tracks the work needed to support failing over from one RM instance to another, so that the RM can be made highly available (HA). The work includes leader election, transfer of control to the leader, and client redirection to the new leader.
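
      As a rough illustration of the client re-direction part only, the sketch below cycles through a list of configured RM ids until it finds one that reports itself active. This is not the actual RMProxy or FailoverProxyProvider code: the class name `RmFailoverSketch`, the method `findActive`, and the ids `rm1`/`rm2` are all hypothetical, and the "is this RM active?" check is passed in as a plain predicate rather than a real RPC.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: pick the active RM by trying each configured RM id in turn. */
public class RmFailoverSketch {

    /**
     * Starting at startIndex and wrapping around, returns the first RM id for which
     * isActive holds, or null if no RM in the list is active.
     */
    public static String findActive(List<String> rmIds, int startIndex, Predicate<String> isActive) {
        for (int i = 0; i < rmIds.size(); i++) {
            String candidate = rmIds.get((startIndex + i) % rmIds.size());
            if (isActive.test(candidate)) {
                return candidate;
            }
        }
        return null; // every RM is standby or unreachable
    }

    public static void main(String[] args) {
        List<String> rmIds = List.of("rm1", "rm2");
        // Assumption for the demo: rm1 has failed and rm2 has become the leader.
        String active = findActive(rmIds, 0, id -> id.equals("rm2"));
        System.out.println("active RM: " + active);
    }
}
```

      A real client would also need retry limits and back-off between passes over the list; those are left out here to keep the sketch short.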

        Attachments

        1. YARN ResourceManager Automatic Failover-rev-08-04-13.pdf
          207 kB
          Bikas Saha
        2. YARN ResourceManager Automatic Failover-rev-07-21-13.pdf
          207 kB
          Bikas Saha
        3. rm-ha-phase1-draft2.pdf
          170 kB
          Karthik Kambatla
        4. rm-ha-phase1-approach-draft1.pdf
          165 kB
          Karthik Kambatla

          Issue Links

          1.
          Separate out RM services into "Always On" and "Active" Sub-task Closed Karthik Kambatla  
          2.
          Implement RMHAProtocolService Sub-task Closed Karthik Kambatla  
          3.
          Test and verify ACL based ZKRMStateStore fencing for RM State Store Sub-task Resolved Karthik Kambatla  
          4.
          Add FailoverProxyProvider like capability to RMProxy Sub-task Closed Karthik Kambatla  
          5.
          Allow embedding leader election into the RM Sub-task Closed Karthik Kambatla  
          6.
          Expose RM active/standby state to Web UI and REST API Sub-task Closed Karthik Kambatla  
          7.
          Add admin support for HA operations Sub-task Closed Karthik Kambatla  
          8.
          Revisit exception handling in ZKRMStateStore post RM HA Sub-task Resolved Unassigned  
          9.
          Add shutdown support to non-service RM components Sub-task Open Xuan Gong  
          10.
          Support automatic failover using ZKFC Sub-task Open Unassigned  
          11.
          Add end-to-end tests for HA Sub-task Open Xuan Gong  
          12.
          Move init() of activeServices to ResourceManager#serviceStart() Sub-task Resolved Karthik Kambatla  
          13.
          Augment MiniYARNCluster to support HA mode Sub-task Closed Karthik Kambatla  
          14.
          Update HAServiceState to STOPPING on RM#stop() Sub-task Resolved Karthik Kambatla  
          15.
          ResourceManager.clusterTimeStamp should be reset when RM transitions to active Sub-task Resolved Unassigned
          16.
          Verify RM HA works in secure clusters Sub-task Resolved Unassigned  
          17.
          Make improvements in ZKRMStateStore for fencing Sub-task Closed Karthik Kambatla  
          18.
          RM DT token service should have service addresses of both RMs Sub-task Closed Karthik Kambatla  
          19.
          Configuration to support multiple RMs Sub-task Closed Karthik Kambatla  
          20.
          RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException Sub-task Closed Tsuyoshi Ozawa  
          21.
          Promote AdminService to an Always-On service and merge in RMHAProtocolService Sub-task Closed Karthik Kambatla  
          22.
          Set HTTPS webapp address along with other RPC addresses in HAUtil Sub-task Closed Karthik Kambatla  
          23.
          Enabling HA should check Configuration contains multiple RMs Sub-task Closed Xuan Gong  
          24.
          RM should log using RMStore at startup time Sub-task Closed Tsuyoshi Ozawa

          0%

          Original Estimate - 3h
          Remaining Estimate - 3h
          25.
          Handle RM fails over after getApplicationID() and before submitApplication(). Sub-task Closed Xuan Gong

          0%

          Original Estimate - 48h
          Remaining Estimate - 48h
          26.
          HA config shouldn't affect NodeManager RPC addresses Sub-task Closed Karthik Kambatla  
          27.
          RM services should depend on ConfigurationProvider during startup too Sub-task Closed Xuan Gong  
          28.
          Move internal services logic from AdminService to ResourceManager Sub-task Closed Vinod Kumar Vavilapalli  
          29.
          WebApplicationProxy should be always-on w.r.t HA even if it is embedded in the RM Sub-task Closed Xuan Gong  
          30.
          Enabling HA should verify the RM service address configurations have been set for every RM ID defined in RM_HA_IDs Sub-task Closed Xuan Gong
          31.
          Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Sub-task Closed Tsuyoshi Ozawa  
          32.
          Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation Sub-task Closed Xuan Gong  
          33.
          Use StandbyException instead of RMNotYetReadyException Sub-task Closed Karthik Kambatla  
          34.
          Web UI should redirect to active RM when HA is enabled. Sub-task Closed Cindy Li  
          35.
          Add an option to yarn rmadmin to clear the znode used by embedded elector Sub-task Open Unassigned  
          36.
          Add an easy way to turn on HA Sub-task Resolved Karthik Kambatla  
          37.
          Race between ServerRMProxy and ClientRMProxy setting RMProxy#INSTANCE Sub-task Closed Karthik Kambatla  
          38.
          ZK store should use a private password for root-node-acls Sub-task Closed Karthik Kambatla  
          39.
          RMDispatcher should be reset on transition to standby Sub-task Closed Xuan Gong  
          40.
          ActiveRMInfoProto fields should be optional Sub-task Closed Karthik Kambatla  
          41.
          Support explicit failover when automatic failover is enabled Sub-task Resolved Karthik Kambatla  
          42.
          HA-related rmadmin commands don't work on a secure cluster Sub-task Closed Karthik Kambatla  
          43.
          Make admin refresh of capacity scheduler configuration work across RM failover Sub-task Closed Xuan Gong  
          44.
          YARN RM HA requires different configs on different RM hosts Sub-task Closed Xuan Gong
          45.
          Manual Failover does not work in secure clusters Sub-task Closed Xuan Gong  
          46.
          ZK store should attempt a write periodically to ensure it is still Active Sub-task Closed Karthik Kambatla  
          47.
          RMDTRenewer#getRMClient should use ClientRMProxy Sub-task Closed Karthik Kambatla  
          48.
          Webservice should redirect to active RM when HA is enabled. Sub-task Closed Cindy Li  
          49.
          add the ability to set yarn.resourcemanager.hostname.rm-id instead of setting all the various host:port properties for RM Sub-task Closed Xuan Gong  
          50.
          Set better defaults for HA configs for automatic failover Sub-task Closed Xuan Gong  
          51.
          Make admin refreshNodes work across RM failover Sub-task Closed Xuan Gong  
          52.
          Make admin refreshSuperUserGroupsConfiguration work across RM failover Sub-task Closed Xuan Gong  
          53.
          Make admin refreshAdminAcls work across RM failover Sub-task Closed Xuan Gong  
          54.
          Make admin refreshServiceAcls work across RM failover Sub-task Closed Xuan Gong  
          55.
          Make admin refreshUserToGroupsMappings of configuration work across RM failover Sub-task Closed Xuan Gong  
          56.
          Make admin refresh of Fair scheduler configuration work across RM failover Sub-task Open Xuan Gong  
          57.
          Cleanup YARN HAUtil class Sub-task Open Vinod Kumar Vavilapalli  
          58.
          Document RM HA Sub-task Closed Tsuyoshi Ozawa  
          59.
          Reset cluster-metrics on transition to standby Sub-task Closed Rohith Sharma K S  
          60.
          RM should get the updated Configurations when it transits from Standby to Active Sub-task Closed Xuan Gong  
          61.
          RMAdminCLI should check whether HA is enabled before executing transitionToActive/transitionToStandby Sub-task Closed Xuan Gong
          62.
          Handle RM failovers during the submitApplication call. Sub-task Resolved Xuan Gong  
          63.
          Handle RM fail overs after the submitApplication call. Sub-task Closed Xuan Gong  
          64.
          Write test cases to verify that killApplication API works in RM HA Sub-task Closed Xuan Gong  
          65.
          When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration. Sub-task Closed Xuan Gong  
          66.
          renewDelegationToken should survive RM failover Sub-task Closed Zhijie Shen  
          67.
          Handle AMRMTokens across RM failover Sub-task Closed Jian He  
          68.
          RM HA: AM link broken if the AM is on nodes other than RM Sub-task Closed Robert Kanter  
          69.
          Add retry cache support in ResourceManager Sub-task Resolved Tsuyoshi Ozawa  
          70.
          Persist ClusterMetrics across RM HA transitions Sub-task Open Unassigned  
          71.
          cancelDelegationToken should survive RM failover Sub-task Open Zhijie Shen  
          72.
          Both RM stuck in standby mode when automatic failover is enabled Sub-task Closed Karthik Kambatla  
          73.
          Document yarn.resourcemanager.zk-auth and its scope Sub-task Closed Robert Kanter  
          74.
          Implement and verify Scheduler#moveApplication() idempotent for CapacityScheduler/FairScheduler Sub-task Open Xuan Gong  
          75.
          Make ApplicationMasterProtocol#allocate AtMostOnce Sub-task Closed Xuan Gong  
          76.
          Add testcases to test AMRMToken on HA Sub-task Resolved Xuan Gong  
          77.
          Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are redirecting to Active RM Sub-task Closed Xuan Gong  
          78.
          Yarn standby RM taking long to transition to active Sub-task Open Xuan Gong  
          79.
          Aggregation of MR job logs failing when Resourcemanager switches Sub-task Resolved Wangda Tan  
          80.
          NM-Local dir cleanup failing when Resourcemanager switches Sub-task Open Unassigned  
          81.
          Option "--forceactive" does not work as described in usage of "yarn rmadmin -transitionToActive" Sub-task Closed Masatake Iwasaki
          82.
          [RM HA] Rest api endpoints doing redirect incorrectly Sub-task Closed Xuan Gong  
          83.
          Improve the error message when attempting manual failover with auto-failover enabled Sub-task Closed Akira Ajisaka  
          84.
          forcemanual transitionToStandby in RM-HA automatic-failover mode should change elector state Sub-task Open Masatake Iwasaki  
          85.
          Documentation of ResourceManager HA should explain configurations about listen addresses Sub-task Closed Masatake Iwasaki  
          86.
          Both RMs in active state when Admin#transitionToActive fails in refreshAll() Sub-task Closed Bibin A Chundatt
          87.
          RM HA UI redirection needs to be fixed when both RMs are in standby mode Sub-task Closed Xuan Gong  
          88.
          RM should print alert messages if ZooKeeper and ResourceManager have connection issues Sub-task Closed Xuan Gong
          89.
          Both RMs become Active if all ZooKeepers cannot connect to the active RM Sub-task Resolved Xuan Gong
          90.
          Add retry on establishing ZooKeeper connection in EmbeddedElectorService#serviceInit Sub-task Resolved Xuan Gong

              People

              • Assignee:
                Unassigned
                Reporter:
                qwertymaniac Harsh J
              • Votes:
                4
                Watchers:
                84

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  51h
                  Remaining:
                  51h
                  Logged:
                  Not Specified