Hadoop YARN / YARN-149

[Umbrella] ResourceManager (RM) Fail-over


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
    • Target Version/s:

      Description

      This JIRA tracks the work needed to support failing over from one RM instance to another, so that the RM can be highly available (HA). The work includes leader election, transfer of control to the leader, and client redirection to the new leader.
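
      The three pieces named above (leader election, transfer of control, client redirection) can be illustrated with a toy model. This is only a sketch for discussion, not the design in the attached documents: `ToyRM`, `StandbyError`, `elect_leader` and `submit_with_failover` are hypothetical names, and the election rule (first live RM in list order wins) is an assumption standing in for a real ZooKeeper-based elector.

```python
from dataclasses import dataclass
from typing import List, Optional


class StandbyError(Exception):
    """Hypothetical error raised by a toy RM that is not currently active."""


@dataclass
class ToyRM:
    """Hypothetical stand-in for one ResourceManager instance."""
    rm_id: str
    active: bool = False
    alive: bool = True

    def submit(self, app: str) -> str:
        # A dead RM cannot be reached; a standby RM refuses client requests.
        if not self.alive:
            raise ConnectionError(f"{self.rm_id} is unreachable")
        if not self.active:
            raise StandbyError(f"{self.rm_id} is standby")
        return f"{app} accepted by {self.rm_id}"


def elect_leader(rms: List[ToyRM]) -> Optional[ToyRM]:
    """Toy leader election (assumed rule): the first live RM becomes active."""
    for rm in rms:
        rm.active = False  # transfer of control: everyone steps down first
    for rm in rms:
        if rm.alive:
            rm.active = True
            return rm
    return None


def submit_with_failover(rms: List[ToyRM], app: str) -> str:
    """Client re-direction: try each configured RM until the active one answers."""
    for rm in rms:
        try:
            return rm.submit(app)
        except (StandbyError, ConnectionError):
            continue  # move on to the next RM in the configured list
    raise RuntimeError("no active RM reachable")
```

      For example, with two toy RMs `rm1` and `rm2`, marking `rm1` as not alive and re-running `elect_leader` makes `rm2` active, and `submit_with_failover` then skips `rm1` and reaches `rm2` without the caller changing anything.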

        Attachments

        1. rm-ha-phase1-approach-draft1.pdf
          165 kB
          Karthik Kambatla
        2. rm-ha-phase1-draft2.pdf
          170 kB
          Karthik Kambatla
        3. YARN ResourceManager Automatic Failover-rev-07-21-13.pdf
          207 kB
          Bikas Saha
        4. YARN ResourceManager Automatic Failover-rev-08-04-13.pdf
          207 kB
          Bikas Saha

        Issue Links

        1. Separate out RM services into "Always On" and "Active" (Sub-task, Closed, Karthik Kambatla)
        2. Implement RMHAProtocolService (Sub-task, Closed, Karthik Kambatla)
        3. Test and verify ACL based ZKRMStateStore fencing for RM State Store (Sub-task, Resolved, Karthik Kambatla)
        4. Add FailoverProxyProvider like capability to RMProxy (Sub-task, Closed, Karthik Kambatla)
        5. Allow embedding leader election into the RM (Sub-task, Closed, Karthik Kambatla)
        6. Expose RM active/standby state to Web UI and REST API (Sub-task, Closed, Karthik Kambatla)
        7. Add admin support for HA operations (Sub-task, Closed, Karthik Kambatla)
        8. Revisit exception handling in ZKRMStateStore post RM HA (Sub-task, Resolved, Unassigned)
        9. Add shutdown support to non-service RM components (Sub-task, Open, Xuan Gong)
        10. Support automatic failover using ZKFC (Sub-task, Open, Unassigned)
        11. Add end-to-end tests for HA (Sub-task, Open, Xuan Gong)
        12. Move init() of activeServices to ResourceManager#serviceStart() (Sub-task, Resolved, Karthik Kambatla)
        13. Augment MiniYARNCluster to support HA mode (Sub-task, Closed, Karthik Kambatla)
        14. Update HAServiceState to STOPPING on RM#stop() (Sub-task, Resolved, Karthik Kambatla)
        15. ResourceManager.clusterTimeStamp should be reset when RM transitions to active (Sub-task, Resolved, Unassigned)
        16. Verify RM HA works in secure clusters (Sub-task, Resolved, Unassigned)
        17. Make improvements in ZKRMStateStore for fencing (Sub-task, Closed, Karthik Kambatla)
        18. RM DT token service should have service addresses of both RMs (Sub-task, Closed, Karthik Kambatla)
        19. Configuration to support multiple RMs (Sub-task, Closed, Karthik Kambatla)
        20. RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException (Sub-task, Closed, Tsuyoshi Ozawa)
        21. Promote AdminService to an Always-On service and merge in RMHAProtocolService (Sub-task, Closed, Karthik Kambatla)
        22. Set HTTPS webapp address along with other RPC addresses in HAUtil (Sub-task, Closed, Karthik Kambatla)
        23. Enabling HA should check Configuration contains multiple RMs (Sub-task, Closed, Xuan Gong)
        24. RM should log using RMStore at startup time (Sub-task, Closed, Tsuyoshi Ozawa; original estimate 3h, remaining estimate 3h)
        25. Handle RM fails over after getApplicationID() and before submitApplication(). (Sub-task, Closed, Xuan Gong; original estimate 48h, remaining estimate 48h)
        26. HA config shouldn't affect NodeManager RPC addresses (Sub-task, Closed, Karthik Kambatla)
        27. RM services should depend on ConfigurationProvider during startup too (Sub-task, Closed, Xuan Gong)
        28. Move internal services logic from AdminService to ResourceManager (Sub-task, Closed, Vinod Kumar Vavilapalli)
        29. WebApplicationProxy should be always-on w.r.t HA even if it is embedded in the RM (Sub-task, Closed, Xuan Gong)
        30. Enabling HA should verify the RM service addresses configurations have been set for every RM Ids defined in RM_HA_IDs (Sub-task, Closed, Xuan Gong)
        31. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA (Sub-task, Closed, Tsuyoshi Ozawa)
        32. Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation (Sub-task, Closed, Xuan Gong)
        33. Use StandbyException instead of RMNotYetReadyException (Sub-task, Closed, Karthik Kambatla)
        34. Web UI should redirect to active RM when HA is enabled. (Sub-task, Closed, Cindy Li)
        35. Add an option to yarn rmadmin to clear the znode used by embedded elector (Sub-task, Open, Unassigned)
        36. Add an easy way to turn on HA (Sub-task, Resolved, Karthik Kambatla)
        37. Race between ServerRMProxy and ClientRMProxy setting RMProxy#INSTANCE (Sub-task, Closed, Karthik Kambatla)
        38. ZK store should use a private password for root-node-acls (Sub-task, Closed, Karthik Kambatla)
        39. RMDispatcher should be reset on transition to standby (Sub-task, Closed, Xuan Gong)
        40. ActiveRMInfoProto fields should be optional (Sub-task, Closed, Karthik Kambatla)
        41. Support explicit failover when automatic failover is enabled (Sub-task, Resolved, Karthik Kambatla)
        42. HA-related rmadmin commands don't work on a secure cluster (Sub-task, Closed, Karthik Kambatla)
        43. Make admin refresh of capacity scheduler configuration work across RM failover (Sub-task, Closed, Xuan Gong)
        44. YARN RM HA requires different configs on different RM hosts (Sub-task, Closed, Xuan Gong)
        45. Manual Failover does not work in secure clusters (Sub-task, Closed, Xuan Gong)
        46. ZK store should attempt a write periodically to ensure it is still Active (Sub-task, Closed, Karthik Kambatla)
        47. RMDTRenewer#getRMClient should use ClientRMProxy (Sub-task, Closed, Karthik Kambatla)
        48. Webservice should redirect to active RM when HA is enabled. (Sub-task, Closed, Cindy Li)
        49. add the ability to set yarn.resourcemanager.hostname.rm-id instead of setting all the various host:port properties for RM (Sub-task, Closed, Xuan Gong)
        50. Set better defaults for HA configs for automatic failover (Sub-task, Closed, Xuan Gong)
        51. Make admin refreshNodes work across RM failover (Sub-task, Closed, Xuan Gong)
        52. Make admin refreshSuperUserGroupsConfiguration work across RM failover (Sub-task, Closed, Xuan Gong)
        53. Make admin refreshAdminAcls work across RM failover (Sub-task, Closed, Xuan Gong)
        54. Make admin refreshServiceAcls work across RM failover (Sub-task, Closed, Xuan Gong)
        55. Make admin refreshUserToGroupsMappings of configuration work across RM failover (Sub-task, Closed, Xuan Gong)
        56. Make admin refresh of Fair scheduler configuration work across RM failover (Sub-task, Open, Xuan Gong)
        57. Cleanup YARN HAUtil class (Sub-task, Open, Vinod Kumar Vavilapalli)
        58. Document RM HA (Sub-task, Closed, Tsuyoshi Ozawa)
        59. Reset cluster-metrics on transition to standby (Sub-task, Closed, Rohith Sharma K S)
        60. RM should get the updated Configurations when it transits from Standby to Active (Sub-task, Closed, Xuan Gong)
        61. RMAdminCLI should check whether HA is enabled before executes transitionToActive/transitionToStandby (Sub-task, Closed, Xuan Gong)
        62. Handle RM failovers during the submitApplication call. (Sub-task, Resolved, Xuan Gong)
        63. Handle RM fail overs after the submitApplication call. (Sub-task, Closed, Xuan Gong)
        64. Write test cases to verify that killApplication API works in RM HA (Sub-task, Closed, Xuan Gong)
        65. When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration. (Sub-task, Closed, Xuan Gong)
        66. renewDelegationToken should survive RM failover (Sub-task, Closed, Zhijie Shen)
        67. Handle AMRMTokens across RM failover (Sub-task, Closed, Jian He)
        68. RM HA: AM link broken if the AM is on nodes other than RM (Sub-task, Closed, Robert Kanter)
        69. Add retry cache support in ResourceManager (Sub-task, Resolved, Tsuyoshi Ozawa)
        70. Persist ClusterMetrics across RM HA transitions (Sub-task, Open, Unassigned)
        71. cancelDelegationToken should survive RM failover (Sub-task, Open, Zhijie Shen)
        72. Both RM stuck in standby mode when automatic failover is enabled (Sub-task, Closed, Karthik Kambatla)
        73. Document yarn.resourcemanager.zk-auth and its scope (Sub-task, Closed, Robert Kanter)
        74. Implement and verify Scheduler#moveApplication() idempotent for CapacityScheduler/FairScheduler (Sub-task, Open, Xuan Gong)
        75. Make ApplicationMasterProtocol#allocate AtMostOnce (Sub-task, Closed, Xuan Gong)
        76. Add testcases to test AMRMToken on HA (Sub-task, Resolved, Xuan Gong)
        77. Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are redirecting to Active RM (Sub-task, Closed, Xuan Gong)
        78. Yarn standby RM taking long to transition to active (Sub-task, Open, Xuan Gong)
        79. Aggregation of MR job logs failing when Resourcemanager switches (Sub-task, Resolved, Wangda Tan)
        80. NM-Local dir cleanup failing when Resourcemanager switches (Sub-task, Open, Unassigned)
        81. Option "--forceactive" not works as described in usage of "yarn rmadmin -transitionToActive" (Sub-task, Closed, Masatake Iwasaki)
        82. [RM HA] Rest api endpoints doing redirect incorrectly (Sub-task, Closed, Xuan Gong)
        83. Improve the error message when attempting manual failover with auto-failover enabled (Sub-task, Closed, Akira Ajisaka)
        84. forcemanual transitionToStandby in RM-HA automatic-failover mode should change elector state (Sub-task, Open, Masatake Iwasaki)
        85. Documentation of ResourceManager HA should explain configurations about listen addresses (Sub-task, Closed, Masatake Iwasaki)
        86. Both RM in active state when Admin#transitionToActive failure from refreshAll() (Sub-task, Closed, Bibin Chundatt)
        87. RM HA UI redirection needs to be fixed when both RMs are in standby mode (Sub-task, Closed, Xuan Gong)
        88. RM should print alert messages if Zookeeper and Resourcemanager gets connection issue (Sub-task, Closed, Xuan Gong)
        89. Both RM becomes Active if all zookeepers can not connect to active RM (Sub-task, Resolved, Xuan Gong)
        90. Add retry on establishing Zookeeper connection in EmbeddedElectorService#serviceInit (Sub-task, Resolved, Xuan Gong)

            People

            • Assignee: Unassigned
            • Reporter: Harsh J (qwertymaniac)

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                • Estimated: 51h
                • Remaining: 51h
                • Logged: Not Specified
