Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

      Description

      This JIRA tracks the work needed to support one RM instance failing over to another RM instance so that we can have RM HA. The work includes leader election, transfer of control to the leader, and client redirection to the new leader.
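      For reference, RM HA of the kind proposed here is driven by yarn-site.xml properties. A minimal sketch follows, using the property names that eventually shipped with RM HA; the cluster id, RM ids, hostnames, and ZooKeeper quorum below are placeholder values:

```xml
<!-- Hedged sketch: ids, hostnames, and the ZK quorum are placeholders. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```

With this in place, clients and NMs try each configured RM id in turn until they reach the active instance.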

      Attachments

      1. YARN ResourceManager Automatic Failover-rev-08-04-13.pdf (207 kB, Bikas Saha)
      2. YARN ResourceManager Automatic Failover-rev-07-21-13.pdf (207 kB, Bikas Saha)
      3. rm-ha-phase1-draft2.pdf (170 kB, Karthik Kambatla)
      4. rm-ha-phase1-approach-draft1.pdf (165 kB, Karthik Kambatla)

        Issue Links

        1. Separate out RM services into "Always On" and "Active" (Sub-task, Closed, Karthik Kambatla)
        2. Implement RMHAProtocolService (Sub-task, Closed, Karthik Kambatla)
        3. Test and verify ACL based ZKRMStateStore fencing for RM State Store (Sub-task, Resolved, Karthik Kambatla)
        4. Add FailoverProxyProvider like capability to RMProxy (Sub-task, Closed, Karthik Kambatla)
        5. Allow embedding leader election into the RM (Sub-task, Closed, Karthik Kambatla)
        6. Expose RM active/standby state to Web UI and REST API (Sub-task, Closed, Karthik Kambatla)
        7. Add admin support for HA operations (Sub-task, Closed, Karthik Kambatla)
        8. Revisit exception handling in ZKRMStateStore post RM HA (Sub-task, Resolved, Unassigned)
        9. Add shutdown support to non-service RM components (Sub-task, Open, Xuan Gong)
        10. Support automatic failover using ZKFC (Sub-task, Open, Karthik Kambatla)
        11. Add end-to-end tests for HA (Sub-task, Open, Xuan Gong)
        12. Move init() of activeServices to ResourceManager#serviceStart() (Sub-task, Resolved, Karthik Kambatla)
        13. Augment MiniYARNCluster to support HA mode (Sub-task, Closed, Karthik Kambatla)
        14. Update HAServiceState to STOPPING on RM#stop() (Sub-task, Resolved, Karthik Kambatla)
        15. ResourceManger.clusterTimeStamp should be reset when RM transitions to active (Sub-task, Resolved, Unassigned)
        16. Verify RM HA works in secure clusters (Sub-task, Open, Wing Yew Poon)
        17. Make improvements in ZKRMStateStore for fencing (Sub-task, Closed, Karthik Kambatla)
        18. RM DT token service should have service addresses of both RMs (Sub-task, Closed, Karthik Kambatla)
        19. Configuration to support multiple RMs (Sub-task, Closed, Karthik Kambatla)
        20. RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException (Sub-task, Closed, Tsuyoshi OZAWA)
        21. Promote AdminService to an Always-On service and merge in RMHAProtocolService (Sub-task, Closed, Karthik Kambatla)
        22. Set HTTPS webapp address along with other RPC addresses in HAUtil (Sub-task, Closed, Karthik Kambatla)
        23. Enabling HA should check Configuration contains multiple RMs (Sub-task, Closed, Xuan Gong)
        24. RM should log using RMStore at startup time (Sub-task, Patch Available, Tsuyoshi OZAWA; Original Estimate: 3h, Remaining Estimate: 3h)
        25. Handle RM fails over after getApplicationID() and before submitApplication(). (Sub-task, Closed, Xuan Gong; Original Estimate: 48h, Remaining Estimate: 48h)
        26. HA config shouldn't affect NodeManager RPC addresses (Sub-task, Closed, Karthik Kambatla)
        27. RM services should depend on ConfigurationProvider during startup too (Sub-task, Closed, Xuan Gong)
        28. Move internal services logic from AdminService to ResourceManager (Sub-task, Closed, Vinod Kumar Vavilapalli)
        29. WebApplicationProxy should be always-on w.r.t HA even if it is embedded in the RM (Sub-task, Closed, Xuan Gong)
        30. Enabling HA should verify the RM service addresses configurations have been set for every RM Ids defined in RM_HA_IDs (Sub-task, Closed, Xuan Gong)
        31. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA (Sub-task, Open, Tsuyoshi OZAWA)
        32. Mark appropriate protocol methods with the idempotent annotation or AtMostOnce annotation (Sub-task, Closed, Xuan Gong)
        33. Use StandbyException instead of RMNotYetReadyException (Sub-task, Closed, Karthik Kambatla)
        34. Web UI should redirect to active RM when HA is enabled. (Sub-task, Closed, Cindy Li)
        35. Add an option to yarn rmadmin to clear the znode used by embedded elector (Sub-task, Open, Karthik Kambatla)
        36. Add an easy way to turn on HA (Sub-task, Open, Karthik Kambatla)
        37. Race between ServerRMProxy and ClientRMProxy setting RMProxy#INSTANCE (Sub-task, Closed, Karthik Kambatla)
        38. ZK store should use a private password for root-node-acls (Sub-task, Closed, Karthik Kambatla)
        39. RMDispatcher should be reset on transition to standby (Sub-task, Closed, Xuan Gong)
        40. ActiveRMInfoProto fields should be optional (Sub-task, Closed, Karthik Kambatla)
        41. Support explicit failover when automatic failover is enabled (Sub-task, Open, Karthik Kambatla)
        42. HA-related rmadmin commands don't work on a secure cluster (Sub-task, Closed, Karthik Kambatla)
        43. Make admin refresh of capacity scheduler configuration work across RM failover (Sub-task, Closed, Xuan Gong)
        44. YARM RM HA requires different configs on different RM hosts (Sub-task, Closed, Xuan Gong)
        45. Manual Failover does not work in secure clusters (Sub-task, Closed, Xuan Gong)
        46. ZK store should attempt a write periodically to ensure it is still Active (Sub-task, Closed, Karthik Kambatla)
        47. RMDTRenewer#getRMClient should use ClientRMProxy (Sub-task, Closed, Karthik Kambatla)
        48. Webservice should redirect to active RM when HA is enabled. (Sub-task, Closed, Cindy Li)
        49. add the ability to set yarn.resourcemanager.hostname.rm-id instead of setting all the various host:port properties for RM (Sub-task, Closed, Xuan Gong)
        50. Set better defaults for HA configs for automatic failover (Sub-task, Closed, Xuan Gong)
        51. Make admin refreshNodes work across RM failover (Sub-task, Closed, Xuan Gong)
        52. Make admin refreshSuperUserGroupsConfiguration work across RM failover (Sub-task, Closed, Xuan Gong)
        53. Make admin refreshAdminAcls work across RM failover (Sub-task, Closed, Xuan Gong)
        54. Make admin refreshServiceAcls work across RM failover (Sub-task, Closed, Xuan Gong)
        55. Make admin refreshUserToGroupsMappings of configuration work across RM failover (Sub-task, Closed, Xuan Gong)
        56. Make admin refresh of Fair scheduler configuration work across RM failover (Sub-task, Open, Xuan Gong)
        57. Cleanup YARN HAUtil class (Sub-task, Open, Vinod Kumar Vavilapalli)
        58. Document RM HA (Sub-task, Open, Karthik Kambatla)
        59. Reset cluster-metrics on transition to standby (Sub-task, Resolved, Rohith)
        60. RM should get the updated Configurations when it transits from Standby to Active (Sub-task, Closed, Xuan Gong)
        61. RMAdminCLI should check whether HA is enabled before executes transitionToActive/transitionToStandby (Sub-task, Closed, Xuan Gong)
        62. Handle RM failovers during the submitApplication call. (Sub-task, Resolved, Xuan Gong)
        63. Handle RM fail overs after the submitApplication call. (Sub-task, Closed, Xuan Gong)
        64. Write test cases to verify that killApplication API works in RM HA (Sub-task, Closed, Xuan Gong)
        65. When RM does the initiation, it should use loaded Configuration instead of bootstrap configuration. (Sub-task, Closed, Xuan Gong)
        66. renewDelegationToken should survive RM failover (Sub-task, Closed, Zhijie Shen)
        67. Handle AMRMTokens across RM failover (Sub-task, Open, Unassigned)
        68. RM HA: AM link broken if the AM is on nodes other than RM (Sub-task, Closed, Robert Kanter)
        69. Add retry cache support in ResourceManager (Sub-task, Open, Tsuyoshi OZAWA)
        70. Persist ClusterMetrics across RM HA transitions (Sub-task, Open, Unassigned)
        71. cancelDelegationToken should survive RM failover (Sub-task, Open, Zhijie Shen)
        72. Both RM stuck in standby mode when automatic failover is enabled (Sub-task, Open, Vinod Kumar Vavilapalli)
        73. ZK store: Add yarn.resourcemanager.zk-state-store.root-node.auth for root node auth (Sub-task, Open, Karthik Kambatla)
        74. Implement and verify Scheduler#moveApplication() idempotent for CapacityScheduler/FairScheduler (Sub-task, Open, Xuan Gong)
        75. Make ApplicationMasterProtocol#allocate AtMostOnce (Sub-task, Closed, Xuan Gong)
        76. Add testcases to test AMRMToken on HA (Sub-task, Resolved, Xuan Gong)
        77. Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are redirecting to Active RM (Sub-task, Resolved, Xuan Gong)
        78. Yarn standby RM taking long to transition to active (Sub-task, Patch Available, Xuan Gong)

          Activity

          Vinod Kumar Vavilapalli added a comment -

          Keeping it unassigned given multiple contributors.

          Bikas Saha added a comment -

          Please open a sub-task. A patch would be great or else someone else could pick it up too.

          Steve Loughran added a comment -

          I want to warn that HADOOP-9905 is going to drop the ZK dependency from the core hadoop-client POM. If the YARN client is going to depend on ZK (that's the client, not the server), then it's going to have to explicitly add it.
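          Concretely, after such a change a client module that needs ZK would have to declare the dependency itself, roughly like this (the version shown is illustrative):

```xml
<!-- Sketch: explicit ZK dependency in the client's pom.xml; version is a placeholder. -->
<dependency>
  <groupId>org.apache.zookeeper</groupId>
  <artifactId>zookeeper</artifactId>
  <version>3.4.5</version>
</dependency>
```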

          Karthik Kambatla added a comment -

          YARN-1281. Haven't had a chance to look into it yet.

          Zhijie Shen added a comment -

          Is this a random test failure related to some ZKRMStateStore patch? https://builds.apache.org/job/PreCommit-YARN-Build/2291//testReport/org.apache.hadoop.yarn.server.resourcemanager.recovery/TestZKRMStateStoreZKClientConnections/testZKClientDisconnectAndReconnect/

          Bikas Saha added a comment -

          Updating the document with minor revisions based on comments.

          Karthik Kambatla added a comment -

          Bikas - I believe the following lines do mention it, but not as explicitly as your comment. Might help to be more explicit.

          External RPC services accept client requests for the active RM instance and reject client requests for the standby RM. It is expected that the standby RM will receive no external stimulus and hence there will be no internal activity (services etc.) on the standby RM.
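          The quoted behavior can be sketched as follows. This is a hypothetical illustration, not actual RM code; the class and method names are invented:

```java
// Hypothetical sketch of an RPC-layer HA guard: external client calls are
// accepted only when the instance is active and rejected on standby, so the
// standby receives no external stimulus.
public class HaGuardSketch {
    enum HAState { STANDBY, ACTIVE }

    static class StandbyException extends Exception {
        StandbyException(String msg) { super(msg); }
    }

    private HAState state = HAState.STANDBY;

    void transitionToActive() { state = HAState.ACTIVE; }

    // Every externally facing RPC entry point checks the HA state first.
    String handleClientRequest(String request) throws StandbyException {
        if (state != HAState.ACTIVE) {
            throw new StandbyException("RM is standby; retry against the active RM");
        }
        return "handled: " + request;
    }

    public static void main(String[] args) throws Exception {
        HaGuardSketch rm = new HaGuardSketch();
        boolean rejected = false;
        try {
            rm.handleClientRequest("submitApplication");
        } catch (StandbyException e) {
            rejected = true; // standby rejects external stimulus
        }
        if (!rejected) throw new AssertionError("standby should reject requests");
        rm.transitionToActive();
        String r = rm.handleClientRequest("submitApplication");
        if (!r.equals("handled: submitApplication")) throw new AssertionError(r);
        System.out.println("ok");
    }
}
```

A failover-aware client would treat the rejection as a signal to retry against the other configured RM.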

          Bikas Saha added a comment -

          The document describes the overall intent for an interested reader. It does not imply that the implementation steps will follow verbatim. We would like to make the changes incremental. E.g. just do manual fail-over at first to test the HAServiceProtocol impl. Similarly, we can start with services being unaware of HA state and then add it later on. To be clear, the document says that only the external facing (RPC API layer) may be aware of HAState, and this is not required at the beginning. HDFS does something similar, with each client operation checked for being allowed in a given HAState. Internal services are not expected to know about HAState, and it is expected that there is no activity in the internal state machines until the RM instance becomes active. Can you please take another look and let me know if this is not clear from the document? I can edit it to clarify.

          Karthik Kambatla added a comment -

          Thanks for posting this.

          Main comment: IIUC, the approach proposes that, right from the beginning (phase 1), each service in the RM should be aware of the Active/Standby HA states and behave accordingly (different state machines?). While starting all services immediately and waiting for transition to Active might be the correct approach eventually, for simplicity in the first implementation, should we start these services on transition to active?

          Bikas Saha added a comment -

          Attaching first revision of overall approach. I am sure something is missing and something else can be improved. Will incorporate feedback as it comes. Will soon start creating work items that make sense in a chronological ordering of work. Making incremental progress while keeping the RM stable is the desired course of action (like YARN-128).

          Karthik Kambatla added a comment -

          Thanks Bikas.

          1) extra daemon to manage because in fail-over scenarios each extra actor increases the combinatorics

          The wrapper is not an extra daemon. There will be a single daemon for the wrapper/RM. In the cold standby case, the wrapper starts the RM instance when it becomes active.

          2) the wrapper functionality seems to overlap the ZKFC and RM

          The wrapper interacts with the ZKFC and RM.

          3) RM will need to be changed to interact with the wrapper and the changes IMO should not be much different than those needed for direct ZKFC interaction

          Mostly agree with you here.

          I believe it boils down to the following: what state machine to incorporate the HA logic into. The wrapper approach essentially proposes two state machines - one for the core RM and one for the HA logic. Integrating the HA logic into the current RM will be adding more states to the current RM. There are (dis)advantages to both: the wrapper approach shouldn't affect non-HA instances, and might help with earlier adoption by major YARN users like Yahoo!

          In fact, what is being called as a wrapper is something that probably does wrap around core RM functionality but remains inside the RM. From what I see, it will be an impl of the HAProtocol interface around the core RM startup functionality.

          Looks like a promising approach. Let me take a closer look at the code and comment.

          Bikas Saha added a comment -

          Thanks. I looked at the draft. Will incorporate stuff from it or use it as the base directly. In general, it's slightly mixing fail-over with HA. The way RM restart has been envisioned, with a good implementation, downtime due to restart should not be visible to users even with what is termed a "cold" restart. Finally, I differ on the wrapper implementation because of:
          1) an extra daemon to manage, because in fail-over scenarios each extra actor increases the combinatorics
          2) the wrapper functionality seems to overlap the ZKFC and RM
          3) the RM will need to be changed to interact with the wrapper, and the changes IMO should not be much different than those needed for direct ZKFC interaction
          4) we will not be similar to HDFS patterns, and that makes the system harder to maintain and manage.
          In fact, what is being called a wrapper is something that probably does wrap around core RM functionality but remains inside the RM. From what I see, it will be an impl of the HAProtocol interface around the core RM startup functionality.

          Karthik Kambatla added a comment -

          Uploading draft-2 with more details on the wrapper approach.

          Bikas Saha added a comment -

          Thanks for the notes Karthik. I will go through it and incorporate stuff into the final document.

          Alejandro Abdelnur added a comment -

          Karthik, high level seems OK to me. One thing I would add, for the HTTP failover, if we have the RMHA wrapper approach the wrapper when in standby would redirect HTTP calls to the active RM. While this does not cover the case of rerouting if hitting an RM that crashed, it will cover the common case where somebody hits the running standby.

          Karthik Kambatla added a comment -

          I have uploaded a basic design/approach document for phase 1 - rm-ha-phase1-approach-draft1.pdf. The doc basically proposes the use of a cold standby and an RMHADaemon wrapper around the RM for HA-related code.

          Please share your thoughts and comments to improve the design further.

          Karthik Kambatla added a comment -

          Sounds good, thanks Bikas. I also have been thinking about this and working on a draft. Will get it to shape, and attach it here.

          Bikas Saha added a comment -

          I will be posting a short design/road-map document shortly. If anyone has ideas, notes etc. then please start posting so that I can consolidate them. Overall, most of the tools and interfaces are already available in common via the HDFS HA project. The work will mainly be around integrating them with YARN/RM.

          Harsh J added a comment -

          Thanks Bikas! Agree it is related to resurrecting RM restart.

          Arun - It isn't a duplicate; at least the way I see it, MAPREDUCE-4326 targets restart-recovery, while this one I'd opened to target proper HA (multiple RMs, failing over automatically, with client code covered too). It is what may come after restart-ability is achieved. Thanks, I've reopened it.

          Arun C Murthy added a comment -

          Duplicate of MAPREDUCE-4326.

          Bikas Saha added a comment -

          Assigning to myself since this looks like something that follows directly after MAPREDUCE-4326 and design/implementation would be closely related with it.


            People

            • Assignee: Unassigned
            • Reporter: Harsh J
            • Votes: 2
            • Watchers: 71

              Dates

              • Created:
              • Updated:

                Time Tracking

                Estimated: 51h (Original Estimate - 51h)
                Remaining: 51h (Remaining Estimate - 51h)
                Logged: Not Specified
