Hadoop YARN

YARN-149: ResourceManager (RM) High-Availability (HA)

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None
    • Target Version/s: None

      Description

      This jira tracks work needed to be done to support one RM instance failing over to another RM instance so that we can have RM HA. Work includes leader election, transfer of control to leader and client re-direction to new leader.
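
      As a rough illustration of the leader-election piece, a minimal ZooKeeper-based sketch follows. It is not taken from the attached design docs; the znode path, session timeout, and class names are hypothetical.

      import org.apache.zookeeper.*;

      // Each RM instance races to create the same ephemeral znode. The
      // winner becomes active; the loser becomes standby, watches the
      // znode, and re-runs the election when the active RM's session dies.
      public class RMLeaderElection implements Watcher {
        private static final String LOCK_PATH = "/yarn-rm-leader"; // hypothetical
        private final ZooKeeper zk;

        public RMLeaderElection(String zkQuorum) throws Exception {
          zk = new ZooKeeper(zkQuorum, 15000, this);
        }

        public void elect() throws Exception {
          try {
            zk.create(LOCK_PATH, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            becomeActive();
          } catch (KeeperException.NodeExistsException e) {
            // Watch the active RM's znode; null means it vanished already.
            if (zk.exists(LOCK_PATH, true) == null) {
              elect();
            } else {
              becomeStandby();
            }
          }
        }

        @Override
        public void process(WatchedEvent event) {
          if (event.getType() == Event.EventType.NodeDeleted) {
            try { elect(); } catch (Exception e) { /* log and retry */ }
          }
        }

        private void becomeActive()  { /* transfer control: start services */ }
        private void becomeStandby() { /* reject or redirect clients */ }
      }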

      Attachments

      1. rm-ha-phase1-approach-draft1.pdf (165 kB) - Karthik Kambatla
      2. rm-ha-phase1-draft2.pdf (170 kB) - Karthik Kambatla
      3. YARN ResourceManager Automatic Failover-rev-07-21-13.pdf (207 kB) - Bikas Saha
      4. YARN ResourceManager Automatic Failover-rev-08-04-13.pdf (207 kB) - Bikas Saha


          Activity

          Vinod Kumar Vavilapalli added a comment -

          Canceling the 'pdf' patch. The individual patches are getting committed separately.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12595828/YARN%20ResourceManager%20Automatic%20Failover-rev-08-04-13.pdf
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3829//console

          This message is automatically generated.

          Vinod Kumar Vavilapalli added a comment -

          Keeping it unassigned given multiple contributors.

          Bikas Saha added a comment -

          Please open a sub-task. A patch would be great or else someone else could pick it up too.

          Steve Loughran added a comment -

          I want to warn that HADOOP-9905 is going to drop the ZK dependency from the core hadoop-client POM. If the YARN client (that's the client, not the server) is going to depend on ZK, then it's going to have to add it explicitly.
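
          If that dependency drop happens, a YARN client module that talks to ZK directly would need to declare it itself, along these lines (a config sketch; the version shown is illustrative):

          <!-- hypothetical explicit ZK dependency in the client pom.xml -->
          <dependency>
            <groupId>org.apache.zookeeper</groupId>
            <artifactId>zookeeper</artifactId>
            <version>3.4.5</version>
          </dependency>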

          Karthik Kambatla added a comment -

          YARN-1281. Haven't had a chance to look into it yet.

          Zhijie Shen added a comment -

          Is this a random test failure related to some ZKRMStateStore patch? https://builds.apache.org/job/PreCommit-YARN-Build/2291//testReport/org.apache.hadoop.yarn.server.resourcemanager.recovery/TestZKRMStateStoreZKClientConnections/testZKClientDisconnectAndReconnect/

          Bikas Saha added a comment -

          Updating the document with minor updates based on comments.

          Karthik Kambatla added a comment -

          Bikas - I believe the following lines do mention it, but not as explicitly as your comment. Might help to be more explicit.

          External RPC services accept client requests for the active RM instance and reject client requests for the standby RM. It is expected that the standby RM will receive no external stimulus and hence there will be no internal activity (services etc.) on the standby RM.
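
          A minimal sketch of that reject-on-standby behaviour, assuming Hadoop common's StandbyException is reused at the RPC layer (the class and method names below are hypothetical):

          import java.io.IOException;
          import org.apache.hadoop.ha.HAServiceProtocol.HAServiceState;
          import org.apache.hadoop.ipc.StandbyException;

          // Hypothetical guard: every external-facing RPC handler calls
          // checkActive() first, so a standby RM rejects client requests
          // and its internal services see no external stimulus.
          public class ClientRPCGuard {
            private volatile HAServiceState haState = HAServiceState.STANDBY;

            void checkActive(String op) throws IOException {
              if (haState != HAServiceState.ACTIVE) {
                throw new StandbyException(
                    op + " is not allowed in state " + haState);
              }
            }

            void setState(HAServiceState state) { haState = state; }
          }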

          Bikas Saha added a comment -

          The document describes the overall intent for an interested reader. It does not imply that the implementation steps will follow verbatim. We would like to make the changes incremental, e.g. just do manual fail-over at first to test the HAServiceProtocol impl. Similarly, we can start with services being unaware of HA state and add it later on.

          To be clear, the document says that only the external-facing layer (RPC API layer) may be aware of HAState, and this is not required at the beginning. HDFS does something similar, with each client operation checked for being allowed in a given HAState. Internal services are not expected to know about HAState, and it is expected that there is no activity in the internal state machines until the RM instance becomes active. Can you please take another look and let me know if this is not clear from the document? I can edit it to clarify.

          Karthik Kambatla added a comment -

          Thanks for posting this.

          Main comment: IIUC, the approach proposes that, right from the beginning (phase 1), each service in the RM should be aware of the Active/Standby HA states and behave accordingly (different state machines?). While starting all services immediately and waiting for transition to Active might be the correct approach eventually, for simplicity in the first implementation, should we start these services on transition to active?
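
          A sketch of that simplification, assuming the RM's internal services keep using the existing CompositeService pattern (the class and method names below are illustrative, not from the doc):

          import org.apache.hadoop.service.CompositeService;

          // Illustrative: everything that should only run on the active RM
          // is bundled into one composite service that is created and
          // started on transition to active, not at daemon startup.
          public class ActiveServices extends CompositeService {
            public ActiveServices() {
              super("ActiveServices");
              // addService(scheduler); addService(amService); ...
            }
          }

          // In the RM (pseudocode-level):
          //   transitionToActive():  activeServices = new ActiveServices();
          //                          activeServices.init(conf); activeServices.start();
          //   transitionToStandby(): activeServices.stop();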

          Bikas Saha added a comment -

          Attaching first revision of overall approach. I am sure something is missing and something else can be improved. Will incorporate feedback as it comes. Will soon start creating work items that make sense in a chronological ordering of work. Making incremental progress while keeping the RM stable is the desired course of action (like YARN-128).

          Karthik Kambatla added a comment -

          Thanks Bikas.

          1) extra daemon to manage because in fail-over scenarios each extra actor increases the combinatorics

          The wrapper is not an extra daemon. There will be a single daemon for the wrapper/RM. In the cold standby case, the wrapper starts the RM instance when it becomes active.

          2) the wrapper functionality seems to overlap the ZKFC and RM

          The wrapper interacts with the ZKFC and RM.

          3) RM will need to be changed to interact with the wrapper and the changes IMO should not be much different than those needed for direct ZKFC interaction

          Mostly agree with you here.

          I believe it boils down to the following: what state machine to incorporate the HA logic into. The wrapper approach essentially proposes two state machines - one for the core RM and one for the HA logic. Integrating the HA logic into the current RM will be adding more states to the current RM. There are (dis)advantages to both: the wrapper approach shouldn't affect non-HA instances, and might help with earlier adoption by major YARN users like Yahoo!

          In fact, what is being called a wrapper is something that probably does wrap around core RM functionality but remains inside the RM. From what I see, it will be an impl of the HAProtocol interface around the core RM startup functionality.

          Looks like a promising approach. Let me take a closer look at the code and comment.
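
          For concreteness, a minimal sketch of such an in-process impl of common's HAServiceProtocol (RMCore and its methods are hypothetical stand-ins for the core RM):

          import java.io.IOException;
          import org.apache.hadoop.ha.HAServiceProtocol;
          import org.apache.hadoop.ha.HAServiceStatus;

          interface RMCore {                 // hypothetical core-RM handle
            void startActiveServices();
            void stopActiveServices();
          }

          // The "wrapper" lives inside the RM process and exposes
          // HAServiceProtocol so a ZKFC or admin CLI can drive transitions.
          public class RMHAService implements HAServiceProtocol {
            private final RMCore rm;
            private volatile HAServiceState state = HAServiceState.STANDBY;

            public RMHAService(RMCore rm) { this.rm = rm; }

            @Override
            public void transitionToActive(StateChangeRequestInfo req)
                throws IOException {
              rm.startActiveServices();
              state = HAServiceState.ACTIVE;
            }

            @Override
            public void transitionToStandby(StateChangeRequestInfo req)
                throws IOException {
              rm.stopActiveServices();
              state = HAServiceState.STANDBY;
            }

            @Override
            public void monitorHealth() throws IOException {
              // would throw HealthCheckFailedException if the RM is sick
            }

            @Override
            public HAServiceStatus getServiceStatus() throws IOException {
              return new HAServiceStatus(state);
            }
          }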

          Bikas Saha added a comment -

          Thanks. I looked at the draft. Will incorporate stuff from it or use it as the base directly. In general, it's slightly mixing fail-over with HA. The way RM restart has been envisioned, with a good implementation, downtime due to restart should not be visible to users even with what is termed a "cold" restart. Finally, I differ on the wrapper implementation because of:

          1) extra daemon to manage because in fail-over scenarios each extra actor increases the combinatorics
          2) the wrapper functionality seems to overlap the ZKFC and RM
          3) RM will need to be changed to interact with the wrapper and the changes IMO should not be much different than those needed for direct ZKFC interaction
          4) we will not be similar to HDFS patterns and that makes the system harder to maintain and manage

          In fact, what is being called a wrapper is something that probably does wrap around core RM functionality but remains inside the RM. From what I see, it will be an impl of the HAProtocol interface around the core RM startup functionality.

          Karthik Kambatla added a comment -

          Uploading draft-2 with more details on the wrapper approach.

          Bikas Saha added a comment -

          Thanks for the notes Karthik. I will go through it and incorporate stuff into the final document.

          Alejandro Abdelnur added a comment -

          Karthik, high level seems OK to me. One thing I would add: for the HTTP failover, if we have the RMHA wrapper approach, the wrapper, when in standby, would redirect HTTP calls to the active RM. While this does not cover the case of rerouting if hitting an RM that crashed, it will cover the common case where somebody hits the running standby.
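
          Roughly, that could be a servlet filter on the standby's web server, along these lines (a sketch; how the active RM's address is discovered is left open):

          import java.io.IOException;
          import javax.servlet.*;
          import javax.servlet.http.HttpServletRequest;
          import javax.servlet.http.HttpServletResponse;

          // Sketch: a standby RM redirects any web request to the active
          // RM's web address instead of serving it locally.
          public class StandbyRedirectFilter implements Filter {
            private volatile String activeWebAddress; // e.g. "http://rm1:8088"

            @Override
            public void doFilter(ServletRequest req, ServletResponse res,
                FilterChain chain) throws IOException, ServletException {
              HttpServletRequest in = (HttpServletRequest) req;
              HttpServletResponse out = (HttpServletResponse) res;
              if (activeWebAddress != null) {
                out.sendRedirect(activeWebAddress + in.getRequestURI());
              } else {
                chain.doFilter(req, res); // no known active; serve locally
              }
            }

            @Override public void init(FilterConfig cfg) {}
            @Override public void destroy() {}
          }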

          Karthik Kambatla added a comment -

          I have uploaded a basic design/approach document for phase 1 - rm-ha-phase1-approach-draft1.pdf. The doc basically proposes the use of a cold standby and an RMHADaemon wrapper around the RM for HA-related code.

          Please share your thoughts and comments to improve the design further.

          Karthik Kambatla added a comment -

          Sounds good, thanks Bikas. I also have been thinking about this and working on a draft. Will get it into shape and attach it here.

          Bikas Saha added a comment -

          I will be posting a short design/road-map document shortly. If anyone has ideas, notes etc. then please start posting so that I can consolidate them. Overall, most of the tools and interfaces are already available in common via the HDFS HA project. The work will mainly be around integrating them with YARN/RM.
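
          On the client-redirection side, the core pattern is just cycling through the configured RM addresses until the active one answers; a plain-Java sketch (RMCall is a hypothetical hook for the real RPC, not a YARN API):

          import java.io.IOException;
          import java.util.List;

          interface RMCall<T> {              // hypothetical RPC hook
            T invoke(String rmAddress) throws IOException;
          }

          public class FailoverInvoker {
            // Try each RM in turn; a standby rejects or times out, so the
            // client keeps cycling until the active instance responds.
            static <T> T invoke(List<String> rms, RMCall<T> call,
                int maxAttempts) throws IOException {
              IOException last = null;
              for (int i = 0; i < maxAttempts; i++) {
                String addr = rms.get(i % rms.size());
                try {
                  return call.invoke(addr);
                } catch (IOException e) {
                  last = e;                  // fail over to next candidate
                }
              }
              throw last != null ? last
                  : new IOException("no RM address attempted");
            }
          }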

          Harsh J added a comment -

          Thanks Bikas! Agree it is related to resurrecting RM restart.

          Arun - It isn't a duplicate; at least the way I see it, MAPREDUCE-4326 targets restart-recovery, while this one I'd opened to target proper HA (multiple RMs, failing over automatically, with client code covered too). It is what may come after restart-ability is achieved. Thanks, I've reopened it.

          Arun C Murthy added a comment -

          Duplicate of MAPREDUCE-4326.

          Bikas Saha added a comment -

          Assigning to myself since this looks like something that follows directly after MAPREDUCE-4326 and design/implementation would be closely related with it.


            People

             • Assignee: Unassigned
             • Reporter: Harsh J
             • Votes: 3
             • Watchers: 77


                 Time Tracking

                 • Estimated: 51h
                 • Remaining: 51h
                 • Logged: Not Specified
