Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ha, namenode
    • Labels:
      None

      Description

      This JIRA intends to revisit the patches committed for HADOOP-7455 and HDFS-1974, and to provide more generic interfaces that allow alternative HA implementations to co-exist while complying with HAServiceProtocol.

      Some of the considerations are
      1) Support life cycle methods (start*() and stop() APIs) in HAServiceProtocol
      2) Support custom states in HAServiceProtocol
      3) As per the patch submitted for HDFS-1974, Namenode implements HAService interface. This needs to be reconsidered.

      I will elaborate on these points in the form of comments below.
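To make consideration (1) concrete, here is a minimal sketch of what an HAServiceProtocol extended with lifecycle methods might look like, together with a toy in-memory implementation. All names here are illustrative, not taken from the actual org.apache.hadoop.ha.HAServiceProtocol API or from any attached patch.

```java
// Proposed states; STOPPED exists only because lifecycle methods do.
enum HAState { STOPPED, ACTIVE, STANDBY }

// Hypothetical protocol: the existing-style transition calls plus the
// start*()/stop() lifecycle additions this JIRA proposes.
interface LifecycleHAService {
    void startActive();          // proposed: start directly as active
    void startStandby();         // proposed: start directly as standby
    void stop();                 // proposed: full shutdown
    void transitionToActive();   // transition-style call
    void transitionToStandby();  // transition-style call
    HAState getState();
}

// Toy implementation, just to show the intended call ordering.
class ToyHAService implements LifecycleHAService {
    private HAState state = HAState.STOPPED;

    public void startActive()  { require(HAState.STOPPED); state = HAState.ACTIVE; }
    public void startStandby() { require(HAState.STOPPED); state = HAState.STANDBY; }
    public void stop()         { state = HAState.STOPPED; }
    public void transitionToActive()  { require(HAState.STANDBY); state = HAState.ACTIVE; }
    public void transitionToStandby() { require(HAState.ACTIVE); state = HAState.STANDBY; }
    public HAState getState()  { return state; }

    private void require(HAState expected) {
        if (state != expected)
            throw new IllegalStateException("expected " + expected + " but was " + state);
    }
}
```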

      1. HAService_fw_Class_Diagram.JPG
        170 kB
        Justin Joseph
      2. HDFS-2354.1.patch
        43 kB
        Justin Joseph
      3. HDFS-2354.patch
        34 kB
        Justin Joseph
      4. Namenode HA using Backup Namenode as Hot Standby.pdf
        347 kB
        Justin Joseph

        Issue Links

          Activity

          sureshms Suresh Srinivas added a comment -

          Support life cycle methods (start*() and stop() APIs) in HAServiceProtocol

          I already replied to this comment in HADOOP-7455. One cannot call a start method on a server that is not running! You also do not want to do stop, as you may not be able to connect to the server and reliably stop it. The right thing to use is the existing start and stop scripts.

          I am not sure about introducing custom states. We can use this JIRA to discuss that.

          justin_joseph Justin Joseph added a comment -

          Suresh, what you said holds true when Linux HA is used for determining Active / Standby roles. Linux HA can easily work with scripts. Going one step further, one can configure a standByToActive script in the HA framework and avoid the use of the HAServiceProtocol interface itself.

          I mean to say there are many ways of plumbing a Leader Election Service with Namenode. We need to consider other Leader Election mechanisms also. An alternative is Zookeeper based Leader Election Service where the LES and Namenode may run together in the same JVM.

          So the argument boils down to the question of whether the FailoverController daemon should be separate from the Namenode. If they are kept separate, having only the transition-related APIs in HAServiceProtocol looks to be the better decision. But when considering the case where both the FailoverController daemon and the Namenode run in the same JVM, it is very much required that the start*() and stop() APIs be supported in the HAServiceProtocol interface.

          justin_joseph Justin Joseph added a comment -

          Other considerations are

          1) There has been a lot of debate in HADOOP-7455 on supporting a Neutral State. Even now, there is a lot of apprehension over how a neutral state can be an alternative to various fencing techniques. I don’t want to use this JIRA for discussing the neutral state again; we can have that discussion in a separate JIRA.

          In this JIRA, I just want to highlight "neutral state" as an example use case for supporting custom states in HAServiceProtocol interface.

          2) As per the patch submitted for HDFS-1974, Namenode implements the HAService interface. For sure, there will be multiple implementations of the HAService interface. So it is better to reconsider this relationship introduced in HDFS-1974.

          justin_joseph Justin Joseph added a comment -

          Attached the patch (just to initiate discussion), along with a class diagram.

          The HAServiceProtocol is modeled to respond to various events / commands fired from a Leader Election Service. An abstract implementation of the protocol provides a framework to handle the events as state transitions. A concrete implementation can define the states and can fully control the lifecycle of the HA Service.

          The patch contains classes from both Hadoop Common and HDFS. I will separate them (and create different JIRAs) once some conclusion is made on the fate of this approach.
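The framework idea described in this comment can be sketched roughly as follows: an abstract base maps events from a Leader Election Service onto state transitions, while a concrete subclass supplies the state set. All class, method, and event names below are illustrative guesses, not taken from the attached patch or class diagram.

```java
// Abstract framework piece: turns incoming events into state transitions.
abstract class AbstractHAStateMachine {
    private String state;

    protected AbstractHAStateMachine(String initial) { state = initial; }

    // Subclass supplies the transition table: (state, event) -> next state,
    // or null if the event is not valid in the current state.
    protected abstract String nextState(String current, String event);

    // Framework entry point for events/commands from the election service.
    final void onEvent(String event) {
        String next = nextState(state, event);
        if (next == null)
            throw new IllegalStateException(event + " invalid in state " + state);
        state = next;
    }

    String currentState() { return state; }
}

// Concrete implementation with only the standard Active/Standby states.
class SimpleHAService extends AbstractHAStateMachine {
    SimpleHAService() { super("STANDBY"); }

    protected String nextState(String current, String event) {
        if (current.equals("STANDBY") && event.equals("becomeActive")) return "ACTIVE";
        if (current.equals("ACTIVE") && event.equals("becomeStandby")) return "STANDBY";
        return null;  // everything else is rejected
    }
}
```

A subclass that wants custom states (consideration 2 in the description) would simply return a larger transition table.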

          justin_joseph Justin Joseph added a comment -

          Updated the patch to delete 3 existing classes.

          sureshms Suresh Srinivas added a comment -

          Suresh, what you said holds true when Linux HA is used for determining Active / Standby roles. Linux HA can easily work with scripts. Going one step further, one can configure a standByToActive script in the HA framework and avoid the use of the HAServiceProtocol interface itself.

          And how does the HA framework talk to a server to effect standbyToActive? That is where HAServiceProtocol comes into play. In the standbyToActive script, you run a command that talks to the namenode to make it active.

          I mean to say there are many ways of plumbing a Leader Election Service with Namenode. We need to consider other Leader Election mechanisms also. An alternative is Zookeeper based Leader Election Service where the LES and Namenode may run together in the same JVM.

          The HDFS-1623 approach was to run the FailoverController (or CRM/LRM in Linux HA lingo) separate from the service for which HA is being enabled. The reasons for that are described in the document, hence HAServiceProtocol.

          So the argument boils down to the question of whether the FailoverController daemon should be separate from the Namenode. If they are kept separate, having only the transition-related APIs in HAServiceProtocol looks to be the better decision. But when considering the case where both the FailoverController daemon and the Namenode run in the same JVM, it is very much required that the start*() and stop() APIs be supported in the HAServiceProtocol interface.

          Even in this case, you could build a FailoverController that runs on the same JVM, but interacts with Namenode using HAServiceProtocol. The approach from HDFS-1974 does not preclude you from that.

          I have not looked at the patch in detail yet. At a high level, it is adding things/making changes that I am not in favor of. I will respond in detail about this.

          justin_joseph Justin Joseph added a comment -

          Even in this case, you could build a FailoverController that runs on the same JVM, but interacts with Namenode using HAServiceProtocol. The approach from HDFS-1974 does not preclude you from that.

          Yes, in fact I am trying to get the right set of interfaces that will allow HA-enabled Namenodes to work with any type of Cluster Resource Manager / HA Agent (Linux HA, a ZK-based Leader Election Service which may be either embedded or external, etc.).

          I figured out the fundamental difference in our approaches. In your approach, HA Service Protocol is the set of commands for triggering state changes on the Namenode, from Active to Standby or from Standby to Active. In my design, I have modeled it as a Local Resource Manager which sits between an HA Agent (or Cluster Resource Manager) and a Resource Agent (the Namenode). I had to, since I considered the case where both the HA Agent and the Namenode run together in the same JVM. In this case, the Local Resource Manager will be responsible for starting the Namenode as either Active or Standby and then, at a later point in time, switching it to a different role.

          Taking a consolidated view of the patches in HADOOP-7455, HDFS-1974 and HDFS-2301, and comparing them with the design I have attached in this JIRA, I feel our approaches are similar in many respects. The implementation in HDFS-1974 has the drawback that Namenode / NamenodeRpcServer implements HAServiceProtocol, which is against the goal of HDFS-1623 to build a framework for HA.

          The various considerations for HA Service Protocol, from my perspective, are the following:
          1) It should be easy to plug in the various approaches for building High Availability for the Namenode (for example, one may choose the Backup Node based approach, the Shared Storage approach, or any other).
          2) It should be possible to work with any type of Cluster Resource Manager.
          3) It should be possible to add different states, in addition to the Active and Standby states.

          I have tried to build a comprehensive framework which takes the above points into consideration. I request that you take a detailed look at the design and patch I have submitted and share your feedback with specific details.

          sureshms Suresh Srinivas added a comment -

          The implementation in HDFS-1974 has the drawback that Namenode / NamenodeRpcServer implements HAServiceProtocol, which is against the goal of HDFS-1623 to build a framework for HA.

          Can you explain how it is against the goal of HDFS-1623? I think you may be misunderstanding HAServiceProtocol's intent. HAServiceProtocol allows the Namenode to be treated as a resource. Whether it is CRM+LRM from Linux HA or a FailoverController, both need this. This is irrespective of where the FailoverController sits - as a separate daemon or as a module inside the namenode process.

          I have been very confused by the neutral state and how it avoids the need for fencing. If you create a detailed write-up on this aspect, I am willing to spend some time on it.

          justin_joseph Justin Joseph added a comment -

          The solution document is already uploaded in the original contribution issue, HDFS-2124.

          justin_joseph Justin Joseph added a comment -

          The neutral state is not a totally new way of handling network partitions. For example, consider the quorum concept supported by Pacemaker to handle split-brain scenarios.

          http://www.woodwose.net/thatremindsme/2011/04/stonith-and-quorum-in-pacemaker/

          Whenever quorum is present, Pacemaker will go with the majority vote on important decisions. When quorum is lost (i.e., the cluster is separated into groups where no group has a majority of votes), the behavior of the pool is determined by the no-quorum-policy property:

          • ignore - Do nothing when quorum is lost.
          • stop (default) - Stop all resources in the affected cluster partition.
          • freeze - Continue running existing resources, but don’t start any stopped ones.
          • suicide - Fence all nodes in the affected partition.
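For reference, no-quorum-policy is an ordinary cluster property; with the crmsh shell it would be set roughly like this (the exact tooling and syntax vary by Pacemaker version and distribution):

```shell
# Choose how a partition behaves when it loses quorum.
# Valid values: ignore | stop | freeze | suicide
crm configure property no-quorum-policy=stop
```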

          The neutral state is a modification of the stop option in the list above. When quorum is lost, instead of stopping the Namenode, it is moved into a state where it is neither Active nor Standby. When quorum is available again, it is moved to either the Active or the Standby state. The advantage of keeping the process in the neutral state, compared to stopping the process, is the time it takes to start servicing requests after the network connection is restored. The downtime will be much lower in the case of going to the neutral state.

          Another approach is to move the Namenode to the Standby state when quorum is lost. In the case of a network partition, this may not be of any help, since the Standby Namenode won’t be able to ping the Active Namenode and will just keep retrying. After the network is restored, if this Namenode (which was moved to Standby when quorum was lost) is elected as Active again, the time to move the Namenode from the Standby state to the Active state will be higher, depending on the timeout configured for RPC calls.

          When Zookeeper is used as the distributed coordinator, the neutral state is an effective way of handling network partitions. In the future, it may be supported by Pacemaker as an option for no-quorum-policy.

          I hope the approach is simple and clear. No node gets elected to the Active state, or continues to be in the Active state, unless a quorum decides so. If quorum is not available, the node relinquishes its role and remains neutral. Being in the neutral state helps it come back to the Active state faster after quorum is available again.
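The quorum-driven behavior described in this comment amounts to a very small state machine. Here is an illustrative sketch, where NEUTRAL is the proposed extra state and the event-handler names are invented for this sketch only:

```java
// States the node can be in; NEUTRAL is the proposed addition.
enum NodeState { ACTIVE, STANDBY, NEUTRAL }

// Illustrative only: a namenode-like service that drops to NEUTRAL on
// quorum loss instead of stopping, so it can resume a role quickly
// once quorum returns.
class NeutralAwareNode {
    private NodeState state = NodeState.NEUTRAL;  // start undecided

    // Quorum elected this node to a role.
    void onElectedActive()  { state = NodeState.ACTIVE; }
    void onElectedStandby() { state = NodeState.STANDBY; }

    // Quorum lost: relinquish any role, but keep the process alive.
    void onQuorumLost()     { state = NodeState.NEUTRAL; }

    NodeState getState()    { return state; }
}
```

The point of the sketch is the onQuorumLost() transition: it replaces a full process stop, which is why recovery after the partition heals is faster.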

          justin_joseph Justin Joseph added a comment -

          For a better understanding of Backup Node based HA with Zookeeper as the distributed coordinator, please refer to the Namenode HA using Backup Namenode as Hot Standby.pdf I have attached to this JIRA.

          justin_joseph Justin Joseph added a comment -

          Can you explain how it is against the goal of HDFS-1623?

          HDFS-1623 aims to build a framework for HA, where there is room for multiple approaches to HA (Shared Storage / Backup Node based, Zookeeper / Linux HA, IP failover / Intelligent Clients, and so on). In the patch submitted for HDFS-1974, the Namenode (NamenodeRpcServer) implements the HAServiceProtocol interface and provides an implementation based on the Shared Storage approach. Now, when I want to provide another implementation based on the Backup Node, I am confused about how to fit it into the HA framework. This is the point I want to make.

          sureshms Suresh Srinivas added a comment -

          Regarding HDFS-1623: since I am one of the authors of the original document, I do think the current approach caters to the issues you are talking about.
          I believe you can do everything you want to do by keeping a simple interface to the namenode and building the other logic into the entity that interfaces with it, irrespective of whether that entity is outside the NN process or inside it.

          atm Aaron T. Myers added a comment -

          Hey Justin, has Suresh addressed your concerns? If so, can we close out this JIRA?

          justin_joseph Justin Joseph added a comment -

          Aaron, I agree with Suresh that HAServiceProtocol can be kept as a simple interface to the Namenode. To handle my use cases, I can build around this simple interface. I am planning to work on this after the HA framework is ready for testing.

          We can close this JIRA for now. If needed, I will raise another issue later. Is that ok?

          atm Aaron T. Myers added a comment -

          Sounds good to me. Thanks a lot for your input, Justin.


            People

            • Assignee:
              justin_joseph Justin Joseph
              Reporter:
              justin_joseph Justin Joseph
            • Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development