HBase
HBASE-10296

Replace ZK with a consensus lib (Paxos, ZAB, or Raft) running within master processes to provide better master failover performance and state consistency

    Details

    • Type: Brainstorming
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Currently the master relies on ZK to elect the active master, monitor liveness, and store almost all of its state, such as region states, table info, replication info, and so on. ZK also serves as a channel for master-regionserver communication (such as in region assignment) and client-regionserver communication (such as replication state/behavior changes).
      But zk as a communication channel is fragile because of its one-time watches and asynchronous notification mechanism, which together can lead to missed events (hence missed messages). For example, the master must rely on the idempotence of the state transition logic to keep the region assignment state machine correct; in fact, almost all of the trickiest inconsistency issues trace their root cause back to the fragility of zk as a communication channel.
      Replacing zk with Paxos running within the master processes has the following benefits:
      1. Better master failover performance: all masters, whether active or standby, hold the same latest state in memory (except lagging ones, which can eventually catch up). Whenever the active master dies, the newly elected active master can immediately play its role, without failover work such as rebuilding its in-memory state by consulting the meta table and zk.
      2. Better state consistency: the masters' in-memory state is the only truth about the system, which eliminates inconsistency from the very beginning. And though the state is held by all masters, Paxos guarantees the copies are identical at any time (see the sketch after this list).
      3. A more direct and simple communication pattern: clients change state by sending requests to the master, and master and regionservers talk directly to each other via requests and responses. None of them needs to go through third-party storage like zk, which introduces more uncertainty, worse latency, and more complexity.
      4. zk would then be used only as liveness monitoring, to determine whether a regionserver is dead; later on we can eliminate zk entirely once we build heartbeats between master and regionservers.
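
      The core of points 1 and 2 is a replicated state machine: every master applies the same committed log of state changes in the same order, so any standby already holds the latest state when it is elected. A minimal sketch of the idea, where ConsensusLog, StateChange, RegionState, and NotLeaderException are illustrative names only, not existing HBase or library classes:

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.function.Consumer;

      // Illustrative sketch only: master state maintained as a replicated state machine.
      record StateChange(String region, String server) {}
      record RegionState(String server) {}
      class NotLeaderException extends Exception {}

      interface ConsensusLog {
          // Blocks until a quorum of masters has committed the change;
          // throws if this process is no longer the consensus leader.
          void propose(StateChange change) throws NotLeaderException;
          // Invoked on every master (leader and standbys) in commit order.
          void onCommit(Consumer<StateChange> apply);
      }

      class ReplicatedMasterState {
          private final Map<String, RegionState> regionStates = new ConcurrentHashMap<>();
          private final ConsensusLog log;

          ReplicatedMasterState(ConsensusLog log) {
              this.log = log;
              // Every master applies committed changes in the same order, so a standby
              // promoted to active already has the latest state in memory.
              log.onCommit(this::apply);
          }

          // Called on the active master; the change becomes visible everywhere
          // only after the consensus round succeeds.
          void assignRegion(String region, String server) throws NotLeaderException {
              log.propose(new StateChange(region, server));
          }

          private void apply(StateChange change) {
              regionStates.put(change.region(), new RegionState(change.server()));
          }
      }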

      I know this might look like a very crazy re-architecture, but it deserves deep thought and serious discussion, right?

        Issue Links

          Activity

          Honghua Feng created issue -
          Lars Hofhansl added a comment -

          If we do this, let's use RAFT: https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf

          Note that we used to have heartbeats between the various services (before my time, though), which were all removed in favor of ZK.

          Steve Loughran added a comment -

          One aspect of ZK that is worth remembering is that it lets other apps keep an eye on what is going on

          Andrew Purtell added a comment -

          Use of ZK has issues but what we had before was much worse. We had heartbeating and partially desynchronized state in a bunch of places. Rather than implement our own consensus protocol we used the specialist component ZK. Engineering distributed consensus protocols is a long term endeavor full of corner cases and hard to debug problems. It is worth consideration, but maybe only as a last resort. Does something about our use of ZK or ZK itself have fatal issues?

          Steve Loughran added a comment -

          The Google Chubby paper goes into some detail about why they implemented a Paxos service and not a Paxos library.

          Yet perhaps you could persuade the ZK team to rework the code enough that you could reuse it independently of ZK.

          Implementing a consensus protocol is surprisingly hard, as you have to:

          1. understand Paxos
          2. implement it
          3. prove that your implementation is correct

          Unit tests are not enough; talk to the ZK team about what they had to do to show that it works.

          Eric Charles added a comment -

          Steve Loughran zk will only tell you a part of the hbase status, so the usage of zk by a third-party app should be seen as a temporary workaround (or a nice side effect). I would be more tempted to prohibit access from third-party apps to hbase-zk (even reads) and enhance hbase to provide a monitoring/status API. I just want to point this out so that the 'status' API of zk does not come up as an argument in the choice of the consensus mechanism.

          About the choice, removing a component to maintain (zk) is a good step forward, but there are certainly arguments to keep it.

          Steve Loughran added a comment -

          ..but that ZK path is used to find the hbase master even if it moves round a cluster - what would happen there?

          Andrew Purtell added a comment -

          yet perhaps you could persuade the ZK team to rework the code enough that you could reuse it independently of ZK.

          AFAIK, the topic comes up once a year or so and there isn't sufficient interest.

          The Harvey Mudd Clinic tried extracting ZAB once, circa 2008. See ZOOKEEPER-30 and http://wiki.apache.org/hadoop/ZooKeeper/ZabProtocol . The contribution failed for lack of a code grant required by the Hadoop PMC chair at the time. ZooKeeper was still a contrib project of Hadoop then. There is different ZooKeeper project leadership now but the code is way out of date. Perhaps it could serve as a template for (re)extraction.

          Eric Charles added a comment -

          mmh, no answer here... I have read the description of this jira and didn't find any clue either. The best would be to have a draft design to see if it is really doable.

          Eric Charles added a comment -

          (commented via mail, but context has been removed, so replaying my comment via web ui)

          > ..but that ZK path is used to find the hbase master even if it moves round a cluster -what would happen there?

          mmh, no answer here... I have read the description of this jira and didn't find any clue either. The best would be to have a draft design to see if it is really doable.

          Honghua Feng added a comment -

          but that ZK path is used to find the hbase master even if it moves round a cluster -what would happen there?

          Typically we adopt master-based Paxos in practice, so naturally the master process hosting the Paxos master replica is the active master. The active master is elected by the Paxos protocol, not by zk, and each standby master knows who the current active master is. When the active master moves around (for instance when it dies or its lease times out), a client or app that attempts to talk to the old active master fails in one of two ways: it fails to connect (the active master died), or it learns that the master it contacted is no longer active, along with who the new active master is. In the former case the client/app tries another randomly chosen alive master instance, which either accepts the request (if it is the new active master) or replies with the current active master's address (if it is not). In the latter case it can talk to the active master directly (roughly the retry loop sketched below). And just as with accessing a zk quorum, a client/app should know the master ensemble addresses to access an HBase cluster. (I'm assuming you mean finding the active master; correct me if I'm wrong.)
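
          A rough sketch of that lookup loop; ActiveMasterLocator, NotActiveMasterException, and its activeMaster() hint are hypothetical names, and the RPC itself is elided:

          import java.io.IOException;
          import java.util.List;
          import java.util.concurrent.ThreadLocalRandom;

          // Illustrative only: a client locates the active master given the full
          // master ensemble addresses, just as it knows the zk quorum addresses today.
          class NotActiveMasterException extends IOException {
              private final String active;
              NotActiveMasterException(String active) { this.active = active; }
              String activeMaster() { return active; }   // hint returned by a standby
          }

          class ActiveMasterLocator {
              private final List<String> ensemble;   // all master addresses, from client config

              ActiveMasterLocator(List<String> ensemble) { this.ensemble = ensemble; }

              String findActiveMaster() throws IOException {
                  String candidate = randomMember();
                  while (true) {
                      try {
                          ping(candidate);                     // succeeds only on the active master
                          return candidate;
                      } catch (NotActiveMasterException standby) {
                          candidate = standby.activeMaster();  // a standby told us who leads
                      } catch (IOException dead) {
                          candidate = randomMember();          // connect failed: try another member
                      }
                  }
              }

              private String randomMember() {
                  return ensemble.get(ThreadLocalRandom.current().nextInt(ensemble.size()));
              }

              // Stand-in for the real RPC: would throw NotActiveMasterException (with the
              // leader's address) when a standby answers, IOException when unreachable.
              private void ping(String addr) throws IOException { /* RPC elided */ }
          }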

          Honghua Feng added a comment -

          One aspect of ZK that is worth remembering is that it lets other apps keep an eye on what is going on

          Yes, this is a good question. ZK's watch/notification pattern can be viewed as a communication mechanism: each ZK node represents a piece of data; app A updates the ZK node when it updates the data, and app B, which has a watch on it, receives a notification when the data is updated.
          If we use Paxos to replace ZK, the data represented by each ZK node is instead hosted within each master process's memory as a data structure, updated via the Paxos-replicated state machine triggered by client/regionserver requests. The watch/notification center moves from ZK to the master, and we can still use the node -> watch-list mechanism ZK uses for the implementation (sketched below).
          The above 'keep an eye on what is going on' (or watch/notify) then changes in two ways:
          1. master <-> zk <-> regionserver communication is replaced by direct master <-> regionserver communication
          2. client <-> zk <-> regionserver communication is replaced by client <-> master <-> regionserver communication (the master plays the role ZK used to play)
          A note: we can also provide more flexible options by exposing sync/async notification and one-time/permanent watches; ZK provides only one-time async watches.
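
          A minimal sketch of that node -> watch-list registry living inside the master, with the one-time/permanent option from the note; all names are illustrative, not existing HBase classes:

          import java.util.Map;
          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.CopyOnWriteArraySet;

          // Illustrative sketch of moving ZK-style watches into the master: each logical
          // node keeps its own watcher list, and watchers may be permanent (unlike ZK's
          // one-shot watches).
          class MasterWatchRegistry {
              interface Watcher { void notifyChanged(String node, byte[] newValue); }
              private record Registration(Watcher watcher, boolean permanent) {}

              private final Map<String, Set<Registration>> watchers = new ConcurrentHashMap<>();

              void watch(String node, Watcher w, boolean permanent) {
                  watchers.computeIfAbsent(node, k -> new CopyOnWriteArraySet<>())
                          .add(new Registration(w, permanent));
              }

              // Called from the consensus apply path after a state change commits,
              // so every notification corresponds to a committed update.
              void fireChanged(String node, byte[] newValue) {
                  Set<Registration> regs = watchers.getOrDefault(node, Set.of());
                  for (Registration r : regs) {
                      r.watcher().notifyChanged(node, newValue);
                      if (!r.permanent()) regs.remove(r);  // one-time watches expire after firing
                  }
              }
          }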

          Honghua Feng added a comment -

          The google chubby paper goes into some detail about why they implemented a Paxos Service and not a paxos library.

          I believe Google does have a Paxos library, which is used in Megastore and Spanner, right? This fact is mentioned in the Google paper 'Paxos Made Live'.
          Implementing Paxos as a standalone/shared service or as a library each has its own benefits and drawbacks.
          A service: a simple API that is simple for apps to use, and it can be shared by multiple apps; but abuse by one app can negatively affect the other apps using the same Paxos service (we encountered several such cases before).
          A library: more difficult for an app to use, but better isolation (it won't be affected by possible abuse from other apps), more primitives, and more flexibility.

          Honghua Feng added a comment -

          Lars Hofhansl / Andrew Purtell / Steve Loughran / Eric Charles :
          1. Paxos / Raft / a ZAB library extracted from ZK are all good candidates.
          2. I agree that implementing a correct consensus protocol for production use is extremely hard; that's why I tagged this jira's type as brainstorming. My intention is to raise it to discuss what a better / more reasonable architecture would look like.
          3. If we finally all agree on a better architecture/design after analysis/discussion/proof, we can approach it in a conservative and incremental way, and maybe eventually we make it.

          Honghua Feng added a comment -

          Thanks all for the questions/pointers/material/history notes, really appreciated.

          Lars Hofhansl added a comment - edited

          I always thought that having processes participate in the coordination process directly (as group members) rather than using an external group membership service would be better, so I was very disappointed when I first looked at ZK and found that ZAB was buried so deeply in with the rest of ZK.

          ZK, on the other hand, is simple (because somebody else solved the hard problems for us). So I can see this go both ways.

          On some level that ties into the discussion as to why we have master and regionserver roles. Couldn't all servers serve both roles as needed?

          Eric Charles added a comment -

          I think the issue is that zk is just a half solution. It is a coordination util, but the job is still to be done. For now, the coordination logic is mainly done in the hbase code (a bit everywhere, I think; there is no 'coordination' package, no separation of concerns).
          To evolve, there are 2 directions:
          1. Embed the coordination with a protocol where the coordination is built in (ZAB, Paxos, or whatever).
          2. Move the coordination out of the hbase code to an external layer. Zk is not enough; would Helix (which relies on Zk) be a good fit?

          Honghua Feng added a comment -

          I always thought that having processes participate in the coordination process directly (as group members) rather than using an external group membership would be better

          Yes, me too; I have held the same thought all along.

          On some level that ties into the discussion as to why we have master and regionserver roles. Cannot all servers serve both roles as needed?

          That would lead to a totally decentralized architecture, like Dynamo?... It is a much more aggressive re-architecture, almost a complete overhaul. Most current code can remain untouched if we only incorporate the zk functionality into the master process, right?

          Andrew Purtell added a comment -

          That would lead to a totally decentralized architecture, like Dynamo?

          If I understand Lars correctly, not quite. There would still be master and regionserver roles, just that any HBase process could perform those roles, presumably determined by running elections. With Dynamo every process has a homogeneous set of behaviors IIRC. The refactoring wouldn't be total, but instead of HRegionServer and HMaster as "main" classes, there would be a new main class participating in role elections and spawning threads/instances to perform those roles according to the outcome. Or something like that. Is that what you are thinking, Lars Hofhansl?

          Andrew Purtell added a comment -

          Also, if this conversation morphs into a "let's P2P HBase" one, that would be out of scope for this JIRA, which is "Replace ZK with a paxos running within master processes to provide better master failover performance and state consistency". Perhaps create or use another JIRA for that discussion?

          I will say this though: One crucial difference between a Dynamo or Cassandra install versus HBase is they do their own persistence. Maybe a HBase with a fully homogeneous set of behaviors would be useful, but unless running on an equivalently distributed persistence layer, we would still operationally be subject to HDFS's master-slave architecture. And a master-slave architecture has some distinct advantages over a P2P one without central control: Under failure conditions, it is easier to get things under control with one commander giving orders rather than relying on the convergence of emergent behaviors.

          Lars Hofhansl added a comment -

          There would still be master and regionserver roles, just that any HBase process could perform those roles, presumably determined by running elections.

          Exactly. Now that we have the logic that all HMasters know who is the currently active master (HBASE-5083) we could just run the HMaster tasks as part of every server and get rid of the distinct HMaster role (something that Jesse Yates had wanted to do a while ago). But I agree with Andy, we are digressing.

          Under failure conditions, it is easier to get things under control with one commander

          Yep, I cannot stress this enough. This discussion comes up at work all the time and I keep making exactly this point, but somehow it always gets lost.
          All larger organizations (that I know of) favor (multi-)master designs over a decentralized approach.

          stack added a comment -

          What would the steps involved in moving off zk to a group of masters keeping consensus look like?

          Honghua Feng added a comment -

          sorry for the late reply

          Lars Hofhansl, Andrew Purtell : thanks for clarifying Lars Hofhansl's proposal that 'all servers serve both master and regionserver roles as needed'; I now understand what he really meant, and it's really interesting. But if we eventually decide to replace ZK with a consensus algorithm (Paxos, ZAB, or Raft) running within the masters, it is less appealing to make all servers run both master and regionserver roles as needed:

          • With this replacement, we reduce the total machines from X masters (typically 3?) plus Y zookeepers (typically 3 or 5) to only Y masters (like the zookeepers, typically 3 or 5), by incorporating the master logic (create/remove tables, failover, balance...) and the zookeeper logic (replicating state) into the same set of processes/servers. After this replacement, the standby masters are no longer 'standby' in the old sense, where they stayed idle waiting for the active master to die and then competed to take over, wasting their machines' resources most of the time; they now participate in reaching consensus, replicating/persisting state, taking snapshots, etc. We shouldn't mind scheduling separate machines for these tasks, just as we don't mind scheduling separate machines for followers within a zookeeper ensemble and never think of reusing the follower zookeeper servers as regionservers, right?
          • It's preferable to keep the membership of the consensus group fixed, within a small set of predesignated machines. Supporting dynamic membership for the consensus algorithm, which we would need if we permitted every server of an HBase cluster to play the master role, would noticeably complicate the overall design/implementation.

          I also agree with you two that a totally P2P architecture like Dynamo/Cassandra would have a much harder time handling failover.

          What would the steps involved moving off zk to a group of masters keeping consensus look like?

          stack : the rough steps I can think of for now are as below:

          1. Implement a robust/reliable consensus lib (Paxos/ZAB/Raft).
          2. Redesign the master based on this consensus lib. We then no longer need to write out the HBase/master state, such as region assignment status, replication info, and table info, to far-away/outside persistent reliable storage such as zookeeper or another HBase system table; we just replicate it among the masters, and the masters themselves are the only truth about this state.
          3. HBASE-5487 aims for a master redesign that stores the state in a system table. Though that avoids the state maintenance problems caused by events missed through zookeeper's watch/notify mechanism, it still has the problem of keeping/maintaining the truth in two different places (master memory and the system table), and we would still need to be very careful in the implementation; it's always a headache when the state reader/maintainer (the master) and the state persistence layer are not on the same server.
          4. HBASE-10295 aims to move replication info from zookeeper to another system table. If we get the consensus lib working within the master, we can represent replication info as just another in-memory data structure rather than a separate system table. (Personally I don't think storing replication info in zookeeper nodes is as severe a problem as region assignment status, since replication-aware logic is inherently idempotent: it cares only about the final state of a piece of replication info when it changes, not how it reached that final state. Region assignment logic, by contrast, is more like a state machine: it does care that a state is transitioned from a valid previous state (a transition appears to come from an invalid state when some event/state is missed), otherwise the code becomes pretty tricky and hard to understand/maintain; the sketch after this list makes the contrast concrete. Another concern with moving replication info from zookeeper to a system table is that it's hard for an HBase table to represent a deep tree-like hierarchical structure: an HBase table can naturally represent no more than a 3-layer structure (row + cf + qualifier), while zookeeper and in-memory data structures have no such layer limit.)
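
          A toy illustration of that contrast, with made-up states and transitions (not the real RegionState / AssignmentManager code): replication info can be overwritten blindly, while an assignment update is only meaningful relative to the previous state.

          import java.util.EnumMap;
          import java.util.EnumSet;
          import java.util.Map;

          // Toy illustration only. Replication info is last-value-wins: a blind
          // put("peer1", "enabled") needs no previous-state check. Region assignment
          // is a state machine whose updates are valid only relative to the prior state:
          enum AssignState { OFFLINE, PENDING_OPEN, OPEN, PENDING_CLOSE, CLOSED }

          class AssignStateMachine {
              private static final Map<AssignState, EnumSet<AssignState>> VALID =
                  new EnumMap<>(Map.of(
                      AssignState.OFFLINE,       EnumSet.of(AssignState.PENDING_OPEN),
                      AssignState.PENDING_OPEN,  EnumSet.of(AssignState.OPEN, AssignState.OFFLINE),
                      AssignState.OPEN,          EnumSet.of(AssignState.PENDING_CLOSE),
                      AssignState.PENDING_CLOSE, EnumSet.of(AssignState.CLOSED),
                      AssignState.CLOSED,        EnumSet.of(AssignState.OFFLINE)));

              private AssignState current = AssignState.OFFLINE;

              // With zk watches, a missed event surfaces here as a seemingly invalid
              // transition; a lossless consensus log delivers every committed change,
              // so this check never has to paper over a gap.
              void transition(AssignState next) {
                  if (!VALID.get(current).contains(next)) {
                      throw new IllegalStateException(current + " -> " + next);
                  }
                  current = next;
              }
          }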
          Honghua Feng made changes -
          Summary changed from "Replace ZK with a paxos running within master processes to provide better master failover performance and state consistency" to "Replace ZK with a consensus lib (Paxos, ZAB, or Raft) running within master processes to provide better master failover performance and state consistency"
          Liyin Tang added a comment -

          Speaking of RAFT implementations: we, the FB hbase team, are very close to open-sourcing a Raft implementation as a library. And there are multiple potential ways to integrate the Raft protocol into the HBase/HDFS software stack.

          Honghua Feng added a comment -

          Speaking of RAFT implementations: we, the FB hbase team, are very close to open-sourcing a Raft implementation as a library. And there are multiple potential ways to integrate the Raft protocol into the HBase/HDFS software stack.

          Liyin Tang, sounds really great.

          stack added a comment -

          Sweet Liyin Tang. I was going to write that 1. would be fun, but it looks like it'd be a bit of work, going by this list: http://raftconsensus.github.io/. There does not seem to be a complete, easy-to-integrate implementation as yet, not unless we go JNI. Varying group membership and compacting the logs we'd have to contribute. I suppose we'd write the logs to the local filesystem if we want edits persisted.

          On 2. above Honghua Feng, we'd have no callback/notification mechanism, so it will take a little more effort to replace the zk-based mechanism. We'd have to add the ability to pass state messages, for say when a region has opened on a regionserver or is closing...
          On 3. and 4., agree.

          On undoing the notion of a 'master', we'd also be simplifying hbase packaging and deploy. There would be no more need to set aside machines for this special role. Let's keep chatting on this one.

          Honghua Feng added a comment -

          Varying group membership and compacting the logs we'd have to contrib. I suppose we'd write the logs to the local filesystem if we want edits persisted.

          1. Yes, we need a snapshot feature provided from within the lib, used by the app/user (here our HMaster) to reduce log files; otherwise the number of log files can grow immensely over time (a bit like the motivation for flush in HBase). I assume you meant snapshotting when you said 'compacting the logs'. The snapshot functionality should be a callback implemented by the user logic, and after it completes the consensus lib removes the corresponding log files, right? (See the sketch after this list.)
          2. Agree with you that we'd write the logs to the local filesystem for persistence; no doubt here.
          3. Varying group membership has relatively lower priority than the rest and can safely be treated as a nice-to-have feature at the beginning, given that we almost always use a pre-configured fixed set of machines as HMasters, right?
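
          The division of labor in point 1 might look roughly like this; both interfaces are hypothetical, sketching the contract rather than any real library API:

          import java.io.IOException;
          import java.io.InputStream;
          import java.io.OutputStream;

          // Hypothetical contract between the consensus lib and its user (here HMaster):
          // the user only serializes its state machine; the lib decides when to snapshot
          // and truncates its log files once the snapshot is durable.
          interface SnapshotSupport {
              // Callback implemented by the application: write the full in-memory
              // state, valid as of lastAppliedIndex, to the stream.
              void writeSnapshot(OutputStream out, long lastAppliedIndex) throws IOException;

              // Callback used on restart, or when a lagging replica catches up.
              void restoreSnapshot(InputStream in) throws IOException;
          }

          interface ConsensusLogCompaction {
              // The lib invokes writeSnapshot, persists the result to the local
              // filesystem, then removes log files up to the snapshot index.
              void snapshotAndTruncate(SnapshotSupport app) throws IOException;
          }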

          We'd have no callback/notification mechanism so it will take a little more effort replacing the zk-based mechanism. We'd have to add being able to pass state messages for say when a region has opened on a regionserver or is closing...

          I would propose to replace zk in an incremental fashion:

          1. For region assignment status info, we move it out of zk into the embedded in-memory consensus lib instance.
          2. zk can still serve as the central truth-holder storage for 'configuration'-like data such as replication info, since zk does its job well in such scenarios (we have analysed this more comprehensively in HBASE-1755).
          3. zk also remains the liveness monitor for regionservers (but not for the HMaster's health, which is now handled by the consensus lib instance itself) until we implement heartbeats directly between HMaster and regionservers.
          4. For region assignment status info, since HMaster and regionservers talk directly by exchanging request/response messages once we use the in-memory consensus lib, it's natural that 'they are able to pass state messages for say when a region has opened on a regionserver or is closing'.
          Enis Soztutar added a comment -

          Speaking of RAFT implementations: we, the FB hbase team, are very close to open-sourcing a Raft implementation as a library. And there are multiple potential ways to integrate the Raft protocol into the HBase/HDFS software stack.

          Nice.

          I would propose to replace zk in an incremental fashion:

          I like the incremental approach. The only thing is that this would require a multi-master setup, unless we keep the logs in hdfs and are able to use a single-node RAFT quorum, right?

          Honghua Feng added a comment -

          The only thing is that this would require a multi-master setup, unless we keep the logs in hdfs and are able to use a single-node RAFT quorum, right?

          Do you mean a master backward-compatibility issue?

          There is no master backward-compatibility issue when replacing zk with the embedded consensus lib, nor between the different incremental phases.

          My point in proposing the incremental approach is that the data/functionality provided by zk has different levels of urgency for being moved inside the master processes: data such as region assignment info is the most urgent; configuration-like data is less urgent (it needs an additional watch/notify feature); and the liveness monitoring functionality is the least urgent (it needs an additional heartbeat feature)... We can remarkably improve master failover performance and eliminate inconsistency just by replicating region assignment info among the master processes' memory, which is the foremost concern/goal of this jira.

          To eliminate ZK as a whole, reducing deployment effort and machine numbers/roles, as stack said ("...simplifying hbase packaging and deploy. There would be no more need to set aside machines for this special role"), we also need to implement liveness monitoring (for regionservers) and watch/notify features within the master processes...

          The above features are independent and can be implemented incrementally; that's what I meant by 'incremental', but certainly we could implement them as a whole.

          Not sure I understood your question correctly, and I wonder whether I answered it; any further clarification is welcome.

          Enis Soztutar added a comment -

          Agreed that the most urgent thing is zk assignment. There have been at least 3 proposals so far for a master + assignment rewrite in HBASE-5487, and all want to get rid of zk and fix assignment.
          What I was trying to understand is the deployment. I was assuming the RAFT quorum servers would be master processes as well.
          Currently it is sufficient to have 1 master, 1 backup, and 3 zk servers for HA. With some master functionality implemented with RAFT but still using zk, we would need at least 3 zk servers and 3 master servers for full HA, which is a change in the requirement for a minimum HA setup.

          However, with the incremental approach, we might even implement the RAFT quorum inside regionserver processes, so that we gradually get rid of the master role as well and have only 1 type of server, where (2n+1) of them would act as masters (while still serving data).

          How do you imagine the typical small / medium sized deployment will look?

          Honghua Feng added a comment -

          There have been at least 3 proposals so far for a master + assignment rewrite in HBASE-5487, and all want to get rid of zk and fix assignment.

          Agree, but all those proposals still use third-party storage (from zk to an auxiliary system table) outside the master processes/machines for persisting data such as assignment status information, so:

          1. The new active master needs to read that data from the outside third-party storage before serving as the active master after the previous master dies, hence suboptimal master failover performance.
          2. The same data/information is still maintained in two different locations, master memory and the outside third-party storage, hence potential consistency issues.

          What I was trying to understand is about the deployment...with the incremental approach, we might even implement RAFT quorum inside region server processes, so that we gradually get rid of the master role as well, and have only 1 type of server, where (2n+1) of them would act like masters (while still serving data).

          Now I understand what you meant. If we take the incremental approach, then it's 3 zk, 3 masters, and N regionservers; yes, that's a suboptimal setup.
          If we implement all the functionality zk provides for HBase, such as data replication, master election, liveness monitoring, and watch/notify, and eliminate zk totally, the deployment of HBase becomes (3 masters + N regionservers).
          Though it's eventually workable to run the master and regionserver roles concurrently within a single server, I'm not a fan of this deployment:

          1. The master and regionserver roles can affect each other, and it's hard to debug/diagnose when an issue arises.
          2. Master and regionserver are both memory-consuming; for servers running both roles concurrently we need to balance the memory usage, and servers running only the regionserver role need a different regionserver memory/heap configuration from those running both roles to take full advantage of the available memory.
          Cosmin Lehene made changes -
          Link: This issue is related to HBASE-10295
          Cosmin Lehene added a comment -

          Zookeeper is also used by HDFS, Kafka, and Storm, as well as several other systems. Is it realistic (or desirable) to assume it would go away (from an operations standpoint)? (With raft-go there's etcd, for example.)

          Will a library-based approach simplify the code overall or make it easier to understand? It seems that it will make at least some parts more complex.
          What aspects of the system will be improved by the lower latencies? I'm not really clear on the faster master failover benefit. Will this improve region reassignment in a manner that could not be achieved without it?

          Honghua Feng added a comment -

          Zookeeper is also used by HDFS, Kafka, and Storm, as well as several other systems. Is it realistic (or desirable) to assume it would go away (from an operations standpoint)? (With raft-go there's etcd, for example.)

          Not quite. For applications that just need simple, reliable storage for a small amount of configuration-like or meta-like data with sparse access, Zookeeper still has advantages over a raft-based solution:

          1. Economical: Zookeeper can be shared among a big number of applications with such simple storage requirement, but for raft-based solution each application need to allocate its own separate 3-5 nodes for replication purpose.
          2. Simple: Application code is simple by just calling Zookeeper API to create/read/write node/data to/from Zookeeper, while raft-based solution need to write more complex interacting code between application code and raft library such as passing/converting raft write/log to in-memory data structure with application-specific meaning, and snapshot making and log truncate, etc.
          3. Convenient: Zookeeper's tree-like hierarchical structure for organizing data and watch/notify mechanism is convenient for application to represent data and organize code, as long as watch/notify mechanism is not used to implement state-machine-like logic with the 'A process changes a znode, B process watches that znode and then reads the znode value to trigger its state-machine' pattern
            In short, raft-based solution is somewhat an overkill for such applications with simple, small and sparse-access storage requirement.
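
          To illustrate why that watch pattern is fragile, here is a minimal sketch using the standard ZooKeeper Java API; the znode path and class name are made up for illustration:

          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          public class OneTimeWatchSketch {
            public static void main(String[] args) throws Exception {
              ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) { /* session events */ }
              });

              // Process B: read the znode and register a one-time watch on it.
              byte[] state = zk.getData("/demo/region-state", new Watcher() {
                public void process(WatchedEvent event) {
                  // The watch fires at most once. If process A updates the znode
                  // twice before B gets back here to re-read and re-watch, B
                  // observes only the latest value: the intermediate state
                  // transition (and hence a message) is missed.
                }
              }, null);
            }
          }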

          Will a library-based approach simplify the code overall or make it easier to understand? It seems that it will make at least some parts more complex. What aspects of the system will be improved by the lower latencies? I'm not really clear on the faster master failover benefit. Will this improve region reassignment in a manner that could not be achieved without it?

          For HMaster, a Raft-based approach has the benefits below (a sketch of the replicated-state-machine pattern follows the list):

          1. For the assign (split/merge) state machine logic, a Raft-based approach eliminates the potential for state inconsistency. HMaster's current implementation suffers from two facts that can cause consistency issues: 1) Zookeeper's watch/notify mechanism is used to drive the assign state machine; 2) assign status is stored in multiple places (master memory and Zookeeper), so there is the perpetual headache of keeping those places consistent.
          2. Better master failover performance. A new master can immediately act as the active master after the previous one dies, without first reading from external storage to rebuild in-memory state (current HBase's approach) or querying regionservers and rebuilding its view of the cluster (Bigtable's approach; personally I think Bigtable's master startup code must be even more complicated than HBase's, since it needs to reason out the correct cluster state from regionserver responses, not to mention that regionservers can fail during the master startup process...).
          3. Better whole-cluster restart performance. For a cluster with a large number of regions (say 10K-100K), the master must assign all regions during a cluster restart, which means very frequent access to Zookeeper. Because only a single IO thread and a single event thread are used by the master to communicate with Zookeeper, that interaction can be an obvious bottleneck for cluster restart, while a Raft-based approach can perform much better here.
          4. Simpler deployment. HBase with a Raft-based approach deploys as '3 masters + N regionservers', while the Zookeeper solution is '3 Zookeeper + 2+ masters + N regionservers'. We can't assume applications running HBase can always find a shared Zookeeper to use.
          5. Isolation. A Zookeeper-based HBase cluster can be affected by other applications that slow down or even bring down the shared Zookeeper it relies on by abusing or misusing it, while a Raft-based approach doesn't need to worry about this.
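
          To make benefits 1 and 2 concrete, here is a hypothetical sketch; the class and method names are made up for illustration and are not an existing HBase or Raft-library API:

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          // Hypothetical command appended to the consensus (Raft) log.
          class AssignCommand {
            final String regionName;
            final String serverName;
            AssignCommand(String regionName, String serverName) {
              this.regionName = regionName;
              this.serverName = serverName;
            }
          }

          // Hypothetical state machine hosted inside every master process. The
          // consensus library invokes apply() for each committed log entry, in
          // the same order on the active and all standby masters, so every
          // master's in-memory assignment map stays identical and a newly
          // elected active master can serve immediately, with no state rebuild
          // from ZK or the meta table.
          class MasterStateMachine {
            private final Map<String, String> regionToServer =
                new ConcurrentHashMap<String, String>();

            void apply(AssignCommand cmd) {
              regionToServer.put(cmd.regionName, cmd.serverName);
            }
          }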
          Jacky007 added a comment -

          ZK is not good enough, but doing it on your own will make things worse.
          The only real problem I can see is that ZK is not strongly consistent.

          But zk as a communication channel is fragile due to its one-time watch and asynchronous notification mechanism which together can lead to missed events (hence missed messages),

          This can be done with the existing API (but the performance is much less efficient than Chubby's).
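
          For example, one way to avoid missed messages with the existing API is to append each message as a persistent sequential znode and have the consumer list and replay all unprocessed children, so a coalesced or missed watch notification never loses a message. A minimal sketch (the paths and class name are made up):

          import java.util.Collections;
          import java.util.List;
          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          public class SequentialQueueSketch {
            // Producer: each state transition becomes its own znode.
            static void send(ZooKeeper zk, byte[] message) throws Exception {
              zk.create("/demo/events/event-", message,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            }

            // Consumer: a watch only says "something changed"; listing the
            // children says exactly what, in sequence order, so nothing is lost.
            static void drain(ZooKeeper zk) throws Exception {
              List<String> children = zk.getChildren("/demo/events", true);
              Collections.sort(children);
              for (String child : children) {
                byte[] payload = zk.getData("/demo/events/" + child, false, null);
                // ... apply the transition, then delete the znode once handled.
              }
            }
          }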

          Honghua Feng added a comment -

          Zookeeper is also used by HDFS, Kafka, Storm, and several other systems. Is it realistic (or desirable) to assume it would go away (from an operations standpoint)?

          Maybe my answer above to this question was a bit too general. To be specific about HDFS: personally I think the namenode could also apply the idea of this jira; the metadata maintained by namenodes could be replicated among all namenodes using a consensus lib, removing the ZKFC and JournalNodes. I was surprised to find so many roles/processes introduced to accomplish HDFS HA when I first saw it; presumably that's due to historical reasons, right?

          Honghua Feng added a comment -

          ZK is not good enough, but doing it on your own will make things worse.

          Would you list your detailed reasons for this statement? By 'will make things worse', do you mean the coding complexity and correctness risk of implementing our own consensus lib? Or something else?

          The only real problem I can see is that ZK is not strongly consistent.

          ZK itself should be strongly consistent, right? It is our usage pattern of 'process A changes a znode, process B watches that znode and then reads its value to drive its state machine' for maintaining state-machine logic (especially the assign state machine) that results in the inconsistency problems in HMaster... but the data/states we put in ZK are still consistent, right?

          This can be done with the existing API (but the performance is much less efficient than Chubby's).

          Actually, if we make HMaster the arbitrator and only HMaster can write to ZK, then ZK acts as the single holder of truth: regionservers don't write/update state directly to ZK but talk to HMaster, and HMaster updates ZK on their behalf. This way the current inconsistency issues in HMaster can be remarkably alleviated, though we would still need careful handling to keep ZK and HMaster's in-memory data consistent... A sketch of this single-writer idea follows.
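
          As a hypothetical illustration of that single-writer arrangement (the class and method names below are made up, not real HBase RPC methods):

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          // Hypothetical "master as sole arbitrator": regionservers report
          // transitions to the master instead of writing to ZK themselves,
          // so ZK has exactly one writer and there are no concurrent updates
          // from regionservers to reconcile.
          class ArbitratorMaster {
            private final Map<String, String> regionStates =
                new ConcurrentHashMap<String, String>();

            // Called (e.g. via RPC) by a regionserver on a state transition.
            synchronized void reportTransition(String region, String newState) {
              regionStates.put(region, newState); // update in-memory truth first
              persistToZk(region, newState);      // then mirror it to ZK
            }

            private void persistToZk(String region, String newState) {
              // Single writer, but a master crash between the in-memory update
              // and the ZK write still needs care (the caveat noted above).
            }
          }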

          Sean Busbey made changes -
          Link This issue relates to ACCUMULO-715 [ ACCUMULO-715 ]
          Mikhail Antonov made changes -
          Link This issue is related to HBASE-10866 [ HBASE-10866 ]
          Pablo Medina added a comment -

          Hi all,
          I'm the author of CKite ( https://github.com/pablosmedina/ckite ), a JVM library implementation of Raft. It covers all of the Raft consensus protocol functionality and has an easy-to-use API for both Java and Scala. I think it could be integrated into HBase by designing a good StateMachine and its respective Commands. CKite can provide some listeners for elections, and other specific listeners can be implemented in the StateMachine itself on reception of Commands. I would be glad to start working on a draft integration to demonstrate its capabilities. I have experience running HBase in production, so it would be exciting to collaborate on this. If you were to start this in an incremental way, which step would be the best one to start with?

          Lars Hofhansl added a comment -

          Pablo Medina, great. I'd be reluctant to pull a Scala dependency into HBase, but it would be great to have access to an easy Raft API in HBase. I imagine we could eventually even use that to have multiple replicas of regions on multiple RegionServers.

          Mikhail Antonov added a comment -

          Pablo Medina, you may want to take a look at HBASE-10909 (and the pdf attached to it). As ZooKeeper is used in many places throughout the codebase, integrating other consensus libs to replace it would require a certain amount of refactoring of the current codebase.

          Mikhail Antonov added a comment -

          Shall we update the title of this jira to reflect the fact that this consensus-lib work is broader than just the master process (and failover performance specifically)?

          Mikhail Antonov made changes -
          Link This issue requires HBASE-10909 [ HBASE-10909 ]
          Mikhail Antonov added a comment -

          Linked to HBASE-10909, as it seems like a prerequisite.

          Show
          Mikhail Antonov added a comment - Linked to hbase-10909 as it seems like a prerequisite
          Pablo Medina added a comment -

          Mikhail Antonov Great. I'll take a look at your patches and see if I can help. It is a great start.


            People

            • Assignee: Unassigned
            • Reporter: Honghua Feng
            • Votes: 0
            • Watchers: 44
