Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: quorum
    • Labels:
      None
    1. patch-20090423.txt
      347 kB
      Andrew Farmer
    2. zab.diff
      297 kB
      Andrew Farmer

      Activity

      Flavio Junqueira added a comment -

      I don't know of anyone actively working on this. At the same time, one must be careful when plugging in a protocol implementation for a paxos variant because of the reasons we explain in the zab paper.

      Luke Lu added a comment -

      Anyone working on this? Recent atomic broadcast protocols (Ring-Paxos, S-Paxos etc.) seem to have much higher write throughput than ZAB, especially as the number of replicas grows.

      Andrew Farmer added a comment -

      Pursuant to some suggestions from Ben and Mahadev, I'm attaching a modified patch that gives a diff between the files we started with and the files we ended up with.

      Patrick Hunt added a comment -

      I talked with Owen (Hadoop PMC chair), specifically about what our options might be. He reiterated
      that we should not accept the code without a grant (as I mentioned earlier).

      I reviewed some options with him, such as creating a branch, but none of those options
      are particularly useful or possible.

      One idea we discussed was using Git to help merge the code. Specifically you could create a
      git repo from your svn repo, then attach the git repo to a jira (this would be the "grant"). We could
      use this to review and experiment with merge. You could then use the ZooKeeper git repo
      http://git.apache.org/
      to merge the results of your changes into the latest trunk code, and generate a patch set from that.

      Perhaps you can think about this relative to the changes you've made and your current status.

      Andrew Farmer added a comment - - edited

      Other Andrew here. (I'm also on the HMC clinic team.)

      Basically, the issue is that a lot of the changes we made have been kind of "patch-unfriendly" - we've moved and renamed a lot of files, and that can't really be reflected well by a patch file. (We tried generating a straight patch between our repository and yours, and it ended up being something like 5 MB.)

      With that all in mind, though, I'm attaching a REALLY ROUGH patch that simply adds our current version of Zab, as well as its respective tests, to the current SVN trunk revision of Zookeeper. Hopefully this should resolve the legal issues.

      What it doesn't do is:

      1) It doesn't make Zookeeper use Zab for anything. As a result, there's a lot of duplicated code now - Zookeeper will need to be modified significantly to run against the Zab API. All it does is add a bunch of code to the source tree.

      2) It also doesn't port in some of the changes you folks have made to code that's within Zab's ambit. (What's included is basically everything that doesn't involve either clients or the data tree: leader election, proposal handling, and logging/persistence.)

      3) Finally, it's not quite complete. We're still working on implementing syncs, as well as doing some further tests.

      Hopefully this is enough to start taking a look at, though... we'll keep you updated.

      Mahadev konar added a comment - - edited

      hi andrew,
      as pat pointed out, we would not be able to merge an external branch without a code grant like the one we have in patch submissions.
      would it be possible for you guys to break up the patch like -

      1) patch for changes in persistence
      2) patch for changes in quorum

      something like that? if not, creating a single patch is fine...

      We would like to include your changes in Zookeeper but it would be difficult for us to find bandwidth to review an external repository. Also it would be great if you can include the list of changes (concretely) you have made for Zab on this jira.
      Also, we should be able to meet with you later in May.. we can discuss that outside of this jira...

      Patrick Hunt added a comment -

      There are really 2 reasons why we need you to submit as a patch if you want the changes
      included in future releases of Apache ZooKeeper:

      1) we need the code to be submitted through JIRA for legal reasons. In particular when you
      submit the changes you need to check off the box that says:

      Grant license to ASF for inclusion in ASF works (as per the Apache License §5)
      Contributions intended for inclusion in ASF products (eg. patches, code) must be licensed to ASF under the terms of the Apache License. Other attachments (eg. log dumps, test cases) need not be.

      You can submit multiple patches, as well as a script/description of how to apply.

      Here's an example:
      https://issues.apache.org/jira/browse/ZOOKEEPER-234?focusedCommentId=12663566&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12663566

      2) we don't know your changes as well as you do. How are we going to apply them if you can't?

      We are very interested to review/include your changes. We'd be happy to help with any advice/support.

      Andrew Carman added a comment -

      We now have public read access to our codebase:
      https://svn.cs.hmc.edu/svn/linkedin08/zab-multibranch/
      Feel free to look around. It's still quite fluid as we implement the final few features.

      Andrew Carman added a comment -

      We'd be happy to show you what we've got, but we don't think we can deliver it as a patch. We've deleted a large number of files, touched every file in zab, zab/quorum, and zab/persistence, and changed a lot of the jute generated code. We're looking at a way to get you public read access to our repository, but until then is there some other way we could get it to you?

      We talked with Jean-Luc today and we all thought it might be a good idea for us to come back up to LinkedIn at the end of the semester to present our work to them. Would the Zookeeper team be available to sit in on our presentation and possibly do a code review (like we did last fall, but reversed) on May 11, 12, or 13th?

      Lastly, we think we've come up with a solution to the returning zxid's problem. Instead of returning zxid's when you propose, we'll return a ZabTxnCookie object which can be used to identify which proposals came from yourself and which came from other nodes. There will be an almost unique local id for each proposal, and when it is committed, it will also get the zxid, which can be used by the application layer as a unique id. We propose the signature below. Any comments or suggestions?

      ZabTxnCookie.java
      /**
       * An identifier for transactions that should be opaque to the user but useful
       * for comparing if two transactions are the same or not. By design it is used
       * by systems implementing the Zab and ZabCallback interfaces because a
       * ZabTxnCookie will be returned when you make a proposal (and a sync) and
       * then passed when commit is called so that the client can match their
       * proposals with commits.
       */
      public class ZabTxnCookie {
          /**
           * A unique identifier for each server, this is assigned by the config
           * files when Zab is being set up.
           */
          private long serverId;
      
          /**
           * The zxid assigned by the leader to a committed proposal. This will only
           * exist on committed proposals once they are passed to deliver. This CAN
           * be used as a unique identifier for each proposal.
           */
          private long zxid;
          /**
           * A probably unique identifier for each proposal. Its most significant
           * 32-bits are the bottom 32-bits of the system time in milliseconds when
           * the node starts up (so it's reset each time the server goes down). The
           * bottom 32-bits are just a counter that's incremented on each proposal.
           * So this number will not be unique if the server goes down and starts up
           * exactly n*2^32 milliseconds after the first time (n>0).
           */
          private long localId;
      
          public boolean equals(ZabTxnCookie other);
      
          /**
           * Returns a unique identifier for this proposal, however the identifier
           * is only valid for proposals that have been committed. So this method
           * should only be called once a transaction is delivered to you, never
           * just after making a proposal. This identifier is guaranteed to be
           * sequentially increasing and unique even across server failures.
           * 
           * @return A unique identifier for this proposal if it has been committed,
           *         otherwise this number is invalid.
           */
          public long getUniqueId();
      }
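
A minimal sketch of how the localId described in the Javadoc above might be built, assuming the layout stated there (low 32 bits of the startup time in the upper half, a per-proposal counter in the lower half). The names LocalIdGenerator and CookieKey are hypothetical and are not part of the proposed ZabTxnCookie signature. The sketch also overrides equals(Object) and hashCode(), because in Java a method declared as equals(ZabTxnCookie) overloads Object.equals rather than overriding it, so hash-based collections would not call it.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper that builds localId values as the Javadoc above describes. */
final class LocalIdGenerator {
    // Low 32 bits of the startup time, already shifted into the upper half.
    private final long startupBits;
    private final AtomicInteger counter = new AtomicInteger();

    LocalIdGenerator(long startupMillis) {
        this.startupBits = (startupMillis & 0xFFFFFFFFL) << 32;
    }

    long next() {
        // Mask the counter so a wrapped (negative) int does not sign-extend
        // into the upper 32 bits.
        return startupBits | (counter.getAndIncrement() & 0xFFFFFFFFL);
    }
}

/** Hypothetical value class holding the two fields known at propose time. */
final class CookieKey {
    final long serverId;
    final long localId;

    CookieKey(long serverId, long localId) {
        this.serverId = serverId;
        this.localId = localId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CookieKey)) {
            return false;
        }
        CookieKey other = (CookieKey) o;
        return serverId == other.serverId && localId == other.localId;
    }

    @Override
    public int hashCode() {
        return 31 * Long.hashCode(serverId) + Long.hashCode(localId);
    }
}
```

Two generators whose startup times differ by exactly 2^32 milliseconds produce the same sequence, which is the collision case the Javadoc calls out.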
      
      Flavio Junqueira added a comment -

      Andrew, It would be really good for us to see a preliminary patch, just to have an idea of how you have implemented a few things.

      Although perhaps not exactly appropriate, Ben and I had a discussion offline about whether propose should return a zxid. Currently, only the leader proposes, since broadcast messages are requests transformed into idempotent transactions. If we follow this model, then returning a zxid is not a problem because the leader is the one generating a zxid and a propose call does not block in this case.

      I understand, however, that you are interested in having Zab in such a way that any process in the ensemble can propose. If any process can propose, then it might not make sense to have a call to propose returning a zxid. The problem of not returning a zxid is that currently the broadcast layer and the application layer are tightly coupled and separating them creates a potential problem for generating snapshots that we use to recover or bring a new follower up to speed. In a little more detail, if we separate the state from the atomic broadcast and return no zxid, then the service layer on top of zab will have no notion of transaction identifiers. We currently use such identifiers to generate and guarantee a consistent state transfer.

      Andrew Carman added a comment -

      Update: We just got Multi-Zab up and running (our name for the multinode, networked version of Zab - basically we moved over all of the quorum components). We're working on preliminary testing and have already found and fixed a few bugs we introduced. We're continuing now with more thorough testing (eventually some performance as well as functionality tests) and code maintenance (comments, refactoring, etc). The only features we're missing are syncs and the ability to return a zxid when calling propose. Syncs are definitely a feature we will add (in the manner they exist now, so we avoid the two round trips of atomic broadcast), but we would like some input on returning zxid's.

      How should they work? When a Zab node has the propose method called on it, the API says it should return a zxid of that proposal. However, the leader has to assign the zxid, so we either need the client to wait for the proposal to be sent to the leader, assigned a zxid, and sent back out to be voted on, or we don't return a zxid. Is there a reason we need to return zxid's? Is it only so implementing applications can identify which committed proposals originated from themselves? Could we just add a server id and local id instead of having to wait for the zxid from the leader?
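
A rough sketch of the last option in the question above (identify proposals by server id plus local id rather than waiting for the leader's zxid). It is only an illustration: ProposerSketch, its methods, and the String payload are hypothetical names, not the Zab API, and networking, the leader, and voting are omitted.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical proposer that never blocks on the leader for a zxid. */
final class ProposerSketch {
    private final long serverId;
    private long nextLocalId = 0;
    // Proposals this node has issued that have not yet been delivered back.
    private final Map<Long, String> pending = new HashMap<>();

    ProposerSketch(long serverId) {
        this.serverId = serverId;
    }

    /** Returns immediately with a local id; no round trip to the leader. */
    long propose(String payload) {
        long id = nextLocalId++;
        pending.put(id, payload);
        return id;
    }

    /**
     * Called for every committed proposal. The zxid is only known here,
     * at delivery time. Returns true if this node originated the proposal.
     */
    boolean deliver(long originServerId, long localId, long zxid) {
        if (originServerId != serverId) {
            return false;
        }
        return pending.remove(localId) != null;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

The zxid parameter is unused in the matching logic on purpose: the application can still record it as the globally unique, ordered identifier once the proposal is delivered.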

      Flavio Junqueira added a comment -

      Andrew, Thanks for your clarification. As part of the ZooKeeper community, you're also part of the ZK team. I believe that other members of the community, including myself, will be happy to help you out with any issue you may have with the integration.

      Regarding the sync operation, I don't think it is a good idea to have it in the way you're proposing. Sync is a way to provide cheap linearizable reads; cheap in the sense that it doesn't require two rounds of atomic broadcast messages. Your solution works, but makes it more expensive unnecessarily.

      Andrew Carman added a comment -

      On a separate note, we were looking at the way sync's work in the current system and it seems like they don't really fit into the Zab interface as we've specified it. We are probably just going to drop them, and leave it up to the application using Zab to implement syncing if it needs it. We are fairly sure that this is possible by initiating a proposal that just says "sync" and waiting for it to be delivered, but we wanted to check to make sure that this would actually, correctly replicate ZK's sync functionality. Does it?
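
The idea in the comment above (sync as an ordinary proposal that the caller waits to see delivered) can be sketched as follows. This is a single-threaded toy under stated assumptions: the queue stands in for a totally ordered broadcast, and SyncByProposalSketch, broadcast, and sync are hypothetical names rather than anything in ZooKeeper or the Zab interface.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model: a FIFO queue stands in for totally ordered broadcast. */
final class SyncByProposalSketch {
    static final String SYNC_MARKER = "sync";

    // Transactions that are committed but not yet applied locally.
    private final Queue<String> inFlight = new ArrayDeque<>();
    private final List<String> applied = new ArrayList<>();

    void broadcast(String txn) {
        inFlight.add(txn);
    }

    /**
     * Proposes a marker and applies deliveries in order until the marker
     * itself is delivered. Because delivery is ordered, everything proposed
     * before the marker has been applied by then.
     */
    List<String> sync() {
        broadcast(SYNC_MARKER);
        while (true) {
            String txn = inFlight.remove();
            if (SYNC_MARKER.equals(txn)) {
                return applied;
            }
            applied.add(txn);
        }
    }
}
```

The marker here travels through the same path as every other transaction, so in a real ensemble it would pay the full cost of a broadcast round.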

      Andrew Carman added a comment -

      Sorry, I guess that wasn't a very clear update. We're not finished with Zab, just the equivalent of a single node, standalone ZKS. We are partway through pulling out the multinode portions of the system and we should be ready to test it soon. Once we are finished and have Zab separated from ZK, we aren't going to have time to rework ZK to use Zab, so it probably can't be included in a release any time soon. We were under the impression that that integration was something the ZK team would be doing, and we were wondering if there is a timeline for that and what it looks like.

      Flavio Junqueira added a comment -

      Andrew, That's great news! I don't see any major impediment to including your changes in our next 3.2.0 release, if it is working ok.
      Perhaps the best way to proceed at this point is to have you guys submitting some patches and testing using our unit and system tests. Does it sound good to you?

      Andrew Carman added a comment -

      Thought we'd give a status update:
      We've separated the standalone version of Zab (including all the logging) and are working on moving over the stuff from the quorum package now.
      Jean-Luc at LinkedIn has expressed interest in knowing how long it might take for Zookeeper to switch over to a Zab base once we can provide the code. Are you still planning on doing that once we are done? Our belief here is that you had milestones planned out at least six months or so, but if you could give us further information we'd appreciate it.

      Krishna Sankar added a comment -

      You are right - the current issue is learners. My question was deeper, to see if there is enough skeleton. Again, am getting up to speed - so you might find me rushing in where angels fear to tread ;o)
      Am on the road this week - Advanced OSX at BigNerdRanch. Let me get back after my return to civilization.
      Cheers
      <k/>

      Flavio Junqueira added a comment -

      Hi Krishna, Could you be more specific about what you mean by adding paxos capability? We discussed adding observers (I'm not sure there is a jira about that), which are basically paxos learners, but other than that I'm not sure there is anything. Is there anything important I'm missing?

      Krishna Sankar added a comment -

      I might be way off, but would this be an opportunity to add paxos capability? I have some ideas and was thinking of making a proposal, but do not want to ramble unless it is relevant here.
      Cheers
      <k/>

      Andrew Carman added a comment -

      Ben writes:

      1) The leader assigns the zxid. The propose request is asynchronous. Only the leader issues proposals.

      2) Only the leader proposes. (We have a single proposer.) We just know it has been queued into the system. And we know the zxid.

      3) No, ephemeral node management is done above the atomic broadcast layer. We don't need to know which servers are active.

      4) Yes getState and setState belong in the API. State transfers are rather integral to the atomic broadcast because there is a practical issue: you cannot keep an infinite log of messages, so you have to be able to summarize, both for storing on disk and for bringing followers up-to-date. For example, if you are on message 1,000,000 and a follower comes up having seen no messages, it is much more efficient to do a state transfer than to dump 1,000,000 messages to the follower. This is a general concept used by both Paxos and Isis, among others.
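
Point 4 can be illustrated with a small toy, assuming a state machine whose state is just a running sum of the messages. SummarizingLog and its methods are hypothetical names chosen for this sketch (getState here only echoes the callback name from the discussion and does not reflect its real signature).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy log that folds old messages into a summary so it can stay bounded. */
final class SummarizingLog {
    private long snapshotState = 0;  // summary of every truncated message
    private long snapshotIndex = 0;  // how many messages the summary covers
    private final List<Long> tail = new ArrayList<>();

    void append(long delta) {
        tail.add(delta);
    }

    /** Folds the retained messages into the summary and drops them. */
    void summarize() {
        for (long d : tail) {
            snapshotState += d;
        }
        snapshotIndex += tail.size();
        tail.clear();
    }

    /** Current state: the summary plus the messages still in the log. */
    long getState() {
        long state = snapshotState;
        for (long d : tail) {
            state += d;
        }
        return state;
    }

    /**
     * A follower that has seen followerIndex messages can be caught up by
     * replaying the log only if none of the messages it lacks were truncated.
     * Otherwise it needs a state transfer.
     */
    boolean canReplayFrom(long followerIndex) {
        return followerIndex >= snapshotIndex;
    }
}
```

A follower that has seen nothing cannot be served from the truncated log and has to receive the summarized state instead, which is the case Ben describes.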

      Andrew Carman added a comment -

      Harvey Mudd Clinic Team writes:

      1) Is the assignment of zxid by Zab asynchronous? Otherwise, when calling propose we are going to have to wait for the proposal to be routed to the Zab leader, have it assign a zxid, return to the follower Zab, then return to the client, which seems like it could take a while if the connection is slow, or the leader gets bogged down.

      2) What's the status of the proposal when propose returns? Just that the Zab leader has seen it or do we know it's been delivered?

      3) The existing API doesn't appear to have any way to detect when another server comes online or goes offline. This seems like functionality that'd be useful in implementing ephemeral nodes. One possible solution might be to generalize the status() callback to indicate status changes to any node, not just the current node and leader.

      4) Do the getState() and setState() callbacks really belong in the Zab API, or do they actually belong in the (so far theoretical) logging module? Whose state, exactly, is being transferred and stored?

      Andrew Carman added a comment -

      Flavio writes:

      1- It seems right to me to have Zab assign zxids to requests and return the zxid from propose;

      2- I see it as being the responsibility of Zab to guarantee that messages are correctly delivered by everyone or no one, so I would say that it performs the logging. However, I was under the impression that we don't log message deliveries. Instead, we log when a server acks a proposal, and this is the information we use to recover. I'm just not sure about burying it into Zab. It might be best to have it as a separate module;

      3- The proposal must have a zxid assigned as this is the return value. Method "deliver" is a callback;

      4- We don't transfer the whole history of operations up to a point. A leader either transfers a snapshot or sends the difference of the transaction log to a follower (check FollowerHandler.run() and Follower.followLeader());

      5- After some discussion, we thought that it would be best to constrain any atomic broadcast implementation we use to be leader-based. In this case, we have a call to Zab to get the leader and we use it for the service as well.
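
      The choice in point 4 between sending a snapshot and sending a log difference could be sketched as below. This is not the logic in FollowerHandler.run(); the class SyncPlanner, its methods, and the rule it applies (send a diff only when every transaction the follower is missing is still in the retained log) are assumptions made for illustration. The sketch also assumes zxids are contiguous integers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical helper that decides how a leader brings a follower up to date. */
class SyncPlanner {
    enum Mode { DIFF, SNAPSHOT }

    /** Retained tail of the transaction log, keyed by zxid. */
    private final TreeMap<Long, byte[]> retainedLog = new TreeMap<>();
    private long lastZxid = 0;

    void append(long zxid, byte[] txn) {
        retainedLog.put(zxid, txn);
        lastZxid = zxid;
    }

    /** Drop entries up to and including zxid, as if a snapshot now covers them. */
    void truncateThrough(long zxid) {
        retainedLog.headMap(zxid, true).clear();
    }

    /** A diff is enough only if nothing the follower needs has been dropped. */
    Mode choose(long followerLastZxid) {
        long oldestNeeded = followerLastZxid + 1;
        boolean nothingMissing = oldestNeeded > lastZxid;
        boolean logCoversGap =
                !retainedLog.isEmpty() && retainedLog.firstKey() <= oldestNeeded;
        return (nothingMissing || logCoversGap) ? Mode.DIFF : Mode.SNAPSHOT;
    }

    /** Zxids of the transactions a follower would receive in DIFF mode. */
    List<Long> diffFor(long followerLastZxid) {
        return new ArrayList<>(retainedLog.tailMap(followerLastZxid, false).keySet());
    }
}
```

      For example, with transactions 1 to 10 appended and the log truncated through 5, a follower at zxid 7 would get a diff of 8, 9 and 10, while a follower at zxid 0 would need a snapshot.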

      Andrew Carman added a comment -

      Harvey Mudd Clinic Team writes:

      1) How do you suggest we deal with zxids? They are integral to Zab and ZK, but they are currently assigned by ZooKeeperServer. We think they should be moved into quorum. Is this what you are suggesting by indicating that "propose(byte message[])" returns a zxid (Zab transaction identifier)?

      2) Should Zab handle writing the log to a file or is that ZK's responsibility when "deliver" is called?

      3) How does propose work, more specifically? What is the state of the proposal when propose returns? Does it return as soon as the Zab leader has seen it (and assigned a zxid), or does it wait for the proposal to be accepted (so that deliver never needs to be called)?

      4) How does the state transfer work? Are we transferring the entire log up to that point? Some snapshot?

      5) We're also unclear on how the leader (both of ZK and Zab) is going to work. I think this may be up in the air (based on the Jira bug) still - in our minds the option that makes the most sense is to get rid of the ZK leader, and just have a Zab leader. Any further thoughts on this?

      Andrew Carman added a comment -

      We've been having some discussion via email about this issue, and we thought we should move it to the JIRA, so I've copy/pasted all the questions and responses below.

      Flavio Junqueira added a comment -

      Currently, the atomic broadcast that ZooKeeper uses is deeply embedded into our implementation. This makes it difficult to change and evaluate the protocol separately. It is desirable then to have hooks for an atomic broadcast implementation to separate the service logic from the protocol.

      I have one main concern, though. The service is highly dependent upon a leader as the leader receives operations to propose from followers and it keeps track of sessions. We then have three choices with respect to the leader:

      • Eliminate the leader and distribute all functionality across the ZooKeeper servers. In this case, an atomic broadcast might use a leader or not;
      • Keep the service leader, but separate it from the atomic broadcast. In this case, the atomic broadcast might still require no single leader;
      • Force atomic broadcast protocols to use a leader, and the ZooKeeper leader to be the same as the AB one.

      Check the tutorial of Defago et al. [1] for examples of protocols that do not rely upon a fixed sequencer. To me, the second seems easier to implement given the code base we have, but the first seems cleaner with respect to design.

      Currently, the protocol is implemented mostly in Leader.java and Follower.java (Package org.apache.zookeeper.server.quorum). Here is a quick summary of how the code flows:

      1. It starts by proposing an operation on ProposalRequestProcessor.java: line 65, zks.getLeader().propose(request);
      2. Leader.propose(request) sends a Leader.PROPOSAL to every follower: line 560 followed by 503, sendPacket(pp). It also adds the request to Leader.outstandingProposals;
      3. Follower receives a Leader.PROPOSAL: line 223 of Follower.java, case Leader.PROPOSAL;
      4. Follower sends a Leader.ACK on SendAckRequestProcessor.processRequest(): line 40;
      5. FollowerHandler on the leader server receives Leader.ACK and invokes Leader.processAck(): line 295 of FollowerHandler.java, and line 373 of Leader.java. If there are enough acks for the next expected proposal, then the leader commits;
      6. Leader server commits by removing the head of Leader.outstandingProposals and sending a Leader.COMMIT message to followers: line 519 of Leader.java;
      7. Follower receives and processes the commit message: line 237 of Follower.java, and line 122 of FollowerZooKeeperServer.java.
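
      The leader-side bookkeeping in steps 2, 5 and 6 can be sketched as a small simulation. This is not the code in Leader.java: the class ToyLeader is hypothetical, networking is left out entirely, and two details are assumptions, namely that the leader (server id 0 here) counts its own ack and that later proposals which already hold a quorum are committed as soon as they reach the head of the queue. The names outstandingProposals and processAck are borrowed from the summary above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory model of the leader's propose/ack/commit bookkeeping. */
class ToyLeader {
    private final int ensembleSize;
    /** zxid -> ids of servers that have acked it; ordered so the head is the oldest. */
    private final TreeMap<Long, Set<Integer>> outstandingProposals = new TreeMap<>();
    /** Zxids in the order a COMMIT would have been sent to followers. */
    final List<Long> committed = new ArrayList<>();
    private long nextZxid = 1;

    ToyLeader(int ensembleSize) { this.ensembleSize = ensembleSize; }

    /** Step 2: record the proposal as outstanding (sending PROPOSAL is omitted). */
    long propose() {
        long zxid = nextZxid++;
        Set<Integer> acks = new HashSet<>();
        acks.add(0); // assumption: the leader, server id 0, acks its own proposal
        outstandingProposals.put(zxid, acks);
        return zxid;
    }

    /** Steps 5 and 6: count the ack, then commit from the head while a quorum holds. */
    void processAck(long zxid, int serverId) {
        Set<Integer> acks = outstandingProposals.get(zxid);
        if (acks == null) {
            return; // unknown or already committed
        }
        acks.add(serverId);
        while (!outstandingProposals.isEmpty()
                && outstandingProposals.firstEntry().getValue().size() > ensembleSize / 2) {
            committed.add(outstandingProposals.pollFirstEntry().getKey());
        }
    }
}
```

      With an ensemble of three, an ack for the second proposal alone commits nothing, because the head of the queue is still short of a quorum. Once the first proposal receives a follower's ack, both are committed in zxid order.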

      [1] X. Défago, A. Schiper, and P. Urban, "Total order broadcast and multicast algorithms: Taxonomy and survey", ACM Computing Surveys, Volume 36, Issue 4, Pages 372-421, December 2004.


        People

        • Assignee:
          Mahadev konar
          Reporter:
          Patrick Hunt
        • Votes:
          1
          Watchers:
          6
