Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Tracking JIRA for namespace partitioning in ZK

      From the mailing list (courtesy: Mahadev / Flavio), discussion during Jan 2010:

      "Hi, Mahadev said it all, we have been thinking about it for a while, but
      >> haven't had time to work on it. I also don't think we have a jira open for
      >> it; at least I couldn't find one. But, we did put together some comments:
      >>
      >> http://wiki.apache.org/hadoop/ZooKeeper/PartitionedZookeeper
      >>
      >> One of the main issues we have observed there is that partitioning will
      >> force us to change our consistency guarantees, which is far from ideal.
      >> However, some users seem to be ok with it, but I'm not sure we have
      >> agreement.
      >>
      >> In any case, please feel free to contribute or simply express your
      >> interests so that we can take them into account.
      >>
      >> Thanks,
      >> -Flavio
      >>
      >>
      >> On Jan 15, 2010, at 12:49 AM, Mahadev Konar wrote:
      >>
      >>> Hi Kay,
      >>> the namespace partitioning in ZooKeeper has been on the back burner for a
      >>> long time. There isn't any JIRA open on it. There had been some
      >>> discussions on this but no real work. Flavio/Ben have had this on their
      >>> minds for a while but no real work/proposal is out yet.
      >>>
      >>> May I know, is this something you are looking for in production?
      >>>
      >>> Thanks
      >>> mahadev
      "

        Activity

        Vishal Kathuria added a comment -

        @Alexander - is there a separate Jira for the MountRemoteZookeeper? This is exactly what we need for our scenario.
        We aren't worried about the update throughput at the moment. Our case is that we have quite a few ensembles that are created based on availability/throughput needs of different pieces of data. For example, data that needs to survive a region failure is stored in a global ensemble, but data that doesn't is stored in a regional ensemble.

        A client shouldn't have to make connections to multiple ensembles to get the data it is interested in. So a client connects to a nearby ensemble, which has mounted the data tree (ideally a subset) of other ensembles.
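
        A minimal sketch of the status quo described above, using the standard Java client; the connect strings and znode paths are hypothetical. With a mounted data tree the second session would disappear and the client would read both znodes through the nearby ensemble.

          import org.apache.zookeeper.ZooKeeper;

          public class TwoEnsembleClient {
              public static void main(String[] args) throws Exception {
                  // Hypothetical connect strings: one nearby regional ensemble, one global ensemble.
                  ZooKeeper regional = new ZooKeeper("regional-zk:2181", 30000, event -> { });
                  ZooKeeper global   = new ZooKeeper("global-zk:2181",   30000, event -> { });
                  try {
                      // Data that only needs to survive local failures lives in the regional ensemble...
                      byte[] local  = regional.getData("/app/regional/config", false, null);
                      // ...while data that must survive a region failure forces a second session.
                      byte[] shared = global.getData("/app/global/config", false, null);
                      System.out.printf("regional=%d bytes, global=%d bytes%n", local.length, shared.length);
                  } finally {
                      regional.close();
                      global.close();
                  }
              }
          }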

        Hari A V added a comment -

        Hi,

        Running multiple ZK instances works, but it comes with more complexity in process management.

        Some cons I can think of are:

        1. Different clients need to be configured with different ZooKeeper ensembles.
        2. To achieve HA, each application would need at least 3 ZooKeepers running. Even if we run multiple ZooKeepers on the same machine, the number of ZK instances will be considerably higher, resulting in more processes to manage. This is inevitable even if we run multiple ensembles as part of the same ZK cluster.

        • Hari
        Alexander Shraer added a comment -

        > to maintain semantics you have to dig into the core functionality

        Not really - every leader is in charge of local operations as usual, and an observer is in charge of remote operations.
        Obviously both proposals require some changes, but I actually think this one requires fewer changes, and can perhaps reuse development done for ZOOKEEPER-892.

        > remote failures cause pipeline stalls

        Only if you provide the prefix-failure property, and then a failure of a remote op would only stall operations of the client who requested this property (since remote ops don't go through the normal local pipeline, they don't stall it). But if you don't need ordering across partitions, then you probably also don't need this property...

        > and we have found that in practice when you do such partitioning you don't need ordering guarantees across partitions.

        This probably depends on the application, but if you don't need ordering among partitions I would just run multiple ZK instances.
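
        To make the local/remote split concrete, here is a small hypothetical routing sketch (plain Java, not ZooKeeper internals): operations whose path falls under a mounted prefix are handed to a forwarder playing the observer's role described above, while everything else stays in the ordinary local pipeline.

          import java.util.HashMap;
          import java.util.Map;

          public class MountRouter {
              // mount point -> connect string of the ensemble that owns that subtree (assumed values)
              private final Map<String, String> mounts = new HashMap<>();

              public void mount(String prefix, String remoteConnectString) {
                  mounts.put(prefix, remoteConnectString);
              }

              // Returns the remote connect string if the path falls under a mount (longest match),
              // or null if the operation should go through the local leader's pipeline as usual.
              public String route(String path) {
                  String best = null;
                  String target = null;
                  for (Map.Entry<String, String> e : mounts.entrySet()) {
                      String prefix = e.getKey();
                      if ((path.equals(prefix) || path.startsWith(prefix + "/"))
                              && (best == null || prefix.length() > best.length())) {
                          best = prefix;
                          target = e.getValue();
                      }
                  }
                  return target;
              }

              public static void main(String[] args) {
                  MountRouter router = new MountRouter();
                  router.mount("/global", "global-zk:2181");            // hypothetical remote ensemble
                  System.out.println(router.route("/global/config"));   // -> global-zk:2181 (forwarded)
                  System.out.println(router.route("/regional/leader")); // -> null (local pipeline)
              }
          }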

        Benjamin Reed added a comment -

        I think there are some rather fundamental problems with the MountRemoteZooKeeper proposal: to maintain semantics you have to dig into the core functionality, remote failures cause pipeline stalls, and we have found that in practice, when you do such partitioning, you don't need ordering guarantees across partitions. If you do give up on ordering across partitions, you avoid introducing further complications in the pipeline and you also get nice scalability of the writes.

        Alexander Shraer added a comment -

        Hi Hari,

        Here's another proposal on how to address this issue.

        http://wiki.apache.org/hadoop/ZooKeeper/MountRemoteZookeeper

        The idea is to preserve current ZK semantics as much as possible, unlike in the proposal above where no ordering guarantees are made between partitions. We also suggest a more intuitive interface to this, where you can "mount" some part of a remote ZK namespace.

        Alex
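
        A client-side sketch of what the mount interface might feel like; the actual proposal would do this server side, so everything here (the mountPoint and remoteRoot names, the two handles) is an assumption used only to illustrate the intended read semantics.

          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          public class MountedNamespace {
              private final ZooKeeper local;
              private final ZooKeeper remote;
              private final String mountPoint;  // e.g. "/mnt/global" in the local namespace
              private final String remoteRoot;  // e.g. "/shared" in the remote namespace (no trailing slash)

              public MountedNamespace(ZooKeeper local, ZooKeeper remote,
                                      String mountPoint, String remoteRoot) {
                  this.local = local;
                  this.remote = remote;
                  this.mountPoint = mountPoint;
                  this.remoteRoot = remoteRoot;
              }

              // Reads look like ordinary local reads; paths under the mount point are rewritten
              // and served by the remote ensemble, everything else by the local one.
              public byte[] getData(String path, Stat stat) throws Exception {
                  if (path.equals(mountPoint) || path.startsWith(mountPoint + "/")) {
                      String remotePath = remoteRoot + path.substring(mountPoint.length());
                      return remote.getData(remotePath, false, stat);
                  }
                  return local.getData(path, false, stat);
              }
          }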

        Hari A V added a comment -

        Hi Kay,

        I am looking forward to doing a prototype of this. I would be very much interested to know the practical use cases for PartitionedZookeeper which you have in mind. As per my understanding, the very high-level problem it tries to solve is write-throughput scalability, i.e. when we add more ZooKeeper nodes, we should be able to get more "write throughput".

        From "https://cwiki.apache.org/ZOOKEEPER/partitionedzookeeper.html"
        "By having distinct ensembles handling different portions of the state, we end up relaxing the ordering guarantees"
        How different is it from directly running separate ensembles? One could as well run a different ZooKeeper cluster to achieve this, right? Does the solution also address running multiple namespaces in, say, an existing 3-node ZooKeeper cluster?

        I can think of something like this:
        Currently, write operations from all clients are processed sequentially by the leader ZooKeeper. The suggestion is to provide a provision for parallel writes for unrelated data in the same ensemble. For example, in a cluster setup the same ZK ensemble may be used by HBase for its metadata and by other components for cluster configuration management. We don't need to queue these operations and perform them sequentially; they can go in parallel. But all HBase operations may still need to be sequential to keep the order of operations.

        Here (http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/) I found another idea of hash-based partitioning for ZooKeeper:
        "The solution we suggest is simply to run more than one ZooKeeper cluster for the purposes of locking and transactions, and simply to hash locks and transactions onto particular clusters".
        There they want to address locks. I am thinking of performing a hash on the "root nodes" themselves (or introducing a partition name) and performing operations in parallel in the ZK server (in most scenarios, znodes such as "/conf" and "/leaders" may be unrelated); see the sketch after this comment. It's more about running multiple partitions in the same ensemble, effectively making writes parallel in the leader of an ensemble. I still need to think more about the transaction-log and snapshotting aspects and how they will be affected.

        I would be glad to hear from you guys.

        • Hari
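
        A minimal sketch of the client-side version of the hash idea quoted above: hash a path's root node onto one of several ensembles (the connect strings are hypothetical). The in-server variant Hari describes, parallel commit pipelines inside one ensemble, is the part that would still need the transaction-log and snapshotting work he mentions.

          import java.util.Arrays;
          import java.util.List;

          public class RootNodeHashPartitioner {
              private final List<String> ensembles;  // one connect string per ensemble (assumed values)

              public RootNodeHashPartitioner(List<String> ensembles) {
                  this.ensembles = ensembles;
              }

              // "/conf/hbase" -> hash("conf") picks an ensemble; "/leaders/x" may land on another,
              // so unrelated root nodes no longer share a single write pipeline.
              public String ensembleFor(String path) {
                  String root = path.split("/")[1];  // first path component, e.g. "conf"
                  int bucket = Math.floorMod(root.hashCode(), ensembles.size());
                  return ensembles.get(bucket);
              }

              public static void main(String[] args) {
                  RootNodeHashPartitioner p = new RootNodeHashPartitioner(
                          Arrays.asList("zk-a:2181", "zk-b:2181", "zk-c:2181"));
                  System.out.println("/conf    -> " + p.ensembleFor("/conf/hbase"));
                  System.out.println("/leaders -> " + p.ensembleFor("/leaders/election"));
              }
          }
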
        Amre added a comment -

        This would definitely be a useful feature. However, we have a somewhat different motivation for namespace partitioning.

        We've been using ZooKeeper for our research LBS and are now doing some experiments with maintaining a consistent tree across multiple datacenters (in the US, Europe and Asia). Thus, we want to reduce latency by eliminating unnecessary round trips between, for example, EU- and US-based servers where possible. E.g., we can pre-assign particular branches of the tree (called "containers" on the wiki page, I believe) to different ensembles, with transparent rerouting when needed. So if a server in the EU needs to access a namespace located in the US, it should be forwarded to a different ZK ensemble, but hopefully most of the time these servers would coordinate with "local" ensemble(s) only. I was thinking about a simpler solution, e.g., just keeping distinct ensembles for different geographical zones and rerouting those calls that have a different "namespace owner". But then I found the wiki page describing PartitionedZookeeper and figured it might be the right approach.

        Please let me know if this aligns with the motivation you have in mind for PartitionedZookeeper. Do you have something in code for namespace partitioning? I'd be happy to contribute.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Karthik K
          • Votes:
            3
          • Watchers:
            7
