ZooKeeper / ZOOKEEPER-866

Adding a no-disk-persistence option in ZooKeeper.

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 3.5.0
    • Component/s: None
    • Labels: None

      Description

      It has been seen that some folks would like to use ZooKeeper for very fine-grained locking. In their use case they are also fine with losing all old ZooKeeper state if ZooKeeper reboots or goes down. The use case is more of a runtime locking scheme wherein forgetting the state of locks is acceptable in case of a ZooKeeper reboot. Not logging to disk allows higher throughput and lower latency on writes to ZooKeeper. This would be a configuration option (of course the default would be logging to disk).
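      A minimal sketch of how such an option might look in zoo.cfg. The "diskPersistence" name is hypothetical (no released ZooKeeper ships such an option); the closest real knobs (forceSync, snapCount, dataDir/dataLogDir placement) come up in the comments below:

          # zoo.cfg - hypothetical sketch; diskPersistence is NOT a real ZooKeeper option
          tickTime=2000
          clientPort=2181
          dataDir=/var/lib/zookeeper       # still required (e.g. for the myid file)
          diskPersistence=false            # hypothetical: keep all znode state in memory only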


          Activity

          Germán Blanco added a comment -

          I wouldn't recommend running ZooKeeper on a ramdisk without an additional method that ensures consistency. Ramdisk means data is not persistent after a reboot, and this JIRA shows that losing data in one node while the quorum was in minority may lead to permanent inconsistencies in the quorum.

          Graham added a comment -

          We ran some benchmarks using zk-latencies.

          We tried two ways to boost performance: using a RAM disk (tmpfs) and using libeatmydata (which makes all file system sync operations no-ops).

          libeatmydata benchmarks: http://pastebin.com/cNLjfhPG

          Ramdisk (tmpfs) benchmarks: http://pastebin.com/mfe92nXn (note: a different box from the previous run)

          In summary: synchronous calls are boosted by two orders of magnitude with either libeatmydata or ramdisk, in both standalone and clustered mode. Asynchronous calls are boosted by a factor of 2 or 3.

          For tests, simulations, etc., a ZooKeeper without snapshots or logs makes a lot of sense, but for production use the ramdisk and eatmydata options both look pretty good.

          Another thing we found works well is a battery-backed RAID array; writes go to the RAID cache and sync to disk eventually.
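          For reference, a sketch of the two setups benchmarked above; the paths and tmpfs size are illustrative:

              # Ramdisk variant: mount a tmpfs, then point dataDir/dataLogDir at it in zoo.cfg
              sudo mkdir -p /mnt/zk-ram
              sudo mount -t tmpfs -o size=512m tmpfs /mnt/zk-ram

              # libeatmydata variant: run the server with fsync()/fdatasync() made no-ops
              eatmydata bin/zkServer.sh start-foreground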

          Matthew Krenzer added a comment -

          I'd like to second the request for this feature for another reason. We have a large set of simulations and tests where ZooKeeper is a small but important part. We use a server instance embedded in the tests where necessary, and we want to be able to run many of these tests on the same box at the same time. Currently we have to muck about creating unique directories for each instance of each test to ensure there are no conflicts, and do a bunch of work to clean up all the droppings regardless of how the tests exit. It's a minor hassle, but it would be really nice if we could configure ZooKeeper to not need any of those files.
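          One way to approximate this today is Apache Curator's TestingServer (a separate project, not a core ZooKeeper feature); it still touches disk, but each embedded instance gets its own throwaway port and data directory. A minimal sketch:

              // Each TestingServer picks a random free port and a temp dataDir,
              // and deletes the directory again on close(), however the test exits.
              import org.apache.curator.test.TestingServer;

              public class EmbeddedZkExample {
                  public static void main(String[] args) throws Exception {
                      try (TestingServer zk = new TestingServer()) {
                          // run test logic against zk.getConnectString(), e.g. "127.0.0.1:54321"
                          System.out.println("ZooKeeper up at " + zk.getConnectString());
                      } // temp directory removed here
                  }
              }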

          Per Steffensen added a comment -

          If it is really true that writing to disk is as fast as not writing to disk, that indicates to me that something could be optimized in the case where you do not have to write to disk. I would still vote for the feature, because I simply do not believe that it cannot be made faster when you do not have to write to disk than when you do. But maybe, if I need it badly enough, I will have to make it myself, perhaps including some analysis of why not writing to disk is no faster than writing to disk.
          Even if it cannot be made faster, it would still be a nice feature, because I believe it is important that an API allows users to indicate their true intention - and using EPHEMERAL really indicates something other than what you would indicate with a NON_PERSISTENT option.
          But thanks for your reply.
          Regards

          Mahadev konar added a comment -

          @Per,
          I didn't. What I found was that the throughput when writing to disk was as good as the throughput with no persistence, so I didn't bother getting this in.

          Per Steffensen added a comment -

          Did you ever add support for non-persistent "entries" in ZK? I believe it would be useful in a lot of common scenarios. I could certainly use it.

          Greg Moulliet added a comment -

          I'm also interested in a feature to decrease latency, which seems like what this patch might do.
          We'd like to use ZK for high-throughput, low-latency, temporary storage.

          I've been running some tests with an ensemble in EC2, and noticed a significant decrease in latency (50%) from separating dataDir and dataLogDir. However, when I switched from physical disk to ramdisk, the latency didn't change, agreeing with Mahadev that logging isn't the bottleneck.
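          The separation mentioned above is ordinary zoo.cfg configuration (paths are illustrative); putting the transaction log on its own device keeps snapshot writes from competing with the fsync'd log:

              # zoo.cfg
              dataDir=/disk1/zookeeper         # snapshots (and the myid file)
              dataLogDir=/disk2/zookeeper-txn  # transaction log, ideally a dedicated device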

          Mahadev konar added a comment -

          Maya/Jürgen,
          I am happy to see good results. Some folks wanted to use ZK for fine-grained locking and I had done this as a hack for them, but they did some extensive experiments and suggested to me that it did not help them. Please do run some extensive tests to make sure it actually helps. I'd be happy to get this into the 3.4 release (though it needs a little more work).

          Maya D added a comment -

          From some preliminary experiments we found the transaction log I/O to be a bottleneck. I'm now experimenting with putting the transaction log on a 256M RAM disk, setting forceSync=no, and bumping snapCount up to 500K or 1M. In our case we don't care about persistence, and we are fine with losing state on a server crash.
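          Spelled out, that experiment looks roughly like this; the mount point is illustrative, while forceSync and snapCount are real settings (forceSync is documented among ZooKeeper's unsafe options):

              # 256M tmpfs for the transaction log
              sudo mount -t tmpfs -o size=256m tmpfs /mnt/zk-txnlog

              # zoo.cfg
              dataLogDir=/mnt/zk-txnlog
              snapCount=500000

              # JVM system-property form of the fsync override
              -Dzookeeper.forceSync=no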

          Jürgen Schumacher added a comment -

          Ok, thanks for the answer. I have to admit that we are abusing ZooKeeper and don't have a separate cluster or separate disks for it, and in such use cases the patch does increase the throughput greatly (sorry for this). So I'd like to ask whether the data loss is just a problem of the patch that can probably be fixed (relatively) easily, or whether ZooKeeper really does need the persistence to provide reliability. My (and my colleagues') impression from reading the docs and some mails on the user lists was that the persistence (apart from keeping the state across complete system restarts, of course) is good for faster recovery when a server crashes and reintegrates with the ensemble, but that in principle it should work without any persistence at all by reading the state from the other servers, because all the necessary data is in memory anyway. Was this a misunderstanding?

          Mahadev konar added a comment -

          Jürgen/Maya,
          We had experimented with this patch a lot and realized that the throughput does not change much without logging to disk. The numbers were almost the same as when logging to disk; logging to disk wasn't a bottleneck. We had been trying to find out what might increase the throughput but didn't get a chance to work through it.

          Also, with the patch, as Jürgen said, a crash of the leader server will bring down the whole cluster.

          Jürgen Schumacher added a comment -

          Hi, I tried this patch with our application on ZooKeeper 3.3.3, because we do not care about persistence of data across complete system restarts, but we need reliability if only single ZooKeeper servers crash and restart later. Is it correct that with this patch the ZooKeeper ensemble loses all currently stored data when just the leader server crashes or is killed (our test ensemble consists of 5 nodes)? I would have expected that each follower has the complete current data in memory and can continue to work on it when it becomes the new leader. Or is this assumption wrong? Thanks.

          Maya D added a comment -

          Mahadev, any plans to add that configuration option and incorporate the patch into the next ZooKeeper release? I would like to use ZooKeeper with fine-grained locking to prevent cache stampedes in a high-concurrency application.

          Mahadev konar added a comment -

          Thomas, we have spent the last 3 years optimizing the throughput and latency of ZooKeeper. I think we have reached the point of diminishing returns with this. I agree that on the usability front you have a point, but making it usable is orthogonal to what I propose here; both can take different directions. I am just trying to open a new area of usage for ZooKeeper. Does that make sense?

          Thomas Koch added a comment -

          Thinking more about this, I'd consider this requirement a premature optimization. Wouldn't it be much more important to make ZK rock-solid stable and well documented (also the code), and to polish the usability of ZK, before worrying about speed?
          Are you sure that there's no other way to solve the issue at hand? Instead of fine-grained locking, maybe it'd be possible to lock packages at once. Or some kind of consistent hashing: by placing a node on a consistent hashing ring, I lock everything until the next node.
          It'd be interesting to see the use cases that require such fine-grained locking, if that'd be possible.

          Flavio Junqueira added a comment -

          I have a few concerns about your proposals, Thomas. Making just part of the state persistent is prone to error, since failure and recovery of enough replicas may lead to different and inconsistent views of the ZooKeeper state over time. Also, would we enforce that a child is persistent only if its parent is persistent? Delaying persistence of data could also cause inconsistent views of the ZooKeeper state over time (some data disappearing).

          Alternatively, we could simply assume that at no time is there more than a quorum of crashed servers, and if you're happy with this assumption, then your modifications wouldn't be necessary. In fact, I believe this is an assumption that Mahadev's proposal makes to guarantee correct behavior.

          One feature that might be useful is a direct command to a ZooKeeper server to have it dump its state to disk. Suppose that I'm operating a purely in-memory ensemble as proposed in this JIRA, and I want to shut down all servers for maintenance but don't want to lose my state. I could stop all clients, command the servers to dump their state, and shut them down gracefully.

          Thomas Koch added a comment -

          Just 2c: I'm going to use ZK too for some medium-grained locking, locking the domains a crawler is working on.

          • It would be very fine if I could specify durability for every single create. This way I could use the same ZK instance for fine-grained locking as well as for system configuration. Maybe another flag besides PERSISTENT | EPHEMERAL: DURABLE? (See the sketch after this list.)
          • Actually there are more shades of grey than just durable and non-durable: it'd be possible that less important locks are only flushed every few minutes. But maybe this would make things too complicated?
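          A sketch of the API axis proposed in these two points; it is purely hypothetical and not part of ZooKeeper, whose CreateMode today only distinguishes persistent vs. ephemeral (plus sequential variants):

              // Hypothetical per-create durability flag, orthogonal to CreateMode.
              // None of these constants exist in the real ZooKeeper API.
              enum Durability {
                  DURABLE,   // fsync'd to the transaction log before the create is acked
                  LAZY,      // flushed on a timer, e.g. every few minutes
                  VOLATILE   // memory only; lost if the whole ensemble restarts
              }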
          Mahadev konar added a comment -

          Here is a patch that I had worked on. This is not a complete patch, since it changes the code directly rather than making this a configurable option. I will be creating another patch wherein there is a configuration option for this.

          Feedback is welcome.


            People

            • Assignee: Mahadev konar
            • Reporter: Mahadev konar
            • Votes: 6
            • Watchers: 11
