Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: c client
    • Labels:
      None

      Description

      Motivation:
      1. Reduce the impact of client restarts on zookeeper by implementing a persisted cache, and only fetching deltas on restart
      2. Reduce unnecessary calls to zookeeper.
      3. Improve performance of gets by caching on the client
      4. Allow for larger caches than an in-memory cache permits.

      Behavior Change:
      Zookeeper clients will now have the option to specify a folder path where the library can cache zookeeper gets. If they do choose to cache results, the zookeeper library will check the persisted cache before actually sending a request to zookeeper. Watches will automatically be placed on all gets in order to invalidate the cache. Alternatively, we can add a cache flag to the get API - thoughts? On reconnect or restart, zookeeper clients will check the version number of each entry in their persisted cache and invalidate any out-of-date entries. While checking version numbers, zookeeper clients will also place a watch on those files. With regard to watches, client watch handlers will not fire until the invalidation step is completed, which may slow down client watch handling. Since setting up watches on all files is necessary on initialization, initialization will likely slow down as well.

      API Change:
      The zookeeper library will expose a new init interface that specifies a folder path to the cache. A new get API will specify whether or not to use cache, and whether or not stale data is safe to return if the connection is down.
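
      For concreteness, here is a minimal sketch of what that surface might look like; the names and exact parameters below are placeholders, not settled signatures:

      #include <zookeeper/zookeeper.h>

      /* Like zookeeper_init(), but also takes the folder used for the persisted
       * cache; a NULL cache_root_path would disable caching entirely. */
      zhandle_t *zookeeper_init_with_cache(const char *host, watcher_fn watcher,
                                           int recv_timeout, const clientid_t *clientid,
                                           void *context, int flags,
                                           const char *cache_root_path);

      /* Like zoo_get(), but with per-call control over whether the persisted cache
       * may be consulted and whether stale data may be returned while disconnected. */
      int zoo_get_cached(zhandle_t *zh, const char *path, int watch,
                         char *buffer, int *buffer_len, struct Stat *stat,
                         int use_cache, int allow_stale);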

      Design:
      The zookeeper handler structure will now include a cache_root_path (possibly null) string used to cache all gets, as well as a bool for whether or not it is okay to serve stale data. Old API calls will default to a null path (which signifies no cache) and to not serving stale data.
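
      As an illustration only (field names are placeholders), the extra per-handle state might look like:

      /* Illustrative only: extra per-handle cache state implied by the design above. */
      struct zk_cache_config {
          char *cache_root_path;    /* NULL means no cache (the old behavior)          */
          int   allow_stale_reads;  /* 0 means error out while disconnected (default)  */
      };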

      The cache will be located at a cache_root_path. All files will be placed at cache_root_path/file_path. The cache will be an incomplete copy of everything that is in zookeeper, but everything in the cache will have the same relative path from the cache_root_path that it has as a path in zookeeper. Each file in the cache will include the Stat structure and the file contents.
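
      As a rough sketch of that layout, assuming a simple format of the Stat structure followed by the raw data (the proposal does not prescribe a serialization format):

      #include <stdio.h>
      #include <zookeeper/zookeeper.h>

      /* Write one cached znode to cache_root_path/znode_path: the Stat first,
       * then the contents. Parent directories are assumed to already exist. */
      static int cache_write_file(const char *cache_root_path, const char *znode_path,
                                  const struct Stat *stat, const char *data, int len) {
          char file_path[1024];
          if (snprintf(file_path, sizeof(file_path), "%s%s",
                       cache_root_path, znode_path) >= (int)sizeof(file_path))
              return -1;                           /* path too long for this sketch */
          FILE *f = fopen(file_path, "wb");
          if (f == NULL)
              return -1;
          fwrite(stat, sizeof(*stat), 1, f);       /* Stat structure */
          fwrite(data, 1, (size_t)len, f);         /* znode contents */
          fclose(f);
          return 0;
      }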

      zoo_get will check the zookeeper handler to determine whether or not it has a cache. If it does, it will first build the cache file path by appending the get path to the cache root path. If the file exists and has not been invalidated, the zookeeper client will read it and return its value. If the file does not exist or has been invalidated, the zookeeper library will perform the same get as is currently designed. After getting the results, the library will place the value in the persisted cache for subsequent reads. zoo_set will automatically invalidate the path in the cache.
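
      In rough C, the read path could look like this sketch (zk_has_cache, cache_read and cache_store are hypothetical helpers, not existing client functions):

      #include <zookeeper/zookeeper.h>

      /* Hypothetical helpers over the persisted cache. */
      int  zk_has_cache(zhandle_t *zh);
      int  cache_read(zhandle_t *zh, const char *path, char *buf, int *len, struct Stat *stat);
      void cache_store(zhandle_t *zh, const char *path, const struct Stat *stat,
                       const char *buf, int len);

      /* Sketch of the lookup order: valid cache entry first, otherwise a normal
       * zoo_get (with a watch) whose result repopulates the cache. */
      int zoo_get_sketch(zhandle_t *zh, const char *path,
                         char *buffer, int *buffer_len, struct Stat *stat) {
          if (zk_has_cache(zh) && cache_read(zh, path, buffer, buffer_len, stat) == 0)
              return ZOK;                                   /* fresh hit, no round trip */
          int rc = zoo_get(zh, path, 1 /* watch */, buffer, buffer_len, stat);
          if (rc == ZOK && zk_has_cache(zh))
              cache_store(zh, path, stat, buffer, *buffer_len);  /* populate for later reads */
          return rc;
      }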

      If caching is requested, then on each zoo_get that goes through to zookeeper, a watch will be placed on the path. A cache watch handler will handle all watch events by invalidating the cache, and placing another watch on it. Client watch handlers will handle the watch event after the cache watch handler. The cache watch handler will not call zoo_get, because it is assumed that the client watch handlers will call zoo_get if they need the fresh data as soon as it is invalidated (which is why the cache watch handler must be executed first).
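
      A minimal sketch of such a cache watch handler, assuming a hypothetical cache_invalidate helper; it re-arms the watch with zoo_wexists so that no data fetch is needed:

      #include <zookeeper/zookeeper.h>

      /* Hypothetical helper that marks the persisted copy of 'path' as invalid. */
      void cache_invalidate(zhandle_t *zh, const char *path);

      /* Invalidate on change/delete, then re-arm the watch without reading data.
       * Client watch handlers for the same event run only after this returns. */
      void cache_watcher(zhandle_t *zh, int type, int state,
                         const char *path, void *ctx) {
          if (type == ZOO_CHANGED_EVENT || type == ZOO_DELETED_EVENT) {
              cache_invalidate(zh, path);
              if (type == ZOO_CHANGED_EVENT) {
                  struct Stat stat;
                  zoo_wexists(zh, path, cache_watcher, ctx, &stat);  /* set the next watch */
              }
          }
      }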

      All updates to the cache will be done on a separate thread, but will be queued in order to maintain consistency in the cache. In addition, all client watch handlers will not be fired until the cache watch handler completes its invalidation write in order to ensure that client calls to zoo_get in the watch event handler are done after the invalidation step. This means that a client watch handler could be waiting on SEVERAL writes before it can be fired off, since all writes are queued.
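
      A bare-bones sketch of that single writer queue (structures and names are hypothetical): the worker applies cache writes in FIFO order and only afterwards hands the event to the client's watch handler.

      #include <pthread.h>
      #include <stdlib.h>

      struct cache_op {
          void (*apply)(void *arg);      /* cache write or invalidation                       */
          void (*deliver)(void *arg);    /* client watch callback to fire afterwards, or NULL */
          void *arg;
          struct cache_op *next;
      };

      struct cache_queue {
          pthread_mutex_t lock;
          pthread_cond_t  nonempty;
          struct cache_op *head, *tail;
      };

      static void *cache_worker(void *q_) {
          struct cache_queue *q = q_;
          for (;;) {
              pthread_mutex_lock(&q->lock);
              while (q->head == NULL)
                  pthread_cond_wait(&q->nonempty, &q->lock);
              struct cache_op *op = q->head;
              q->head = op->next;
              if (q->head == NULL)
                  q->tail = NULL;
              pthread_mutex_unlock(&q->lock);
              op->apply(op->arg);                      /* update the persisted cache first   */
              if (op->deliver != NULL)
                  op->deliver(op->arg);                /* only now fire the client's handler */
              free(op);
          }
          return NULL;
      }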

      When a new connection is made, if a zookeeper handler has a cache, then that cache will be scanned in order to find all leaf nodes. Calls will be made to zookeeper to check if all of these nodes still exist, and if they do, what their version number is. Any inconsistencies in version will result in the cache invalidating the out of date files. Any files that no longer exist will be deleted from the cache.
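
      Per cached leaf node, the reconnect step might look like the following sketch, which uses zoo_wexists to fetch the current Stat and re-arm a watch in one call (the cache_* helpers and cache_watcher are the hypothetical pieces from the sketches above):

      #include <zookeeper/zookeeper.h>

      /* Hypothetical helpers over the on-disk cache. */
      int  cache_cached_version(zhandle_t *zh, const char *path);
      void cache_invalidate(zhandle_t *zh, const char *path);
      void cache_delete(zhandle_t *zh, const char *path);
      void cache_watcher(zhandle_t *zh, int type, int state, const char *path, void *ctx);

      /* Re-validate one cached leaf node after (re)connecting. */
      static void revalidate_node(zhandle_t *zh, const char *path) {
          struct Stat stat;
          int rc = zoo_wexists(zh, path, cache_watcher, NULL, &stat);
          if (rc == ZNONODE) {
              cache_delete(zh, path);                              /* znode no longer exists      */
          } else if (rc == ZOK && stat.version != cache_cached_version(zh, path)) {
              cache_invalidate(zh, path);                          /* out of date; refetch lazily */
          }
      }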

      If a connection fails, and a zoo_get call is made on a zookeeper handler that has a cache associated with it, and that cache tolerates stale data, then the stale data will be returned from cache - otherwise, all zoo_gets will error out as they do today.

        Activity

        Mahadev konar added a comment -

        @Marc,

        Can you elaborate on the use case for this? What are the issues you are facing that create a need for client-side caching? Also, on a restart won't the client cache be invalid? Do you plan to persist the session and make sure you restart within the session expiry?

        Marc Celani added a comment -

        The main goal of this feature is to reduce the load to zookeeper, both on restart and during regular use. The persisted cache will include the version of each node that it stores. We can check the version of all the nodes in our persisted cache on restart, and invalidate only the nodes that are out of date (so no, the entire cache is not invalidated - only the nodes that have changed). This change requires that there be a methodology for setting watches without calling zoo_get, but this is reasonable. The session does not need to be persisted.

        Flavio Junqueira added a comment -

        Thanks for your proposal, Marc. I have a few comments about it.

        This proposal states that the goal is to reduce the traffic to zookeeper, but I'm not sure how you achieve it. You seem to assume that applications execute gets without setting watches for znodes they access frequently. I would think that any carefully designed application will set watches on the znodes for which it needs to know of changes. In that case, your proposal becomes a facility to help the application manage its cached data. Is that the case? Does this comment make sense to you? Do you see it differently or have a use case?

        Assuming that it still makes sense, can't you implement it as a layer on top without modifying the API?

        Marc Celani added a comment -

        Flavio,
        Absolutely, a cache could easily be built on top of the API with watches. But with the current API, restarts invalidate the cache (a concern Mahadev brought up - but that's exactly what this API change would solve).

        The API that needs to be exposed in order to build a persisted cache on top of the c client is a call that, in one transaction, returns the version of the node and sets a get watch. That way, the client can use the returned version to invalidate its cache if it's out of date, and is now listening for changes.

        The reason I think the persisted cache should be implemented at the zookeeper level, rather than asking applications to build it on top of zookeeper, is that it's a common feature of many zookeeper clients.

        Vishal Kathuria added a comment -

        Mahadev, Flavio

        Thanks for reviewing Marc's proposal.

        Here is some more info on the use case.
        We are planning on storing the configuration about which servers a client should talk to in ZooKeeper.
        1. As Marc already mentioned, if a large number of clients start, the read traffic would max out the ZK NICs - so we cannot afford that. The time it takes to get all the config (could be up to 50M) is large too. And in most cases, most of the data hasn't changed.
        2. The clients of the service want the config to be available even if ZK is down (along with some staleness indicator).
        3. We debated whether to build this caching in ZK client or build it in the app on top of it. The same requirement surfaced in multiple applications and hence we decided to go for building it in the ZK client. Another way to think of it is that you have a locally running observer, the difference being that it is only observing the znodes you have read so far and it also supports a disconnected mode.

        Thanks!

        Flavio Junqueira added a comment -

        Hi Vishal, Marc, Thanks for giving more detail on your use case. It is an interesting use case, and it sounds fine to have a cache layer as you suggest, especially if data stored in zookeeper changes infrequently, there is a large number of clients, and the data fetched at startup can be on the order of megabytes.

        My main concern is the change to the API. I can't find a strong reason in your proposal to change the API, and it seems to be simply a convenient design choice. If there is no strong reason for changing the API, I would discourage such a change.

        Marc Celani added a comment -

        Thanks for your comments everyone. It sounds like the feedback is that the cache API is too invasive, and the use case too narrow to warrant the big changes. Below, I've included two alternative changes that we could make that are less invasive, and allow for a broader range of use cases. If you think these are acceptable, I can open a new JIRA and abandon this one.

        // returns last_zxid
        long long zoo_get_last_zxid(zhandle_t *zh);

        // Adds the watches to the internal hashtables
        // When connected, internal logic will send the watches, as if we are handling a reconnect.
        // paths_to_watch: list of paths we want to watch
        // watch_type: list of watch types
        // num_of_paths: length of the last two arrays
        // last_zxid: the last known zxid, which will be used to fire watches that would have
        // fired between last_zxid and what the true zxid is
        zookeeper_init2(const char *host, watcher_fn watcher, int recv_timeout,
                        const clientid_t *clientid, void *context, int flags,
                        char **paths_to_watch, int *watch_type, int num_of_paths,
                        long long last_zxid);

        If we persist our last seen zxid and watch list, we can treat a restart as if it were a prolonged disconnected state. Assuming that the client has a large set of data that does not change often, the client can persist locally and reduce traffic. Clients can build their cache on top of this API, and the changes are less invasive.
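
        For illustration, a client could drive these two calls roughly as follows on restart (the load_/save_ helpers and app_watcher are hypothetical, and the declarations assume zookeeper_init2 returns a zhandle_t* like zookeeper_init does):

        #include <zookeeper/zookeeper.h>

        /* Proposed additions (not in the current zookeeper.h). */
        long long  zoo_get_last_zxid(zhandle_t *zh);
        zhandle_t *zookeeper_init2(const char *host, watcher_fn watcher, int recv_timeout,
                                   const clientid_t *clientid, void *context, int flags,
                                   char **paths_to_watch, int *watch_type,
                                   int num_of_paths, long long last_zxid);

        /* Hypothetical persistence helpers and application watcher. */
        void load_watch_state(char ***paths, int **types, int *n, long long *zxid);
        void save_watch_state(char **paths, int *types, int n, long long zxid);
        void app_watcher(zhandle_t *zh, int type, int state, const char *path, void *ctx);

        void restart_with_persisted_cache(const char *hosts) {
            char **paths; int *types; int n; long long last_zxid;
            load_watch_state(&paths, &types, &n, &last_zxid);

            /* Watches that would have fired between last_zxid and the server's current
             * zxid are delivered on connect, so stale cache entries can be invalidated. */
            zhandle_t *zh = zookeeper_init2(hosts, app_watcher, 30000, NULL, NULL, 0,
                                            paths, types, n, last_zxid);

            /* ... normal operation; before shutdown, persist the state again ... */
            save_watch_state(paths, types, n, zoo_get_last_zxid(zh));
            zookeeper_close(zh);
        }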

        Flavio Junqueira added a comment -

        To confirm, your proposed API changes do not break compatibility, since they add calls to the API without changing existing calls. Is this right?

        Since this feature doesn't have to be implemented inside the client library, I would still prefer to have it implemented on top. This way it does not introduce more complexity to the library.

        Marc Celani added a comment -

        Flavio and Mahadev, We have given this a lot of thought and have come to the conclusion that this feature is very useful and belongs in the client, not on top of it.

        We chose to use Zookeeper because of its strict ordering requirements and versioning. We want to cache zookeeper responses to help us scale without losing these properties. Caching is a cheap and effective way of scaling zookeeper. Watches in zookeeper allow for a push model that suits a cache well. Zookeeper is almost screaming for people to cache on top of it. However, the semantics of the Zookeeper API actually make implementing a consistent cache difficult.

        After much thought, the only way in which we could arrive at a consistent cache (persisted or not) was to wrap the entire Zookeeper API, and to implement a request queue with a thread that checks cache in order and returns responses in order (misses go through to zookeeper, but still need to be returned in order). It also requires a separate watch manager from the one in the c client, since we need to set a cache watch on each and every operation (to maintain consistency), but still want to delegate the watch to a user who is interested in the watch as well. Considering that our wrapper is now implementing an input request queue, an output completion queue, a watch manager, and must check cache for each operation, we are essentially rewriting the entire zookeeper c client implementation.

        Given the amount of thought that we had to put into this to arrive at this conclusion, and the amount of work needed to actually implement it, it's very easy to produce bugs. We'd also have to maintain the code every time that zookeeper makes an API change.

        If it lived inside of the zookeeper client, we could guarantee ordering by doing all of our cache queueing within the I/O thread, and keeping in-memory data structures to know if we can serve a response from cache.

        We still believe that this is a feature that will greatly benefit the community. I don't see how zookeeper can scale without at some level implementing a persisted, consistent cache. In the case that it's not useful for whatever reason, it doesn't break backwards compatibility (the cache by default would be turned off).

        Let me know if there's anything else I can clarify to help make my case!

        Mahadev konar added a comment -

        Marc,
        Sorry, I've been a little busy with 3.4. I will definitely comment on the jira after reading/thinking through this.

        thanks


          People

          • Assignee: Marc Celani
          • Reporter: Marc Celani
          • Votes: 1
          • Watchers: 3
