Details

      Description

      When building the distributed queue for my tutorial blog post, it was pointed out to me that there's a serious inefficiency here.

      Informally, the items in the queue are created as sequential nodes. For a 'dequeue' call, all items are retrieved and sorted by name by the client in order to find the name of the next item to try and take. This costs O( n ) bandwidth and O(n.log n) sorting time - per dequeue call! Clearly this doesn't scale very well.

      If the servers were able to maintain a data structure that allowed them to efficiently retrieve the children of a node in order of the zxid that created them this would make successful dequeue operations O( 1 ) at the cost of O( n ) memory on the server (to maintain, e.g. a singly-linked list as a queue). This is a win if it is generally true that clients only want the first child in creation order, rather than the whole set.

      We could expose this to the client via this API: getFirstChild(handle, path, name_buffer, watcher) which would have much the same semantics as getChildren, but only return one znode name.

      Sequential nodes would still allow the ordering of znodes to be made explicitly available to the client in one RPC should it need it. Although: since this ordering would now be available cheaply for every set of children, it's not completely clear that there would be that many use cases left for sequential nodes if this API was augmented with a getChildrenInCreationOrder call. However, that's for a different discussion.

      A halfway-house alternative with more flexibility is to add an 'order' parameter to getFirstChild and have the server compute the first child according to the requested order (creation time, update time, lexicographical order). This saves bandwidth at the expense of increased server load, although servers can be implemented to spend memory on pre-computing commonly requested orders. I am only in favour of this approach if servers maintain a data-structure for every possible order, and then the memory implications need careful consideration.

      [edit - JIRA interprets ( n ) without the spaces as a thumbs-down. cute.]

      1. ZOOKEEPER-423.patch
        37 kB
        Henry Robinson

        Issue Links

          Activity

          Hide
          Lukas added a comment -

          Hi Henry,
          I hope i can make it more clear
          Given a scenario of a job queue with multiple workers. Each worker shall process one job at the time and the jobs should be processed in an ordered fashion (lets say LIFO in this case, could also be FIFO). Each job is represented as a child node of some zookeeper node.
          Each worker tries to process the newest job which is currently not processed by another worker. For this, he retrieves the newest not locked child and creates his lock (=ephemeral+sequential node) under this child node (grand child so to say .
          If the worker has finished the job, he deletes the lock (grand child) and the child (yes, here could be a race, but this can be circumvented).

          In this scenario it is beneficial, if the worker can get the first N children and try to lock one of them as some of them are already locked by other workers.
          Also, getAndDelete is not appropriate, as in case of a worker failure (e.g. hardware failure) the job is not in the queue anymore, but also not finished which gives a zombie job.
          By locking, I mean the zookeeper locking pattern using ephemeral+sequential nodes.

          Best,
          Lukas

          Show
          Lukas added a comment - Hi Henry, I hope i can make it more clear Given a scenario of a job queue with multiple workers. Each worker shall process one job at the time and the jobs should be processed in an ordered fashion (lets say LIFO in this case, could also be FIFO). Each job is represented as a child node of some zookeeper node. Each worker tries to process the newest job which is currently not processed by another worker. For this, he retrieves the newest not locked child and creates his lock (=ephemeral+sequential node) under this child node (grand child so to say . If the worker has finished the job, he deletes the lock (grand child) and the child (yes, here could be a race, but this can be circumvented). In this scenario it is beneficial, if the worker can get the first N children and try to lock one of them as some of them are already locked by other workers. Also, getAndDelete is not appropriate, as in case of a worker failure (e.g. hardware failure) the job is not in the queue anymore, but also not finished which gives a zombie job. By locking, I mean the zookeeper locking pattern using ephemeral+sequential nodes. Best, Lukas
          Hide
          Henry Robinson added a comment -

          Thanks - I'm afraid I need a bit more clarification (I'm slow first thing in the morning ).

          For a stack, each worker can call getLastChild which will give a FIFO ordering. Actually locking the nodes is not covered by this patch, although we could look at doing getAndDelete[First|Last]Child.

          If a worker could get the last N children, that could bring a benefit in terms of being able to batch process some nodes. Is that what you're describing?

          Henry

          Show
          Henry Robinson added a comment - Thanks - I'm afraid I need a bit more clarification (I'm slow first thing in the morning ). For a stack, each worker can call getLastChild which will give a FIFO ordering. Actually locking the nodes is not covered by this patch, although we could look at doing getAndDelete [First|Last] Child. If a worker could get the last N children, that could bring a benefit in terms of being able to batch process some nodes. Is that what you're describing? Henry
          Hide
          Lukas added a comment -

          Hi Henry,
          One case would be to have a queue with many workers. Each worker tries to lock a node and does the job/work/task associated with it. If it has finished, it removes the node from the queue.
          For a stack like / last in, first out kind of queue this ordered top N api would afaic be beneficial.

          Best,
          Lukas

          Show
          Lukas added a comment - Hi Henry, One case would be to have a queue with many workers. Each worker tries to lock a node and does the job/work/task associated with it. If it has finished, it removes the node from the queue. For a stack like / last in, first out kind of queue this ordered top N api would afaic be beneficial. Best, Lukas
          Hide
          Henry Robinson added a comment -

          Lukas -

          Good suggestion. Could you describe the use case you're thinking of, so we can properly weigh up the idea?

          Thanks -

          Henry

          Show
          Henry Robinson added a comment - Lukas - Good suggestion. Could you describe the use case you're thinking of, so we can properly weigh up the idea? Thanks - Henry
          Hide
          Lukas added a comment -

          Wouldn't it also be beneficial to generalize this method for fetching the top N nodes instead of just the first one (i.e. allow an ordered access to the child nodes)?

          Show
          Lukas added a comment - Wouldn't it also be beneficial to generalize this method for fetching the top N nodes instead of just the first one (i.e. allow an ordered access to the child nodes)?
          Hide
          Henry Robinson added a comment -

          [Non-garbage patch this time]

          This is a draft patch for this issue. I'd appreciate feedback on the approach.

          I've added two API methods - getFirstChild and getLastChild. I've changed DataNode to keep its child list in a TreeSet, not a HashSet. This may be controversial because operations are now logarithmic in the number of children, not O(1), but the datatree's hashmap is still used for most path-based lookups.

          Both API methods throw an Empty exception if there are no children.

          I've decided not to address the issue of what to sort by here, and just defaulted to sorting by node name. The API leaves the door open to specifying a sort order at the time of parent node creation (i.e. another argument to create(...)).

          This patch obviously lacks tests, comments and general cleanup. You should be able to use a shell to call lsfirst and lslast.

          Any comments on the APIs would be great - would we rather one getOrderedChild API which took a flag describing whether to get the first or last? Or a getNthChild API?

          Show
          Henry Robinson added a comment - [Non-garbage patch this time] This is a draft patch for this issue. I'd appreciate feedback on the approach. I've added two API methods - getFirstChild and getLastChild. I've changed DataNode to keep its child list in a TreeSet, not a HashSet. This may be controversial because operations are now logarithmic in the number of children, not O(1), but the datatree's hashmap is still used for most path-based lookups. Both API methods throw an Empty exception if there are no children. I've decided not to address the issue of what to sort by here, and just defaulted to sorting by node name. The API leaves the door open to specifying a sort order at the time of parent node creation (i.e. another argument to create(...)). This patch obviously lacks tests, comments and general cleanup. You should be able to use a shell to call lsfirst and lslast. Any comments on the APIs would be great - would we rather one getOrderedChild API which took a flag describing whether to get the first or last? Or a getNthChild API?
          Hide
          Henry Robinson added a comment -

          Draft patch

          Show
          Henry Robinson added a comment - Draft patch
          Hide
          Henry Robinson added a comment -

          I started taking a look at this today, just to see how hard it would be.

          Sorting by name would be extremely easy; even sorting in such a way as to make sure that sequential nodes return in order would be straightforward. This would be sufficient to support e.g. queues.

          Sorting by anything else (zxid, for example) is hard because a DataNode can't get at the data of its children, only their names. We could add references to child objects, but that would increase the size of a DataNode object by a factor of asymptotically 2. We could also add a reference to the parent DataTree, and use the map there to retrieve child objects, but that's an uglier design.

          Does anyone have any thoughts? I'm tempted just to add getOrderedChildByName(path, ordertype) and leave it there. It would be enough to help out several use cases. As a side-effect, getChildren would return a sorted list of child paths.

          Show
          Henry Robinson added a comment - I started taking a look at this today, just to see how hard it would be. Sorting by name would be extremely easy; even sorting in such a way as to make sure that sequential nodes return in order would be straightforward. This would be sufficient to support e.g. queues. Sorting by anything else (zxid, for example) is hard because a DataNode can't get at the data of its children, only their names. We could add references to child objects, but that would increase the size of a DataNode object by a factor of asymptotically 2. We could also add a reference to the parent DataTree, and use the map there to retrieve child objects, but that's an uglier design. Does anyone have any thoughts? I'm tempted just to add getOrderedChildByName(path, ordertype) and leave it there. It would be enough to help out several use cases. As a side-effect, getChildren would return a sorted list of child paths.
          Hide
          Benjamin Reed added a comment -

          we should keep in mind that someday we may have a partitioned namespace. when that happens some of these options would be hard/very expensive/blocking. NAME of course is easy. the client can always do this. when the creation happens, we can store the xid with the child's name in the parent data structure since it doesn't change, so CREATED is reasonable. MODIFIED and DATA_SIZE is more problematic/seemingly impossible in the presence of a namespace partition.

          Show
          Benjamin Reed added a comment - we should keep in mind that someday we may have a partitioned namespace. when that happens some of these options would be hard/very expensive/blocking. NAME of course is easy. the client can always do this. when the creation happens, we can store the xid with the child's name in the parent data structure since it doesn't change, so CREATED is reasonable. MODIFIED and DATA_SIZE is more problematic/seemingly impossible in the presence of a namespace partition.
          Hide
          Henry Robinson added a comment -

          (Sorry for the slow reply - didn't get a notification from JIRA for some reason...)

          That seems a good API - my only concern is that there's not an obviously neat way to get the xid of an arbitrary child without another server roundtrip. However I haven't been able to come up with any non-pathological use cases where this is an expensive problem. Since we need to return at least one xid with every call, we could consider returning a list. listChildren would of course also need a path parameter.

          If we want to add lots of different orderings, we would need to generalise xid to an abstract Key. I could imagine:

          Order.CREATED_XID_ASC/DESC
          Order.MODIFIED_XID_ASC/DESC
          Order.DATA_SIZE_ASC/DESC
          Order.NAME_ASC/DESC

          for starters which require different key types. The Key could remain opaque, but then needs to be retrieved from some node. Each order that we add either requires servers to maintain an ordered data structure or potentially spend a lot of CPU on computing the order on the fly.

          It seems like watchers can have much the same semantics as for getChildren?

          Show
          Henry Robinson added a comment - (Sorry for the slow reply - didn't get a notification from JIRA for some reason...) That seems a good API - my only concern is that there's not an obviously neat way to get the xid of an arbitrary child without another server roundtrip. However I haven't been able to come up with any non-pathological use cases where this is an expensive problem. Since we need to return at least one xid with every call, we could consider returning a list. listChildren would of course also need a path parameter. If we want to add lots of different orderings, we would need to generalise xid to an abstract Key. I could imagine: Order.CREATED_XID_ASC/DESC Order.MODIFIED_XID_ASC/DESC Order.DATA_SIZE_ASC/DESC Order.NAME_ASC/DESC for starters which require different key types. The Key could remain opaque, but then needs to be retrieved from some node. Each order that we add either requires servers to maintain an ordered data structure or potentially spend a lot of CPU on computing the order on the fly. It seems like watchers can have much the same semantics as for getChildren?
          Hide
          Patrick Hunt added a comment -

          Take a look at ZOOKEEPER-282 - perhaps should wrap this all up together with

          List<String> listChildren(start, count, order, watch/watcher)

          type of api (async too)

          tricky bit is probably the semantics of watch? also how to keep track of the znodes
          btw calls? Perhaps start is the opaque xid which we somehow provide to the client for
          use in starting/continuing calls:

          xid = Xid.first()
          List<String> listChildren(xid, count, Order.XID_DESC, watch/watcher)
          ...
          List<String> listChildren(xid, count, Order.XID_DESC, watch/watcher)
          ... // loop until null returned

          Show
          Patrick Hunt added a comment - Take a look at ZOOKEEPER-282 - perhaps should wrap this all up together with List<String> listChildren(start, count, order, watch/watcher) type of api (async too) tricky bit is probably the semantics of watch? also how to keep track of the znodes btw calls? Perhaps start is the opaque xid which we somehow provide to the client for use in starting/continuing calls: xid = Xid.first() List<String> listChildren(xid, count, Order.XID_DESC, watch/watcher) ... List<String> listChildren(xid, count, Order.XID_DESC, watch/watcher) ... // loop until null returned

            People

            • Assignee:
              Unassigned
              Reporter:
              Henry Robinson
            • Votes:
              3 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:

                Development