Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: newbie

      Description

      It would be extremely useful to be able to have a "getChildrenWithStat" method. This method would behave exactly the same as getChildren but in addition to returning the list of all child znode names it would also return a Stat for each child. I'm sure there are quite a few use cases for this but it could save a lot of extra reads for my application.
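
      For illustration only, a client-side shape along the following lines is what is being asked for; the interface and method below are assumptions for the sketch, not an existing ZooKeeper API:

          import java.util.Map;

          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.data.Stat;

          // Hypothetical addition to the client API (sketch only): return every child
          // name of the given znode together with its Stat, in a single round trip.
          interface ChildrenWithStat {
              Map<String, Stat> getChildrenWithStat(String path, boolean watch)
                      throws KeeperException, InterruptedException;
          }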

        Activity

        Daniel Lord created issue -
        Mahadev konar added a comment -

        Agree. Would be very useful!

        Patrick Hunt made changes -
        Labels: newbie
        Daniel Lord added a comment -

        I'm working on getting corporate approval to submit patches but I'm interested in taking a stab at this. I spoke with Pat Hunt today about this issue and it sounds like a good starting point to contribute.

        Patrick Hunt added a comment -

        Can you do this with multiupdate today?

        Unfortunately the DataNode only keeps track of the child names, not the DataNodes of the children. Regardless, the locking would also be an issue (doable, but not as clean or inexpensive as one would like).

        Daniel Lord added a comment -

        It seems like it might work. It would require one extra round trip (getChildren, then a multiupdate over the children), but that should be fine. I haven't had a chance to play around with 3.4 at all yet, but I'll take a look at it.

        Patrick Hunt added a comment -

        Answering my own question, re multiupdate: looks like not:

                            switch (subTxnResult.type) {
                                case OpCode.check:
                                    subResult = new CheckResult();
                                    break;
                                case OpCode.create:
                                    subResult = new CreateResult(subTxnResult.path);
                                    break;
                                case OpCode.delete:
                                    subResult = new DeleteResult();
                                    break;
                                case OpCode.setData:
                                    subResult = new SetDataResult(subTxnResult.stat);
                                    break;
                                case OpCode.error:
                                    subResult = new ErrorResult(subTxnResult.err);
                                    break;
                                default:
                                    throw new IOException("Invalid type of op");
                            }
        
        Ted Dunning added a comment -

        The thought that there should be a few more multi ops has come up before. Exists is a strong candidate. GetData is another.

        Benjamin Reed added a comment -

        Why do you need multi op? You could do a getChildren() and then shoot off a bunch of async exists calls. The problem with doing stat with getChildren() is that you make the large directory problem (too many entries to fit in a 1M response packet) that much worse. How is multi-op better than shooting off a bunch of async exists calls?
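
        For illustration, a minimal sketch of the getChildren-plus-async-exists pattern described above, using the existing client API; the helper class and method names are placeholders, and note the collected stats are not a single atomic snapshot:

            import java.util.List;
            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.CountDownLatch;

            import org.apache.zookeeper.AsyncCallback.StatCallback;
            import org.apache.zookeeper.KeeperException;
            import org.apache.zookeeper.ZooKeeper;
            import org.apache.zookeeper.data.Stat;

            class ChildStats {
                // Fetch the child list, then stat every child with async exists() calls.
                static Map<String, Stat> statChildren(ZooKeeper zk, String parent)
                        throws KeeperException, InterruptedException {
                    List<String> children = zk.getChildren(parent, false);
                    final Map<String, Stat> stats = new ConcurrentHashMap<String, Stat>();
                    final CountDownLatch done = new CountDownLatch(children.size());
                    StatCallback cb = new StatCallback() {
                        public void processResult(int rc, String path, Object ctx, Stat stat) {
                            if (stat != null) {          // child may have gone away in the meantime
                                stats.put((String) ctx, stat);
                            }
                            done.countDown();
                        }
                    };
                    for (String child : children) {
                        zk.exists(parent + "/" + child, false, cb, child);
                    }
                    done.await();
                    return stats;
                }
            }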

        Ted Dunning added a comment -

        There are two issues.

        One is atomicity of gets or of stat-returning operations like exists. It can be very handy to have atomic gets at the least. For some other operations, it can be very nice to get an atomic set of version numbers. That is useful for implementing a multi-read-modify-write operation.

        A separate (sort of) question is whether there should be a getChildrenWithStat operation.

        Daniel Lord added a comment -

        Ben/Ted, what do you think the performance characteristics of doing lots of quick reads would be? The reason that I'm requesting this feature is to avoid the "large directory problem" but also to avoid having to do a read for each child. Currently, there is a use case in my application that has cropped up with a very large directory and that has caused issues. I could use this operation (or perhaps the multi) to restructure the data that I'm writing to zk and avoid the issue. At my present scale, doing this with a multi or individual stats would cost me approximately one getChildren followed by 2000 individual stats. I'm hoping to make our clusters larger, not smaller, and if that costs too much I won't be able to continue to scale out.

        My current algorithm requires two getChildren operations. The first one will return approximately 2000 children and the second one could return several times that (let's say 10 times for the sake of argument). If I can switch the way that the data is written and leverage a "getChildrenWithStat" operation then I would change the second getChildren to "getChildrenWithStat" and I would receive approximately 2000 children with stats in that case as well. Having the ability to getChildrenWithStat would also allow me to change what data needs to be encoded in the names of the znodes too.

        Using my current algorithm my two getChildren calls have roughly this cost associated:
        The first getChildren returns 2000 EPHEMERAL children with each child having a name with length 39. This could possibly be changed to a name with a length of 23 but that would limit my flexibility and would lose a small amount of information. That makes the total size of the response approximately 78k characters.
        The second getChildren returns 20000 EPHEMERAL_SEQUENTIAL children with each child having a name with length 89. This could be shortened to a length of 58 with minimal data loss or a length of 27 with some acceptable loss of data. That makes the total size of the response approximately 1780k characters. If the number of children in this call approaches 45k then I will exceed the default jute max buffer size and the whole cluster spins out of control. This number can grow without human intervention (it's not necessarily a function of the cluster size).

        Switching my algorithm the way I would if I had getChildrenWithStat would have roughly this cost:
        The first getChildren call would remain identical.
        The second getChildren call would return 2000 PERSISTENT children with Stats. For the sake of argument let's say that the payload size is the same, per child, for these children as it was above. That puts this call at 178k. Additionally, this would scale as a function of the size of the cluster and not as something that is up to the behavior of the application.

        If my worries about doing 2000+ reads in order to accomplish this task are completely unfounded then we can probably close this issue and I'll either implement an async-reads solution or use a multi-read solution. However, I fear that won't scale well. I have a large number of clients that all need to do this process fairly frequently. As long as my number of children has stayed under 10k I have had no problems. However, a bug was discovered (outside of this code but affecting this process) that caused the number of children to shoot up to 45k and caused the jute max buffer size to be exceeded. I'm trying to protect myself from that issue.

        Ted Dunning added a comment -

        Another option would be to add one or a few comprehensive information files that contain information from all (or some) of the children.

        When a child is created, it would be created as an ephemeral with no content, and the corresponding comprehensive file would be updated in an atomic read-modify-write way.

        If the child departs in an orderly way, it could update the comprehensive file itself. Otherwise, you should have one or more janitor processes that watch the directory with all the ephemerals and update the comprehensive file as needed. The janitor could also be tasked with adding children to the comprehensive file if you like.

        The virtue of this method is that you avoid the need for your stat operation, since you can simply read the comprehensive files and know everything you need to know. The anti-virtue is that there is a growing update which requires network traffic quadratic in the number of children being watched as nodes start or stop. If you use the janitor to do the updates, you can moderate this cost by having the janitor update at a fixed maximum rate of, say, once every 5 seconds. During startup, many children would appear during this 5-second interval, so the number of updates would be radically decreased.

        Another defect of this approach is that when children change their information, consumers of that information must read a relatively large chunk of information (the comprehensive file).
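
        For illustration, a minimal sketch of the atomic read-modify-write update described above, assuming a single comprehensive info znode at an application-chosen path; the Modifier interface is a placeholder for the application's encode/merge step:

            import org.apache.zookeeper.KeeperException;
            import org.apache.zookeeper.ZooKeeper;
            import org.apache.zookeeper.data.Stat;

            class InfoFileUpdater {
                interface Modifier {
                    byte[] apply(byte[] current);       // application-specific merge of the info file
                }

                // Optimistic-concurrency update: read the znode with its version, modify locally,
                // and write back conditioned on that version; retry if another writer got in first.
                static void update(ZooKeeper zk, String infoPath, Modifier modifier)
                        throws KeeperException, InterruptedException {
                    while (true) {
                        Stat stat = new Stat();
                        byte[] current = zk.getData(infoPath, false, stat);
                        byte[] updated = modifier.apply(current);
                        try {
                            zk.setData(infoPath, updated, stat.getVersion());
                            return;
                        } catch (KeeperException.BadVersionException e) {
                            // another client updated the file between our read and write -- retry
                        }
                    }
                }
            }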

        Patrick Hunt added a comment -

        If the clients themselves registered with ZK using ephemeral nodes, and assuming the clients don't change state very often, then the janitor could use notifications to trigger updates to the info files (also batching updates as Ted suggests). The clients could even let the janitor know whether they cleaned up gracefully or not through zk. The downside is you'd need more than one janitor if you wanted to eliminate that as a single point of failure.

        Depending on the lifetime of the leases and the number of clients involved you would also want to think about optimistic locking contention on the comprehensive file (i.e. have more than one).
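
        For illustration, a rough sketch of such a notification-driven janitor with batched updates; the registry path, the 5-second interval, and the rebuild step are placeholders:

            import java.util.List;
            import java.util.concurrent.Executors;
            import java.util.concurrent.ScheduledExecutorService;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.atomic.AtomicBoolean;

            import org.apache.zookeeper.WatchedEvent;
            import org.apache.zookeeper.Watcher;
            import org.apache.zookeeper.ZooKeeper;

            // Janitor sketch: watch the directory of ephemerals, mark it dirty on any
            // notification, and rebuild the info file(s) at most once per interval.
            class Janitor implements Watcher {
                private final ZooKeeper zk;
                private final String registryPath;       // directory holding the ephemerals
                private final AtomicBoolean dirty = new AtomicBoolean(true);
                private final ScheduledExecutorService timer =
                        Executors.newSingleThreadScheduledExecutor();

                Janitor(ZooKeeper zk, String registryPath) {
                    this.zk = zk;
                    this.registryPath = registryPath;
                    timer.scheduleAtFixedRate(new Runnable() {
                        public void run() { maybeRebuild(); }
                    }, 5, 5, TimeUnit.SECONDS);
                }

                public void process(WatchedEvent event) {
                    dirty.set(true);                      // children changed; batch the work
                }

                private void maybeRebuild() {
                    if (!dirty.getAndSet(false)) {
                        return;                           // nothing changed since the last rebuild
                    }
                    try {
                        List<String> live = zk.getChildren(registryPath, this);  // re-arms the watch
                        rebuildInfoFiles(live);           // application-specific update of the info file(s)
                    } catch (Exception e) {
                        dirty.set(true);                  // try again on the next tick
                    }
                }

                private void rebuildInfoFiles(List<String> liveChildren) {
                    // placeholder: write the comprehensive file(s) from the current child set
                }
            }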

        Ted Dunning added a comment -

        ... then the janitor could use notifications to trigger updates to the info files (also batching updates as Ted suggests).

        Yes. This is the second option I described, apparently poorly enough not to be clear.

        The clients could even let the janitor know whether they cleaned up gracefully or not through zk.

        That is a nice touch.

        The downside is you'd need more than one janitor if you wanted to eliminate that as a single point of failure.

        Yes. Yet another leader-election application. One cute trick to eliminate herd effect is to have prospective leaders look at a count of candidates. If that is big, then they can just create an ephemeral in a candidate directory, but not actually become a candidate. If the number of candidates is relatively small, then they can join the fray. Non-candidates could watch for leader changes and reassess whether they should be a candidate whenever the leadership changes. Periodic polling could be used by non-candidates to assess whether there is any slippage going on and the polling can be adjusted based on the number of non-candidates to guarantee that candidates will reappear quickly without flooding ZK with polling queries.

        But that is off topic.

        Depending on the lifetime of the leases and the number of clients involved you would also want to think about optimistic locking contention on the comprehensive file (i.e. have more than one).

        Daniel Lord added a comment -

        I don't think that an "info file" and janitor process will solve my problem. I think this just shifts the information around a little bit but doesn't require any less data to be written.

        A little more background here. There are two parallel branches of the zk data tree in use.

        Services registry (there are many "ServiceName"s and many serviceProviders per ServiceName):
        /services/ServiceName/serviceProvider1 [EPHEMERAL]
        /services/ServiceName/serviceProvider2 [EPHEMERAL]
        ...
        /services/ServiceName/serviceProviderN [EPHEMERAL]

        This is the first getChildren list I was discussing above.

        Leases registry (There is a many-to-many relationship for clients to serviceProviders):
        /leases/ServiceName/serviceProvider1__client1_SEQUENCE_NUMBER [EPHEMERAL_SEQUENTIAL]
        /leases/ServiceName/serviceProvider2__client1_SEQUENCE_NUMBER [EPHEMERAL_SEQUENTIAL]
        ...
        /leases/ServiceName/serviceProviderN__clientX_SEQUENCE_NUMBER [EPHEMERAL_SEQUENTIAL]

        The problem is relatively unbounded growth in the leases registry.

        If I changed this to use a single info file then it would be something like this. You would have the same services registry. Instead of the leases registry you would have a clients registry (for the janitor process to watch and trigger clean-up). There would also be a single leases info file that the clients would all write into once a client had found a serviceProvider to create a lease to. This file would be highly contended since there are potentially thousands of clients. Additionally, there would need to be a large amount of data written to the file. The straightforward approach for what to write in this file would simply be a map of serviceProvider -> count. However, the janitor process would not be able to process a client disconnection in a meaningful manner with only that information. It wouldn't know how many leases to subtract from the serviceProvider -> count map for a given client disconnecting. Another approach to the global info file would be to write a mapping of client -> serviceProvider. However, that is essentially the same structure that's being written using the [EPHEMERAL_SEQUENTIAL] nodes now and would have the same problem of exceeding the jute max buffer size (this time we would likely exceed the size on the write instead of on the read, but the consequences would probably be the same). A third option would be to simply write a map of client + serviceProvider -> count. That would be considerably less data and would still allow the janitor to clean up. This approach might work but has a lot of downsides too.

        If I changed the structure to use an info file per serviceProvider then I immediately fall back into having to do serviceProvider + 1 reads in order to determine which serviceProvider to create a lease to. This is exactly what I want the "getChildrenWithStat" operation for: to use the numChildren field to track my counters and assign leases to the least-leased host.
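
        For illustration, the selection step described above, assuming the per-provider Stats have already been gathered somehow (e.g. via a hypothetical getChildrenWithStat, or assembled from individual exists() calls):

            import java.util.Map;

            import org.apache.zookeeper.data.Stat;

            class LeaseBalancer {
                // Pick the serviceProvider whose znode has the fewest children (i.e. the
                // fewest leases), using the numChildren field of each provider's Stat.
                static String leastLeasedProvider(Map<String, Stat> providerStats) {
                    String best = null;
                    int fewest = Integer.MAX_VALUE;
                    for (Map.Entry<String, Stat> entry : providerStats.entrySet()) {
                        if (entry.getValue().getNumChildren() < fewest) {
                            fewest = entry.getValue().getNumChildren();
                            best = entry.getKey();
                        }
                    }
                    return best;
                }
            }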

        I think both of these janitor-based solutions increase complexity and lose some of the benefits of the EPHEMERAL nodes. I don't see how they'll help solve my problem either, but maybe I'm still missing something?

        Ted Dunning added a comment -

        Daniel,

        The conventional answer to contention and size such as you are describing is to simply shard the info file. If you shard this file 50x then you can have 50 threads updating simultaneously and you decrease the dead weight data being passed around as part of the read-modify-write operation by 50x as well.

        The real point is that you can have a relatively small number of files so that you can now batch your status reads. This decreases the number of transactions required to read the status of all of the leases.

        Note also that since the shards have disjoint information, atomicity of reading is probably not an issue.
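
        For illustration, a minimal sketch of that sharding, assuming a hypothetical layout of N shard znodes under a per-service parent, with routing by client id:

            class LeaseShards {
                static final int NUM_SHARDS = 50;        // assumption: tuned to the expected contention

                // Route a client's lease record to one of NUM_SHARDS info znodes so writers
                // contend on a single shard rather than one global file, and a reader can
                // see every lease with NUM_SHARDS getData calls.
                static String shardPathFor(String serviceName, String clientId) {
                    int shard = (clientId.hashCode() & 0x7fffffff) % NUM_SHARDS;
                    return "/leases-info/" + serviceName + "/shard-" + shard;   // hypothetical layout
                }
            }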

        Daniel Lord added a comment -

        OK Ted, I still don't see a benefit to sharding with an info file vs. sharding on a set of PERSISTENT znodes and depending on ephemerals to track the data. If I can avoid having a janitor/reaper process I would be very happy. In any case, sharding is certainly a possible solution that has its upsides and its downsides. I'll certainly keep that in mind and it's something that has been kicked around a bit. I'd prefer a more conceptually pure, or ideal, solution that doesn't require arbitrary sharding, but if this is the best approach given the constraints, so be it.

        That being said, can anyone quantify the relative performance characteristics of the following proposals for how to implement a "getChildrenWithStat" operation (whether it is implemented directly in zk or I simply implement it myself with multiple reads)?
        1. Implement a new zk method getChildrenWithStat that simply sits on top of the current DataTree and has to do some iteration over the children nodes to build the stats.
        2. Restructure the DataTree/DataNodes so that they keep track of the DataNodes for the children instead of relative paths.
        – Note: I'm not sure what the implementation concerns around this would be. It's possible that this was done to conserve memory, or for other reasons. Would this even help the server answer this complex request more quickly/cleanly?
        3. Use a multi-op read to read all of the Stats "in one shot".
        4. Fire off a bunch of async reads and wait for them all to come back.

        I'm not concerned about the increase in response size that will come from having a ton of Stats come back all at once. It will give me considerably more flexibility and will actually let me shrink the amount of data that I need to read to do this process.

        If none of these truly work out then I think sharding is what I'll have to do. If I can avoid that I would love it though.

        Patrick Hunt added a comment -

        What's the contract you're proposing for getChildrenWithStat? Atomic? Because if so, that means locking all the child DataNodes (large in your case) and holding all of those locks while you extract the name and stat information. That could really impact latency, especially given that one might have a large number of clients making this request (i.e. that's where it's interesting as a solution).

        You could relax the requirement; it would still require a lock for each DN, but that would still be a lot of locks. Also, how interesting would this feature be for the user base at large if it were relaxed...

        2) I believe we have a list of child names specifically to make getChildren fast. We just need to lock a single DN (the parent) and then copy the names. create/delete costs are amortized.

        1) seems reasonable, but then there's the issue I mention earlier in this comment.

        3) same issue as 1.

        4) that could be a potentially massive number of requests (clients X providers); I guess it depends on how fine-grained the leases are.

        Daniel Lord added a comment -

        I think it would have to be atomic in order to remain consistent and generally useful. I certainly have a different set of requirements for this call but I don't think that it makes sense to loosen the strong atomicity and consistency guarantees in order to satisfy my requirements on this. I think, as Patrick has alluded to, that my project is on a considerably larger scale than most users and having an atomic operation is potentially not as expensive for other users.

        As far as #2 goes, I'm not sure that should make much of a difference, at least not with respect to locking. Since a node can't change names, as soon as you acquire the lock on the parent DataNode no children can be added or removed. This means that you could calculate the names without locking each individual child DataNode, right?

        As far as the locking problem goes, is the real concern here the contention of locks from multiple clients reading? If so, moving to a read/write lock might be appropriate. That's starting to push the scope a bit, but I'm just interested in exploring the issues/concerns that people have. This approach is definitely not the only one that is out there, but it is, at the moment, the most attractive to me.

        Patrick Hunt added a comment -

        Since a node can't change names, as soon as you acquire the lock on the parent DataNode no children can be added or removed.

        I think you'd have to acquire a lock on the child in order to guarantee that the stat of the child would be processed atomically, given that you want to include that in the response. So it's not just protecting against updates of the parent's child list, but also protecting against any changes in the children themselves.

        Patrick Hunt added a comment -

        @daniel It would be easy for you to try this out. Modify the current getChildren to process lists of DataNodes, serialize the result (with locking), and throw it away. You could exercise it using a standard client.

        Daniel Lord added a comment -

        Right. I meant that in response to the point about storing a list of strings for children instead of storing a list of DataNodes for children. If you changed the data structure of a DataNode to store child DataNodes instead of child strings then you shouldn't require any different level of synchronization than you do now. Correct?

        I completely agree that if you want to have all of the child stats to be atomic then you'd have to lock all of the children to disallow modification.

        Patrick Hunt added a comment -

        Gotcha. I was talking about withStat exclusively. I think you're right re the legacy getChildren(path).


          People

          • Assignee:
            Unassigned
          • Reporter:
            Daniel Lord
          • Votes:
            0
          • Watchers:
            1
