I don't think that an "info file" and janitor process will solve my problem. I think this just shifts the information around a little bit but doesn't require any less data to be written.
A little more background here. There are two parallel branches of the zk data tree in use.
Services registry (there are many "ServiceName"s and many serviceProviders per ServiceName):
This is the first getChildren list I was discussing above.
Leases registry (There is a many-to-many relationship for clients to serviceProviders):
The problem is relatively unbounded growth in the leases registry.
If I changed this to use a single info file then it would be something like this. You would have the same services registry. Instead of the leases registry you would have a clients registry (for the janitor process to watch and trigger cleanup). There would also be a single leases info file that every client would write to once it had found a serviceProvider to create a lease to. This file would be highly contended, since there are potentially thousands of clients. Additionally, a large amount of data would need to be written to the file.
The straightforward approach would be to write a map of serviceProvider -> count. However, the janitor process would not be able to process a client disconnection in a meaningful manner with only that information. It wouldn't know how many leases to subtract from each serviceProvider's count for the client that disconnected.
Another approach to the global info file would be to write a mapping of client -> serviceProvider. However, that is essentially the same structure that's being written using the [EPHEMERAL_SEQUENTIAL] node now, and it would have the same problem of exceeding the jute max buffer size (this time we would likely exceed the size on the write instead of on the read, but the consequences would probably be the same).
A third option would be to write a mapping of client + serviceProvider -> count. That would be considerably less data and would still allow the janitor to clean up. This approach might work but has a lot of downsides too.
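To make the difference between the first and third info-file layouts concrete, here is a minimal sketch in plain Python. It does not touch ZooKeeper at all; the client and provider names, the `leases` list, and the `janitor_cleanup` helper are all made up for illustration. It only shows why a serviceProvider -> count map gives the janitor nothing to subtract, while a client + serviceProvider -> count map does.

```python
from collections import Counter

# Hypothetical lease records as (client, serviceProvider) pairs.
# These names are illustrative only and do not come from a real deployment.
leases = [
    ("client-1", "provider-a"),
    ("client-1", "provider-a"),
    ("client-1", "provider-b"),
    ("client-2", "provider-a"),
]

# Option 1: serviceProvider -> count. Compact, but the owning client is lost,
# so a janitor cannot tell how much to subtract when a client disconnects.
by_provider = Counter(provider for _, provider in leases)

# Option 3: (client, serviceProvider) -> count. Larger, but still attributable.
by_client_and_provider = Counter(leases)


def janitor_cleanup(info, disconnected_client):
    """Return per-provider totals after dropping the disconnected client's leases."""
    totals = Counter()
    for (client, provider), count in info.items():
        if client != disconnected_client:
            totals[provider] += count
    return totals


after = janitor_cleanup(by_client_and_provider, "client-1")
print(dict(by_provider))  # per-provider totals before the disconnect
print(dict(after))        # per-provider totals after client-1 disconnects
```

Note that the third layout still grows with the number of distinct client/serviceProvider pairs, and every client has to rewrite the same shared file, so the contention concern is unchanged.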
If I changed the structure to use an info file per serviceProvider, then I immediately fall back into having to do serviceProvider + 1 reads (one getChildren call plus one read per serviceProvider) in order to determine which serviceProvider to create a lease to. This is exactly what I want the "getChildrenWithStat" operation for: to use the numChildren field to track my counters and assign leases to the least-leased host.
I think both of these janitor-based solutions increase complexity and lose some of the benefits of the EPHEMERAL nodes. I don't see how they'll help solve my problem either, but maybe I'm still missing something?
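The read-count argument can be sketched with a small in-memory model. This is not the ZooKeeper client API: `FakeTree`, the `/leases` path, and the provider names are invented for illustration, and `get_children_with_stat` only models the proposed "getChildrenWithStat" operation, which is an assumption about how it might behave (returning each child together with its numChildren).

```python
class FakeTree:
    """In-memory stand-in for a znode tree that counts read round trips."""

    def __init__(self, children_by_path):
        self.children_by_path = children_by_path
        self.reads = 0

    def get_children(self, path):
        # Models one getChildren call: one round trip, names only.
        self.reads += 1
        return sorted(self.children_by_path[path])

    def num_children(self, path):
        # Models reading one node's stat to obtain numChildren: one round trip.
        self.reads += 1
        return len(self.children_by_path.get(path, ()))

    def get_children_with_stat(self, path):
        # Models the PROPOSED operation (hypothetical): child -> numChildren
        # for every child, in a single round trip.
        self.reads += 1
        return {
            child: len(self.children_by_path.get(f"{path}/{child}", ()))
            for child in sorted(self.children_by_path[path])
        }


def pick_with_n_plus_one_reads(tree, root):
    """One getChildren plus one stat read per serviceProvider."""
    counts = {p: tree.num_children(f"{root}/{p}") for p in tree.get_children(root)}
    return min(counts, key=counts.get)


def pick_with_single_read(tree, root):
    """One call that returns every child with its numChildren."""
    counts = tree.get_children_with_stat(root)
    return min(counts, key=counts.get)


# Hypothetical layout: each serviceProvider node has one child per lease.
layout = {
    "/leases": ["provider-a", "provider-b", "provider-c"],
    "/leases/provider-a": ["lease-1", "lease-2", "lease-3"],
    "/leases/provider-b": ["lease-4"],
    "/leases/provider-c": ["lease-5", "lease-6"],
}

slow = FakeTree(layout)
fast = FakeTree(layout)
slow_choice = pick_with_n_plus_one_reads(slow, "/leases")
fast_choice = pick_with_single_read(fast, "/leases")
print(slow_choice, slow.reads)
print(fast_choice, fast.reads)
```

Both strategies pick the same least-leased serviceProvider; the only difference in this model is the number of round trips, which grows with the number of serviceProviders in the first case and stays at one in the second.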