I'd be interested to see your thoughts about determining the minimum number of RSs.
How long are you going to wait? What's the trigger that has you start assigning?
This is a configuration parameter that already exists. We could also add some kind of timeout (once one RS checks in, we wait N seconds before starting up).
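The startup gate described above (a minimum RS count, plus a timeout measured from the first check-in) could be sketched like this. All names here are hypothetical, not actual HBase code:

```java
// Sketch of a startup gate: assignment may begin once a minimum number of
// RSs have checked in, OR once N milliseconds have elapsed since the first
// RS checked in. Illustrative only; not actual HBase code.
public class StartupGate {
    private final int minServers;         // the existing config parameter
    private final long waitAfterFirstMs;  // the proposed timeout ("N seconds")
    private int checkedIn = 0;
    private long firstCheckinTime = -1;   // -1 means no RS has checked in yet

    public StartupGate(int minServers, long waitAfterFirstMs) {
        this.minServers = minServers;
        this.waitAfterFirstMs = waitAfterFirstMs;
    }

    /** Called each time an RS reports in; remembers the first check-in time. */
    public synchronized void serverCheckedIn(long nowMs) {
        checkedIn++;
        if (firstCheckinTime < 0) {
            firstCheckinTime = nowMs;
        }
    }

    /** True once the Master may start assigning regions. */
    public synchronized boolean canStartAssigning(long nowMs) {
        if (checkedIn >= minServers) {
            return true;
        }
        return firstCheckinTime >= 0 && nowMs - firstCheckinTime >= waitAfterFirstMs;
    }
}
```

Either condition is sufficient, so a slow-starting cluster still comes up after the timeout even if it never reaches the configured minimum.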
Also, how do you handle the assignment of ROOT and META? What are the chances that there will always be enough RSs reported in by the time both catalog regions are assigned?
As I said, this doc leaves out catalog tables for now. One thing to remember: startup will no longer need to wait around for the meta scanner to kick in, and region assignment should be significantly faster, so assigning them out could potentially be sub-second.
So once the load balancer is started, do we block regions from going into transition?
Well sometimes you can't. The idea is that load balancing runs when the cluster is at steady state.
[if regions in transition] Doesn't it just go back to sleep?
It could. That might be the simpler approach, except that if you set a long period you'd potentially skip a load balance because of a single region in transition.
The load balancer will probably be the last thing I add; there are still some details to flesh out. I think it makes sense not to kick off a load balance when there are already regions in transition (it makes sense as a background operation when the cluster is mostly at steady state from a region-movement standpoint). Whether it waits or skips the run is an open question; I'd opt for waiting. The next run of the load balancer would happen at the fixed period, maybe measured from the end of the previous run (rather than strictly periodic).
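The wait-versus-skip choice and the run-to-run scheduling could be sketched as follows. This is only an illustration of the policy described above, with hypothetical names:

```java
// Sketch of the load balancer scheduling policy: don't balance while any
// region is in transition (wait rather than skip), and schedule the next run
// from the END of the previous run rather than on a strict period.
// Illustrative only; not actual HBase code.
public class BalancerSchedule {
    public enum Decision { RUN, WAIT }

    /** Balance only when the cluster is at steady state (no regions moving). */
    public static Decision decide(int regionsInTransition) {
        return regionsInTransition == 0 ? Decision.RUN : Decision.WAIT;
    }

    /** Next run measured from when the previous run finished. */
    public static long nextRunTime(long previousRunEndedAtMs, long periodMs) {
        return previousRunEndedAtMs + periodMs;
    }
}
```

Anchoring the period to the end of the previous run keeps a long-running balance from immediately triggering the next one.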
What happens if a region splits while load balancing is running?
Good question. I'll try to address splits in the next revision; they are largely ignored in this one, as is META editing in general. Let me take a crack at that.
"We assume that the Master will not fail until after the OFFLINE nodes have been created in ZK" And if it does? Is that for later?
Later, perhaps. It does not seem unreasonable to assume the Master won't fail during the first minute of the initial cluster startup, but I suppose it could eventually be handled. In any case, I think this is okay for now.
Master sends RPCs to each RS, telling them to OPEN their regions. Is this a new message?
Yup. There'll be new OPEN and CLOSE RPCs to the RS. In one of my earlier branches for this I moved the cluster admin methods to be direct RPCs to the RS as well, but that's outside the scope of this for now.
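As a rough picture, the new direct Master-to-RS messages could look like the interface below, shown with a trivial in-memory RS standing in for a real server. The names and shape are hypothetical; the actual RPC surface is still to be defined:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed OPEN/CLOSE RPCs from the Master to an RS.
// Illustrative only; not actual HBase code.
interface RegionServerAdmin {
    /** Tell the RS to begin serving the given region. */
    void openRegion(String regionName);
    /** Tell the RS to stop serving the given region. */
    void closeRegion(String regionName);
}

// A toy in-memory implementation, used here only to show the message flow.
class FakeRegionServer implements RegionServerAdmin {
    final Set<String> serving = new HashSet<>();
    public void openRegion(String regionName)  { serving.add(regionName); }
    public void closeRegion(String regionName) { serving.remove(regionName); }
}
```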
Who updates .META.? The RS, right?
That is the plan. I will address META editing and splits in the next revision of the doc.
[load balancing] This has to be better than what we currently have?
Yes, this should be better (it would be hard to be worse). It's also feasible to use ZK to make a synchronous client call (and a client could look at the status of nodes in ZK to see the status of regions in transition).
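One way to picture the ZK-based status check: each region in transition has a znode recording its current state, and a client can read those nodes to see what is moving. The sketch below models the znode tree with a plain Map rather than a real ZooKeeper handle, and the path layout and state names are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of reading region-transition state "out of ZK": each region in
// transition has a znode (here, a Map entry standing in for the real
// ZooKeeper tree); absence of a node means the region is at steady state.
// Paths and state names are illustrative, not the actual layout.
public class TransitionStatus {
    // znode path -> serialized state, e.g. /hbase/unassigned/<region> -> "OPENING"
    private final Map<String, String> znodes = new HashMap<>();

    public void setNode(String region, String state) {
        znodes.put("/hbase/unassigned/" + region, state);
    }

    /** The RS deletes the node once the region is fully open. */
    public void deleteNode(String region) {
        znodes.remove("/hbase/unassigned/" + region);
    }

    /** A region with no unassigned znode is assumed opened (steady state). */
    public String stateOf(String region) {
        return znodes.getOrDefault("/hbase/unassigned/" + region, "OPENED");
    }

    public int regionsInTransition() {
        return znodes.size();
    }
}
```

A synchronous client call could then simply poll (or watch) until `regionsInTransition()` for the regions it cares about drops to zero.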
"Master determines which regions need to be handled." How?
In the doc I describe that assignment information is sitting in-memory in the Master, but I'm not positive we should use this, since a failed-over Master may not have it yet. I also left out what a failed-over Master does to rebuild the assignments in memory. The latter would be done with the meta scanner; the former could be as well.
"Master sends RPCs to RSs to open all the regions." What does this message look like? Is it the new one mentioned above? Will it take time to figure out what was on the dead RS, or will it be near-instantaneous?
If we have up-to-date in-memory state to work with, it's near-instantaneous; meta scanning, not so much. It'd be nice to be able to use the in-memory info if it is available, and if it's not complete (we're a failed-over master that hasn't fully recovered yet) we'd fall back to meta scanning.
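That fast-path/slow-path choice could be sketched as below: consult the in-memory assignment map when it is known to be complete, otherwise fall back to the (slow) meta scan. The names and the `Supplier` standing in for the meta scanner are hypothetical; the real recovery path is left open in the doc:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Sketch of the fallback for finding regions on a dead RS: near-instant
// lookup from the Master's in-memory assignments when complete, else a meta
// scan. Illustrative only; not actual HBase code.
public class AssignmentLookup {
    private final Map<String, List<String>> inMemory; // server -> regions
    private final boolean inMemoryComplete;           // false on a freshly failed-over Master
    private final Supplier<List<String>> metaScan;    // the slow path

    public AssignmentLookup(Map<String, List<String>> inMemory,
                            boolean inMemoryComplete,
                            Supplier<List<String>> metaScan) {
        this.inMemory = inMemory;
        this.inMemoryComplete = inMemoryComplete;
        this.metaScan = metaScan;
    }

    /** Regions that were on the given (now dead) server. */
    public List<String> regionsOnDeadServer(String server) {
        if (inMemoryComplete) {
            return inMemory.getOrDefault(server, Collections.emptyList());
        }
        return metaScan.get(); // fall back to scanning .META.
    }
}
```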
I don't see any mention of waiting on RS WAL splitting in here. Should it be mentioned?
Yup. Good call.
Thanks for reviewing it, guys. I'm not sure I'll have time this week/weekend to do a refresh, but I should have a revised version up sometime Monday.