Answers lifted from email also (some fixes + one answer was modified due to clarification here ).
What is a failure and how do you react to failures? I think the master5 design needs to spend more effort to considering failure and recovery cases. I claim there are 4 types of responses from a networked IO operation - two states we normally deal with ack successful, ack failed (nack) and unknown due to timeout that succeeded (timeout success) and unknown due to timeout that failed (timeout failed). We have historically missed the last two cases and they aren't considered in the master5 design.
There are a few considerations. Let me examine if there are other cases than these.
I am assuming the collocated table, which should reduce such cases for state (probably, if collocated table cannot be written reliably, master must stop-the-world and fail over).
When RS contacts master to do state update, it errs on the side of caution - no state update, no open region (or split).
Thus, except for the case of multiple masters running, we can always assume RS didn't online the region if we don't know about it.
Then, for messages to RS, see "Note on messages"; they are idempotent so they can always be resent.
1) State update coordination. What is a "state updates from the outside" Do RS's initiate splitting on their own? Maybe a picture would help so we can figure out if it is similar or different from hbck-master's?
Yes, these are RS messages. They are mentioned in some operation descriptions in part 2 - opening->opened, closing->closed; splitting, etc.
2) Single point of truth. hbck-master tries to define what single point of truth means by defining intended, current, and actual state data with durability properties on each kind. What do clients look at who modifies what?
Sorry, don't understand the question. I mean single source of truth mainly about what is going on with the region; it is described in design considerations.
I like the idea of "intended state", however without more detailed reading I am not sure how it works for multiple ops e.g. master recovering the region while the user intends to split it, so the split should be executed after it's opened.
3) Table record: "if regions is out of date, it should be closed and reopened". It is not clear in master5 how regionservers find out that they are out of date. Moreover, how do clients talking to those RS's with stale versions know they are going to the correct RS especially in the face of RS failures due to timeout?
On alter (and startup if failed), master tries to reopen all regions that are out of date.
Regions that are not opened with either pick up the new version when they are opened, or (e.g. if they are now Opening with old version) master discovers they are out of date when they are transitioned to Opened by RS, and reopens them again.
As for any case of alter on enabled table, there are no guarantees for clients.
To provide these w/o disable/enable (or logical equivalent of coordinating all close-s and open-s), one would need some form of version-time-travel, or waiting for versions, or both.
4) region record: transition states. This is really similar to hbck-masters current state and intended state. Shouldn't be defined as part of the region record?
I mention somewhere that could be done. One thing is that if several paths are possible between states, it's useful to know which is taken.
But do note that I store user intent separately from what is currently going on, so they are not exactly similar as far as I see.
5) Note on user operations: the forgetting thing is scary to me – in your move split example, what happens if an RS reads state that is forgotten?
I think my description of this might be too vague. State is not forgotten; previous intent is forgotten. I.e. if user does several operations in order that conflict (e.g. split and then merge), the first one will be canceled (safely ).
Also, RS does not read state as a guideline to what needs to be done.
6) table state machine. how do we guarantee clients are writing from the correct version in the in failures?
Can you please elaborate?
7) region state machine. Earlier draft hand splitting and merge cases. Are they elided in master5 or are not present any more. How would this get extended handle jeffrey's distributed log replay/fast write recovery feature?
As I mention somewhere these could be separate states. I was kind of afraid of blowing up state machine too much, so I noticed that for split/merge you anyway store siblings/children, so you can recognize them and for most purposes different split-merge states are the same as Opened and Closed.
I will add those back, it would make sense.
8) logical interactions: sounds like master5 allows concurrent region and table operations. hbck-master (though not fully documented) only allows certain region transitions when the table is enabled or if the table is disabled. Are we sure we don't get into race conditions? What happens if disable gets issued – its possible for someone to reopens the region and for old clients to continue writing to it even though it is closed?
Yes, parallelism is intended. You can never be sure you have no races but we should aim for it
master5 is missing disabled/enabled check, that is a mistake.
Part1 operation interactions already cover it:
table disable doesn't ack until all regions are closed (master5 is wrong ).
region opening cannot start if table is already disabling or disabled.
if region is already opening when disable is issued, opening will be opportunistically canceled.
if disable fails to cancel opening, or server opens it first in a race, region will be opened, and master will issue close immediately after state update. Given that region is not closed, disable is not complete.
if opening (or closing) times out, master will fence off RS and mark region as closed. If there was some way of fencing region separately (ZK lease?) it would be possible to use that.
In any case, until client checks table state before every write, there's no easy way to prevent writes on disabling table. Writes on disabled table will not be possible.
On ensuring there's no double assignment due to RS hanging:
The intent is to fence the WAL for region server, the way we do now. One could also use other mechanism.
Perhaps I could specify it more clearly; I think the problem of making sure RS is dead is nearly orthogonal.
In my model, due to how opening region is committed to opened, we can only be unsure when the region is in Opened state (or similar states such as Splitting which are not present in my current version, but will be added).
In that case, in absence of normal transition, we cannot do literally anything with the region unless we are sufficiently sure that RS is sufficiently dead (e.g. cannot write).
So, while we ensure that RS is dead we don't reassign.
My document implies (but doesn't elaborate, I'll fix that) that master does direct Opened->Closed direct transition only when that is true.
A state called "MaybeOpened" could be added. Let me add it...