It looks like region server responses are being processed incorrectly in places, allowing a region to be opened on two servers.
- The region server report handling in procedures should check which server is reporting.
- Also, although I didn't check (and it isn't implicated in this bug), the RS must verify in OPEN that it is actually the RS the master sent the open to (w.r.t. start timestamp).
This was previously "mitigated" by the master killing any RS that sent incorrect reports, but due to race conditions between reports and assignment the kill was replaced with a warning, so now this condition persists.
Regardless, the kill approach is not a good fix because there's still a window when a region can be opened on two servers.
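To illustrate the first check (the master verifying which server is reporting), here is a minimal Java sketch. `ServerId` and `OpenAttempt` are hypothetical names invented for this example, not HBase classes, and the ports and start codes are made up. It only shows the idea that a report from a server other than the current target of the open should not drive the procedure.

```java
import java.util.Objects;

/** Hypothetical stand-in for a server identity: host, port, and start timestamp. */
final class ServerId {
    final String host;
    final int port;
    final long startCode;

    ServerId(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ServerId)) return false;
        ServerId other = (ServerId) o;
        // Two incarnations of the same host:port differ by start code.
        return port == other.port && startCode == other.startCode && host.equals(other.host);
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, port, startCode);
    }
}

/** Hypothetical master-side state for one in-flight region open. */
final class OpenAttempt {
    enum Outcome { ACCEPTED, IGNORED_STALE }

    private ServerId target;

    OpenAttempt(ServerId target) {
        this.target = target;
    }

    /** Point the attempt at a new server, e.g. after the previous target died. */
    void retryOn(ServerId newTarget) {
        this.target = newTarget;
    }

    /** A report (success or failure) counts only if it comes from the current target. */
    Outcome onReport(ServerId reporter) {
        return reporter.equals(target) ? Outcome.ACCEPTED : Outcome.IGNORED_STALE;
    }
}
```

With a check like this, a late failure from the old server would be dropped instead of triggering another retry while the open on the new server is still in progress.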
A region is being opened by server_48c. The server dies, and we process the retry correctly (retry=3 because 2 previous similar open failures were processed correctly).
We start opening it on server_1aa now.
However, we get the remote procedure failure from 48c after we've already started that.
It actually tried to open on the restarted RS, which makes me wonder whether this is also safe w.r.t. other races: what if the RS had already initialized and didn't error out?
Need to check if we verify the start code expected by master on RS when opening.
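The RS-side check in question could look roughly like the sketch below. `RegionServerSketch` and `handleOpen` are hypothetical names for illustration only; whether and how the real open path carries the master's expected start code is exactly what still needs to be checked.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical region server that rejects opens addressed to a previous incarnation. */
final class RegionServerSketch {
    private final long startCode;
    private final Set<String> openRegions = new HashSet<>();

    RegionServerSketch(long startCode) {
        this.startCode = startCode;
    }

    /**
     * Opens the region only if the master addressed this incarnation.
     * Returns false (and opens nothing) when the expected start code is stale.
     */
    boolean handleOpen(String region, long expectedStartCode) {
        if (expectedStartCode != startCode) {
            return false; // request was meant for an earlier or different incarnation
        }
        openRegions.add(region);
        return true;
    }

    boolean isOpen(String region) {
        return openRegions.contains(region);
    }
}
```

In this sketch a restarted server that has already finished initializing would still refuse an open request carrying the old start code, rather than relying on the request happening to error out.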
Without any other reason (at least logged), the RIT immediately retries again and chooses a new candidate. It then retries again and goes to the new 48c, but that's unrelated.
What does happen, though, is that 1aa, which never got a chance to respond before the RIT erroneously retried above, finishes opening the region, which the master ignores.
The master then starts spamming these warnings until finally the region is open in two places.
This can result in data loss.