Carrying over from HBASE-3159, I ran into a case with TestRollingRestart where META and ROOT failed concurrently. This is how it played out:
META is closed on an RS that has been stopped:
Then ROOT is closed on a different RS that has been stopped:
A running RS is assigned META (the master isn't even aware yet that ROOT has been closed; it is processing the shutdown for RS 59709 but has not yet received the expired node for 59662):
After finishing the open of META, the RS goes to update location in ROOT and gets:
This doesn't actually kill the RS; it's just a caught exception up in the generic EventHandler. But we get left in a weird state. Eventually the master does the right thing and times out the OPENING:
But it chooses to assign it back to the same RS because the plan is still there:
But then the RS doesn't open it because it's actually already open on that server. We fail the ROOT edit but then don't close the region out.
This continues indefinitely, once every minute.
1. Address the race condition when we get the connection to the root server (it could exist for meta too). The blocking call thinks we have a location, but when we then fetch the cached location we don't get one.
2. If we do get this NPE writing to root (or maybe meta too), we should not just throw the exception all the way up to the EventHandler, log it, and continue. That just stops our META_OPEN in its tracks. We complete the open but not the edit, and we don't roll back in any way.
3. If the master is assigning stuff out and a region says, hey, I'm already hosting this region... something must be up. In this case, it would not have been good for the RS to tell the master that it was already hosting it because it was missing the root edit. So maybe if this happens, the master asks the RS to close the region in question? Dunno.
Probably more issues to think about around this.
This seems to be extremely rare. I have been running this TestRollingRestart script constantly, and it only happens when I do a concurrent kill of the server hosting ROOT and then the server hosting META, and then only sometimes. It works more often than not.
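To make points 1 and 2 concrete, here is a minimal, self-contained sketch of the shape such a fix could take. It is not HBase code: `RootLocationCache`, `RegionServer`, `openMeta`, and the server name `rs-a:60020` are all hypothetical stand-ins invented for illustration. The idea is to read the cached ROOT location once and turn a missing location into a checked failure instead of an NPE, and to undo the open when the catalog edit fails so the region is not left online without its ROOT entry.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: every class and method here is a hypothetical stand-in,
// not part of the HBase API.
final class MetaOpenSketch {

    /** Stand-in for whatever caches the location of the server hosting ROOT. */
    static final class RootLocationCache {
        private final AtomicReference<String> location = new AtomicReference<>();

        void set(String server) { location.set(server); }

        void clear() { location.set(null); }

        /** Returns the cached location, or null if ROOT currently has no known server. */
        String get() { return location.get(); }
    }

    /** Stand-in for a region server that tracks which regions it has online. */
    static final class RegionServer {
        final Set<String> online = new HashSet<>();
        private final RootLocationCache rootCache;

        RegionServer(RootLocationCache rootCache) { this.rootCache = rootCache; }

        /** Point 1: read the location once and check it, rather than assuming it is set. */
        void updateMetaLocationInRoot() throws IOException {
            String root = rootCache.get();
            if (root == null) {
                throw new IOException("ROOT location unavailable, cannot record META location");
            }
            // A real implementation would write the META location to ROOT here.
        }

        /** Point 2: if the catalog edit fails, roll the open back instead of leaving it half done. */
        boolean openMeta(String region) {
            online.add(region);
            try {
                updateMetaLocationInRoot();
                return true;
            } catch (IOException e) {
                online.remove(region); // undo the open so a later assignment can retry cleanly
                return false;
            }
        }
    }

    public static void main(String[] args) {
        RootLocationCache cache = new RootLocationCache();
        RegionServer rs = new RegionServer(cache);

        cache.clear(); // ROOT's server has just gone away
        System.out.println("opened without ROOT: " + rs.openMeta(".META."));
        System.out.println("still online after failure: " + rs.online.contains(".META."));

        cache.set("rs-a:60020"); // ROOT has been reassigned (hypothetical server name)
        System.out.println("opened with ROOT: " + rs.openMeta(".META."));
    }
}
```

With the rollback in place, a reassignment of the same region to the same server would find it not online and could retry the full open, rather than skipping it as already open. Whether the real fix should retry, wait for ROOT, or report the failure to the master is left open here.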
|Transition||Time In Source Status||Execution Times||Last Executer||Last Execution Date|
| ||1d 5h 50m||1||Jonathan Gray||29/Oct/10 23:15|
| ||18h 49m||1||Jonathan Gray||30/Oct/10 18:04|

|Field||Original Value||New Value|
|Status||Patch Available [ 10002 ]||Resolved [ 5 ]|
|Resolution|| ||Fixed [ 1 ]|
|Status||Open [ 1 ]||Patch Available [ 10002 ]|
|Assignee|| ||Jonathan Gray [ streamy ]|