Affects Version/s: 0.90.0
Fix Version/s: 0.90.0
This is pretty ugly. In short, on a heavily loaded cluster, we are queuing multiple instances of region close. They all try to run, confusing state.
I have a messy cluster. It's 16k regions on 8 servers. One node has 5k or so regions on it. Heaps are 1G all around. My master had OOME'd. Not sure why, but not too worried about it for now. So, a new master comes up and is trying to rebalance the cluster:
The balancer ends up sending many closes to a single overloaded server. They are taking so long that the close times out in RIT. We then do this:
We queue a new close (Should we?).
We time out a few more times (9 times) and each time we queue a new close.
Eventually the close succeeds and the region gets assigned a new location.
Then the next close pops off the eventhandler queue.
Here is the telltale signature of stuff gone amiss:
Notice how the state is OPEN when we are forcing offline (it was actually just successfully opened). We end up assigning the same server because the plan was still around:
But later, when the plan is cleared, we assign a new server and we have a double assignment.
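One way to picture a guard against the sequence above is the minimal sketch below. It is not HBase code and not necessarily the fix that went in: `CloseQueueSketch`, its `State` enum, and all method names are hypothetical, made up for illustration. It assumes two things: a timeout should not queue a second close while one is already in flight, and a close popped off the queue should be dropped if the region is no longer waiting to be closed.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch; none of these names are real HBase APIs. */
class CloseQueueSketch {
    enum State { OPEN, PENDING_CLOSE, CLOSED }

    private final Map<String, State> regionStates = new HashMap<>();
    private final Queue<String> closeQueue = new ArrayDeque<>();

    /** Record that a region has been (re)opened somewhere. */
    synchronized void regionOpened(String region) {
        regionStates.put(region, State.OPEN);
    }

    /**
     * Queue a close only if the region is currently OPEN.
     * A retry after a timeout finds PENDING_CLOSE and is refused,
     * so at most one close per region sits in the queue.
     */
    synchronized boolean requestClose(String region) {
        if (regionStates.get(region) != State.OPEN) {
            return false;
        }
        regionStates.put(region, State.PENDING_CLOSE);
        closeQueue.add(region);
        return true;
    }

    /**
     * Pop one queued close. Returns true only if a region was closed.
     * A stale entry (region no longer PENDING_CLOSE) is dropped untouched.
     */
    synchronized boolean processNextClose() {
        String region = closeQueue.poll();
        if (region == null || regionStates.get(region) != State.PENDING_CLOSE) {
            return false;
        }
        regionStates.put(region, State.CLOSED);
        return true;
    }

    synchronized State stateOf(String region) {
        return regionStates.get(region);
    }

    synchronized int queuedCloses() {
        return closeQueue.size();
    }
}
```

In this sketch the nine timeouts described above would each call `requestClose` and be refused, leaving one queued close. If a stale close did reach the queue some other way, `processNextClose` would see the region as OPEN and skip it rather than forcing it offline.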