Details
Bug
Status: Closed
Blocker
Resolution: Fixed
0.90.3
None
None
Reviewed
Description
After some extensive debugging in the thread A sudden msg of "java.io.IOException: Server not running, aborting", we figured out that the region servers weren't able to talk to the new .META. location because the old one was still alive but on its way down after an OOME.
This shows up as exceptions like "Server not running" when trying to edit .META. Digging into the code, I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation, since we force the refresh. That method checks whether the RS is good by calling getRegionInfo, which does not check whether the region server is trying to close.
What this means is that a cluster can't recover from the failure of a .META.-serving RS until that RS has fully shut down, because every time an RS tries to open a region (for example, right after the log splitting) or split one, it fails editing .META.
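To illustrate the gap, here is a minimal, self-contained sketch in Java. It is not the HBase code: the RegionServer interface, the isStopping() method, FakeRegionServer, and both verify methods are hypothetical stand-ins invented for this example. It only shows why a check built solely on getRegionInfo keeps accepting a server that is on its way down, and how also consulting a "stopping" flag would reject it.

```java
import java.io.IOException;

final class MetaLocationCheck {

    /** Hypothetical stand-in for the calls a client can make on a region server. */
    interface RegionServer {
        String getRegionInfo(String regionName) throws IOException;
        boolean isStopping(); // assumption: the server exposes its shutdown state
    }

    /** Fake server used only for this illustration. */
    static final class FakeRegionServer implements RegionServer {
        private final boolean hostsRegion;
        private final boolean stopping;

        FakeRegionServer(boolean hostsRegion, boolean stopping) {
            this.hostsRegion = hostsRegion;
            this.stopping = stopping;
        }

        @Override
        public String getRegionInfo(String regionName) throws IOException {
            // Still answers while shutting down, as long as the region is online.
            if (!hostsRegion) {
                throw new IOException("Region not served: " + regionName);
            }
            return regionName;
        }

        @Override
        public boolean isStopping() {
            return stopping;
        }
    }

    /** Check as described in this issue: only asks for the region info. */
    static boolean verifyByRegionInfoOnly(RegionServer rs, String regionName) {
        try {
            return rs.getRegionInfo(regionName) != null;
        } catch (IOException e) {
            return false;
        }
    }

    /** One possible stricter check: also reject a server that is stopping. */
    static boolean verifyAlsoCheckingStopping(RegionServer rs, String regionName) {
        try {
            return !rs.isStopping() && rs.getRegionInfo(regionName) != null;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Old .META. host: region still online, but the server is going down.
        RegionServer dying = new FakeRegionServer(true, true);
        System.out.println("regionInfo only:    " + verifyByRegionInfoOnly(dying, ".META."));
        System.out.println("also checks stopping: " + verifyAlsoCheckingStopping(dying, ".META."));
    }
}
```

With the first check the dying server still verifies as a good .META. location, so callers keep being sent to it; with the second it is rejected and the caller can go look for the new location. Whether the actual fix takes this shape is not something this sketch claims.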