Description
Problem:
When the active NameNode is restarted and is loading the fsimage, the DFSRouters slow down significantly.
Investigation:
While the restarted active NameNode is loading the fsimage, RouterRpcClient receives a SocketException. RouterRpcClient#isUnavailableException(IOException) returns false for a SocketException, so MembershipNamenodeResolver#cacheNS is not refreshed. As a result, the order of the NameNodes returned by MembershipNamenodeResolver#getNamenodesForNameserviceId(String) is unchanged and the restarting NameNode is still returned first, so RouterRpcClient keeps trying to connect to the NameNode that is loading the fsimage.
Once the fsimage is loaded, the NameNode throws StandbyException. StandbyException is treated as an unavailable exception, so cacheNS is refreshed at that point.
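The behaviour described above can be modelled with a small self-contained sketch. This is not the Hadoop code: RouterFailoverSketch, StandbyLikeException, onFailure and firstNamenode are hypothetical names, and the cache refresh is approximated by moving the failed NameNode to the end of the list, which is an assumption made only for illustration.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the router behaviour described in this issue. */
class RouterFailoverSketch {

  /** Local stand-in for Hadoop's StandbyException so the sketch compiles on its own. */
  static class StandbyLikeException extends IOException {
    StandbyLikeException(String msg) {
      super(msg);
    }
  }

  /** Cached NameNode order for one nameservice; the first entry is tried first. */
  private final List<String> cachedOrder;

  RouterFailoverSketch(List<String> initialOrder) {
    this.cachedOrder = new ArrayList<>(initialOrder);
  }

  /** Mirrors the reported classification: a SocketException does not count as unavailable. */
  static boolean isUnavailable(IOException e) {
    return e instanceof StandbyLikeException;
  }

  /** Handles a failed call to the first NameNode in the cached order. */
  void onFailure(IOException e) {
    if (isUnavailable(e) && cachedOrder.size() > 1) {
      // Assumption: model the cache refresh as demoting the failed NameNode.
      cachedOrder.add(cachedOrder.remove(0));
    }
    // Otherwise the cached order is left untouched, so the same NameNode is retried.
  }

  String firstNamenode() {
    return cachedOrder.get(0);
  }
}
```

In this model, calling onFailure with a SocketException leaves the restarting NameNode at the head of the list, while a StandbyLikeException moves it to the end. That matches the sequence reported above: slow routing while the fsimage loads, then recovery once the NameNode starts throwing StandbyException.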
Workaround:
Instead of restarting the NameNode in one step, stop it and wait 1 minute before starting it again.
Issue Links
- is related to HDFS-15575 RBF: Create test cases that simulate general exceptions on NameNodes (Open)
- links to