Details
Bug
Status: Resolved
Major
Resolution: Fixed
None
None
Reviewed
Description
Currently, in BPServiceActor#offerService, when the DataNode hits a local IOException, it only logs the exception and immediately re-enters the while loop. Unlike the RemoteException branch, the IOException branch does not sleep before retrying:
    } catch(RemoteException re) {
      .......
      LOG.warn("RemoteException in offerService", re);
      try {
        long sleepTime = Math.min(1000, dnConf.heartBeatInterval);
        Thread.sleep(sleepTime);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    } catch (IOException e) {
      LOG.warn("IOException in offerService", e);
    }
If the IOException persists, the loop retries with no delay. For example, in a production cluster we saw a DataNode hit an exception during Kerberos realm lookup, and the resulting tight loop eventually caused the DataNode to send hundreds of DNS lookup queries per second.
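One possible remedy, sketched here only as an illustration and not as the committed patch, is to sleep in the IOException branch the same way the RemoteException branch already does. The class OfferServiceBackoffSketch, the Work interface, and the methods sleepTimeMs and runLoop are hypothetical names for this standalone example; only the Math.min(1000, heartBeatInterval) sleep calculation comes from the snippet above.

```java
import java.io.IOException;

public class OfferServiceBackoffSketch {

    /** Hypothetical stand-in for one iteration of the offerService loop body. */
    interface Work {
        void run() throws IOException;
    }

    /** Same sleep calculation the RemoteException branch uses. */
    static long sleepTimeMs(long heartBeatIntervalMs) {
        return Math.min(1000, heartBeatIntervalMs);
    }

    /**
     * Runs the work up to maxIterations times and sleeps after every
     * IOException instead of retrying immediately.
     * Returns the number of IOExceptions seen.
     */
    static int runLoop(Work work, int maxIterations, long heartBeatIntervalMs) {
        int failures = 0;
        for (int i = 0; i < maxIterations; i++) {
            try {
                work.run();
            } catch (IOException e) {
                failures++;
                System.err.println("IOException in offerService: " + e.getMessage());
                try {
                    // Assumption: back off exactly like the RemoteException branch.
                    Thread.sleep(sleepTimeMs(heartBeatIntervalMs));
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop looping.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Simulate a call that always fails locally, e.g. a realm lookup error.
        int failures = runLoop(() -> {
            throw new IOException("simulated realm lookup failure");
        }, 3, 50);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(failures + " failures in " + elapsedMs + " ms");
    }
}
```

With the sleep in place, a persistent local failure is retried at most about once per second (or once per heartbeat interval, if that is shorter) rather than as fast as the loop can spin.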