Description
For each listening port, IPC Server#Listener#Reader is a single thread in charge of moving Connection items from pendingConnections (capacity 100) to the callQueue.
We experienced an incident where the Reader thread for the HDFS NameNode died from a runtime exception. The pendingConnections queue then filled up and the NameNode port became inaccessible.
In our particular case, what killed the Reader was an NPE caused by https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types of runtime exceptions could cause this issue as well.
We should add logic to either make the Reader more robust against runtime exceptions, or at least treat such an exception as FATAL so that the NameNode can fail over to the standby and admins are alerted to the real issue.
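To illustrate the two options, here is a minimal sketch rather than a patch against the actual Server code. The class ResilientReader, the processor and fatalHandler callbacks, and the use of String in place of Connection are all hypothetical names introduced for this example. A RuntimeException raised while handling one connection is logged and only that connection is dropped; anything else that escapes the loop is handed to a fatal handler instead of silently killing the thread.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical stand-in for the IPC reader loop; not Hadoop's actual Server code.
class ResilientReader implements Runnable {
    private final BlockingQueue<String> pendingConnections;
    private final Consumer<String> processor;       // may throw a RuntimeException
    private final Consumer<Throwable> fatalHandler; // e.g. log FATAL and terminate

    ResilientReader(BlockingQueue<String> pendingConnections,
                    Consumer<String> processor,
                    Consumer<Throwable> fatalHandler) {
        this.pendingConnections = pendingConnections;
        this.processor = processor;
        this.fatalHandler = fatalHandler;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String conn = pendingConnections.take();
                try {
                    processor.accept(conn);
                } catch (RuntimeException e) {
                    // Option 1: drop only the offending connection, keep the thread alive.
                    System.err.println("Dropping connection " + conn + ": " + e);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // normal shutdown path
        } catch (Throwable t) {
            // Option 2: escalate anything unexpected instead of dying silently.
            fatalHandler.accept(t);
        }
    }

    // Feeds three connections, the second of which throws an NPE,
    // and returns the connections that were processed successfully.
    static List<String> demo() {
        BlockingQueue<String> pending = new ArrayBlockingQueue<>(100);
        List<String> processed = new CopyOnWriteArrayList<>();
        Consumer<String> processor = conn -> {
            if (conn.equals("bad")) {
                throw new NullPointerException("simulated failure");
            }
            processed.add(conn);
        };
        Thread reader = new Thread(
            new ResilientReader(pending, processor,
                t -> System.err.println("FATAL: " + t)),
            "reader");
        reader.start();
        try {
            pending.put("a");
            pending.put("bad");
            pending.put("b");
            // Wait (bounded) until both good connections have been handled.
            long deadline = System.currentTimeMillis() + 5000;
            while (processed.size() < 2 && System.currentTimeMillis() < deadline) {
                Thread.sleep(10);
            }
            reader.interrupt();
            reader.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the connection after the failing one is still processed, which is the behavior the Reader lacked in the incident. In a real server the fatal handler would presumably terminate the process so that failover can happen, but that wiring is outside the scope of the example.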
Issue Links
- is duplicated by: HADOOP-11780 Prevent IPC reader thread death (Resolved)