Hadoop Common / HADOOP-13657

IPC Reader thread could silently die and leave NameNode unresponsive

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ipc
    • Labels: None

    Description

      For each listening port, IPC Server#Listener#Reader is a single thread in charge of moving Connection items from pendingConnections (capacity 100) to the callQueue.

      We experienced an incident in which the Reader thread of the HDFS NameNode died from a runtime exception. The pendingConnections queue then filled up, and the NameNode port became inaccessible.

      In our particular case, the Reader was killed by an NPE caused by https://bugs.openjdk.java.net/browse/JDK-8024883. In general, however, other types of runtime exceptions could cause the same problem.
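      The failure mode described above can be reproduced with a small toy model. This is a sketch under stated assumptions, not Hadoop code: the class and method names are hypothetical, a plain ArrayBlockingQueue of strings stands in for pendingConnections, and the NPE is thrown deliberately. Once the single consumer thread has died, nothing drains the bounded queue, so after 100 further connections every new one is refused.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy model, not Hadoop code: a bounded "pendingConnections" queue drained
// by a single "Reader" thread that dies on an uncaught RuntimeException.
class ReaderDeathDemo {
    // Capacity quoted in the issue for pendingConnections.
    static final int PENDING_CAPACITY = 100;

    /** Kills the reader, then offers `offered` connections; returns how many were rejected. */
    static int rejectedAfterReaderDeath(int offered) {
        BlockingQueue<String> pending = new ArrayBlockingQueue<>(PENDING_CAPACITY);

        Thread reader = new Thread(() -> {
            try {
                while (true) {
                    String conn = pending.take();
                    // Simulated bug: a "bad" connection triggers an NPE,
                    // and nothing in the loop catches it.
                    if (conn.startsWith("bad")) {
                        throw new NullPointerException("simulated failure on " + conn);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "Reader");

        // Model "silent" death: the handler discards the exception, so no
        // stack trace is printed and nothing escalates the failure.
        reader.setUncaughtExceptionHandler((t, e) -> { });
        reader.start();

        try {
            pending.put("bad-conn"); // the first connection hits the bug
            reader.join();           // wait until the Reader thread has terminated
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // Nothing drains the queue any more: new connections pile up until
        // the queue is full, after which every offer is refused.
        int rejected = 0;
        for (int i = 0; i < offered; i++) {
            if (!pending.offer("conn-" + i)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + rejectedAfterReaderDeath(150));
    }
}
```

      Running main prints "rejected: 50": the first 100 connections fill the queue and the remaining 50 are turned away, which is the toy equivalent of the NameNode port becoming inaccessible.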

      We should add logic to either make the Reader robust against runtime exceptions or, at a minimum, treat such an exception as FATAL, so that the NameNode fails over to the standby and admins are alerted to the real issue.
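      One possible way to combine both options proposed above is sketched below, again as a toy model under stated assumptions rather than the actual fix: a RuntimeException raised while handling a single connection is logged and counted and the loop continues, while an Error is escalated through a hypothetical onFatal hook before being rethrown. RobustReader, handler and onFatal are illustrative names, not Hadoop APIs.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Toy model, not Hadoop code. `handler` stands in for per-connection work and
// `onFatal` for a hypothetical escalation hook (e.g. aborting the process so
// that a standby NameNode can take over).
class RobustReader implements Runnable {
    private final BlockingQueue<String> pending;
    private final Consumer<String> handler;
    private final Runnable onFatal;
    final AtomicInteger handled = new AtomicInteger();
    final AtomicInteger failed = new AtomicInteger();

    RobustReader(BlockingQueue<String> pending, Consumer<String> handler, Runnable onFatal) {
        this.pending = pending;
        this.handler = handler;
        this.onFatal = onFatal;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            String conn;
            try {
                conn = pending.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // treat as normal shutdown
                return;
            }
            try {
                handler.accept(conn);
                handled.incrementAndGet();
            } catch (RuntimeException e) {
                // Option 1: a bug while handling one connection costs only that
                // connection; the loop keeps draining the queue.
                failed.incrementAndGet();
                System.err.println("Reader: dropping " + conn + ": " + e);
            } catch (Error e) {
                // Option 2: escalate anything worse instead of dying silently.
                onFatal.run();
                throw e;
            }
        }
    }

    /** Feeds five connections, two of which hit a simulated NPE;
     *  returns {handled, failed, aliveAfterFailures, fatalCalled}. */
    static int[] demo() {
        BlockingQueue<String> pending = new ArrayBlockingQueue<>(100);
        AtomicBoolean fatal = new AtomicBoolean(false);
        RobustReader reader = new RobustReader(pending,
                conn -> {
                    if (conn.startsWith("bad")) {
                        throw new NullPointerException("simulated");
                    }
                },
                () -> fatal.set(true));
        Thread t = new Thread(reader, "Reader");
        t.start();
        for (String c : new String[] {"a", "bad-1", "b", "bad-2", "c"}) {
            pending.add(c);
        }
        // Wait until all five connections have been processed one way or the other.
        while (reader.handled.get() + reader.failed.get() < 5) {
            Thread.yield();
        }
        boolean alive = t.isAlive();
        t.interrupt();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] {reader.handled.get(), reader.failed.get(), alive ? 1 : 0, fatal.get() ? 1 : 0};
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("handled=" + r[0] + " failed=" + r[1] + " aliveAfterFailures=" + (r[2] == 1));
    }
}
```

      Running main prints "handled=3 failed=2 aliveAfterFailures=true". Catching per connection keeps one bad request from taking down the whole port, while escalating Errors avoids masking conditions such as OutOfMemoryError that the loop cannot meaningfully recover from.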


            People

              Assignee: Unassigned
              Reporter: Zhe Zhang (zhz)
              Votes: 0
              Watchers: 6