Hadoop HDFS / HDFS-8429

Avoid stuck threads if there is an error in DomainSocketWatcher that stops the thread


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: None
    • Labels: None

    Description

      In our cluster, an application hung while doing a short-circuit read of a local HDFS block. Looking into the log, we found that the DataNode's DomainSocketWatcher.watcherThread had exited with the following log:

      ERROR org.apache.hadoop.net.unix.DomainSocketWatcher: Thread[Thread-25,5,main] terminating on unexpected exception
      java.lang.NullPointerException
              at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:463)
              at java.lang.Thread.run(Thread.java:662)
      

      Line 463 is the following code snippet:

         try {
           for (int fd : fdSet.getAndClearReadableFds()) {
             sendCallbackAndRemove("getAndClearReadableFds", entries, fdSet,
                 fd);
           }
      

      getAndClearReadableFds is a native method that mallocs an int array. Since memory on our cluster is very tight, it looks like the malloc failed and returned NULL, which then surfaced in Java as the NullPointerException above.
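
      As a stand-alone illustration (not the HDFS code itself): iterating over a null array in Java throws the NullPointerException at the loop header, which matches the line reported in the trace. The stub below simulates the native call returning NULL after a failed malloc; the names are placeholders, not the real implementation.

         public class NullArrayIteration {
             // Stand-in for the native getAndClearReadableFds(); returns null
             // to simulate a failed malloc inside the JNI implementation.
             static int[] getAndClearReadableFds() {
                 return null;
             }

             public static void main(String[] args) {
                 // The enhanced for-statement dereferences the array reference
                 // first, so a null return throws NullPointerException here.
                 for (int fd : getAndClearReadableFds()) {
                     System.out.println(fd);
                 }
             }
         }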

      Worse, other threads then blocked with stack traces like this:

      "DataXceiver for client unix:/home/work/app/hdfs/c3prc-micloud/datanode/dn_socket [Waiting for operation #1]" daemon prio=10 tid=0x00007f0c9c086d90 nid=0x8fc3 waiting on condition [0x00007f09b9856000]
         java.lang.Thread.State: WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00000007b0174808> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
              at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:323)
              at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:322)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:403)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:214)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:95)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
              at java.lang.Thread.run(Thread.java:662)
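
      These threads are parked inside DomainSocketWatcher.add, which queues a request and then waits on a condition that only the watcher thread signals after servicing it. A simplified model of that hand-off (illustrative names, not the actual class) shows why a dead watcher thread strands every caller:

         import java.util.concurrent.locks.Condition;
         import java.util.concurrent.locks.ReentrantLock;

         // Simplified model of the add()/watcher hand-off.
         class WatcherHandoff {
             private final ReentrantLock lock = new ReentrantLock();
             private final Condition processed = lock.newCondition();
             private boolean handled = false;

             // Called by DataXceiver-like threads.
             void add() throws InterruptedException {
                 lock.lock();
                 try {
                     while (!handled) {
                         // Parks forever if the watcher thread died before
                         // ever calling markHandled().
                         processed.await();
                     }
                 } finally {
                     lock.unlock();
                 }
             }

             // Called by the watcher thread once it has serviced the request;
             // if that thread has exited, this is never invoked.
             void markHandled() {
                 lock.lock();
                 try {
                     handled = true;
                     processed.signalAll();
                 } finally {
                     lock.unlock();
                 }
             }
         }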
      

      IMO, we should exit the DN so that users can know that something went wrong and fix it.
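
      A minimal sketch of that proposal (assuming Hadoop's org.apache.hadoop.util.ExitUtil helper; the patch as committed may differ): wrap the watcher loop so that any unexpected Throwable terminates the process instead of silently stranding the waiters.

         import org.apache.hadoop.util.ExitUtil;

         // Sketch only: fail fast if the watcher loop dies unexpectedly.
         public class FailFastWatcher implements Runnable {
             @Override
             public void run() {
                 try {
                     while (true) {
                         // Placeholder for the real poll/callback work.
                         servicePendingRequests();
                     }
                 } catch (Throwable t) {
                     // Any escape from this loop strands threads waiting in
                     // add(); terminate so operators see the failure instead
                     // of a silently hung DataNode.
                     ExitUtil.terminate(1, "DomainSocketWatcher thread terminating on "
                         + "unexpected exception: " + t);
                 }
             }

             private void servicePendingRequests() {
                 // Hypothetical stand-in for polling fds and running callbacks.
             }
         }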

      Attachments

        1. HDFS-8429-001.patch
          3 kB
          zhouyingchao
        2. HDFS-8429-002.patch
          5 kB
          zhouyingchao
        3. HDFS-8429-003.patch
          6 kB
          zhouyingchao


      People

        Assignee: sinago zhouyingchao
        Reporter: sinago zhouyingchao
        Votes: 0
        Watchers: 7

      Dates

        Created:
        Updated:
        Resolved: