Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-11333

Fix deadlock in DomainSocketWatcher when the notification pipe is full

    XMLWordPrintableJSON

Details

    Description

      I found some of our DataNodes will run "exceeds the limit of concurrent xciever", the limit is 4K.

      After check the stack, I suspect that org.apache.hadoop.net.unix.DomainSocket.writeArray0 which called by DomainSocketWatcher.kick stuck:

      "DataXceiver for client unix:/var/run/hadoop-hdfs/dn Waiting for operation #1" daemon prio=10 tid=0x00007f55c5576000 nid=0x385d waiting on condition [0x00007f558d5d4000]
      java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)

      • parking to wait for <0x0000000740df9c90> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)

        "DataXceiver for client unix:/var/run/hadoop-hdfs/dn Waiting for operation #1" daemon prio=10 tid=0x00007f7de034c800 nid=0x7b7 runnable [0x00007f7db06c5000]
        java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
        at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
        at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:745)

      "DataXceiver for client unix:/var/run/hadoop-hdfs/dn Waiting for operation #1" daemon prio=10 tid=0x00007f55c5574000 nid=0x377a waiting on condition [0x00007f558d7d6000]
      java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)

      • parking to wait for <0x0000000740df9cb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:745)

      "Thread-163852" daemon prio=10 tid=0x00007f55c811c800 nid=0x6757 runnable [0x00007f55aef6e000]
      java.lang.Thread.State: RUNNABLE
      at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
      at org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52)
      at org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457)
      at java.lang.Thread.run(Thread.java:745)

      Attachments

        1. HADOOP-11333-1.patch
          2 kB
          yunjiong zhao
        2. HADOOP-11333.patch
          1 kB
          yunjiong zhao
        3. 11241025
          1.97 MB
          yunjiong zhao
        4. 11241023
          1.96 MB
          yunjiong zhao
        5. 11241021
          1.94 MB
          yunjiong zhao

        Issue Links

          Activity

            People

              zhaoyunjiong yunjiong zhao
              zhaoyunjiong yunjiong zhao
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: