Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-11333

Fix deadlock in DomainSocketWatcher when the notification pipe is full

    XMLWordPrintableJSON

Details

    Description

      I found some of our DataNodes will run "exceeds the limit of concurrent xciever", the limit is 4K.

      After check the stack, I suspect that org.apache.hadoop.net.unix.DomainSocket.writeArray0 which called by DomainSocketWatcher.kick stuck:

      "DataXceiver for client unix:/var/run/hadoop-hdfs/dn Waiting for operation #1" daemon prio=10 tid=0x00007f55c5576000 nid=0x385d waiting on condition [0x00007f558d5d4000]
      java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)

      • parking to wait for <0x0000000740df9c90> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)

        "DataXceiver for client unix:/var/run/hadoop-hdfs/dn Waiting for operation #1" daemon prio=10 tid=0x00007f7de034c800 nid=0x7b7 runnable [0x00007f7db06c5000]
        java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
        at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
        at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:745)

      "DataXceiver for client unix:/var/run/hadoop-hdfs/dn Waiting for operation #1" daemon prio=10 tid=0x00007f55c5574000 nid=0x377a waiting on condition [0x00007f558d7d6000]
      java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)

      • parking to wait for <0x0000000740df9cb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:745)

      "Thread-163852" daemon prio=10 tid=0x00007f55c811c800 nid=0x6757 runnable [0x00007f55aef6e000]
      java.lang.Thread.State: RUNNABLE
      at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
      at org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52)
      at org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457)
      at java.lang.Thread.run(Thread.java:745)

      Attachments

        1. 11241021
          1.94 MB
          yunjiong zhao
        2. 11241023
          1.96 MB
          yunjiong zhao
        3. 11241025
          1.97 MB
          yunjiong zhao
        4. HADOOP-11333.patch
          1 kB
          yunjiong zhao
        5. HADOOP-11333-1.patch
          2 kB
          yunjiong zhao

        Issue Links

          Activity

            People

              zhaoyunjiong yunjiong zhao
              zhaoyunjiong yunjiong zhao
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: