Hadoop HDFS / HDFS-7441

More accurate detection for slow node in HDFS write pipeline

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      A DataNode (DN) can be slow because of OS or hardware issues, and the HDFS write pipeline sometimes fails to identify the slow DN correctly. Slow-node detection need not be specific to the HDFS write pipeline: when a node is slow because of an OS or hardware issue, it is better to exclude it from HDFS reads and writes as well as from YARN/MR operations. The issue here is that the write operation takes a long time for a given block. We need a mechanism that detects this situation reliably for high-throughput applications.

      In the following example, the MR task runs on 1.2.3.4, which is also the slow DN that should have been removed from the pipeline. Instead, HDFS removed the healthy DN 5.6.7.8. After the pipeline was rebuilt, HDFS went on to remove the newly added healthy DN 9.10.11.12, and so on.

      DFSClient log on 1.2.3.4

      2014-11-19 20:50:22,601 WARN [ResponseProcessor for block blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_1157561391_1102030131492
      java.io.IOException: Bad response ERROR for block blk_1157561391_1102030131492 from datanode 5.6.7.8:50010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:823)
      2014-11-19 20:50:22,977 WARN [DataStreamer for file ...  block blk_1157561391_1102030131492] org.apache.hadoop.hdfs.DFSClient: Error Recovery for blk_1157561391_1102030131492 in pipeline 1.2.3.4:50010, 5.6.7.8:50010: bad datanode 5.6.7.8:50010
      

      DN Log on 1.2.3.4

      2014-11-19 20:49:56,539 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock blk_1157561391_1102030131492 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
      ...
      java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]
              at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
              at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
              at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
              at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
              at java.io.DataInputStream.read(DataInputStream.java:149)
              at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
      

      DN Log on 5.6.7.8

      2014-11-19 20:49:56,275 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for blk_1157561391_1102030131492
      java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/5.6.7.8:50010 remote=/1.2.3.4:48858]
              at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
              at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
              at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
              at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
              at java.io.DataInputStream.read(DataInputStream.java:149)
              at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
              at java.lang.Thread.run(Thread.java:745)
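
      Both DN logs above show a 60000 ms read timeout whose remote end is 1.2.3.4, which is what points at 1.2.3.4 rather than 5.6.7.8. As an illustration only, the sketch below groups such timeout lines by the remote host being read from. The function names (upstream_suspects, most_blamed) and the "most receivers timed out on it" heuristic are assumptions made for this example; they are not part of HDFS and not a proposed fix.

```python
import re
from collections import defaultdict

# Matches the read-timeout message seen in the DN logs above and captures
# the local (waiting) host and the remote (upstream) host.
TIMEOUT_RE = re.compile(
    r"SocketTimeoutException: (?P<millis>\d+) millis timeout while waiting "
    r"for channel to be ready for read\..*?"
    r"local=/(?P<local>[\d.]+):\d+ remote=/(?P<remote>[\d.]+):\d+"
)

# Two lines taken from the DN logs in this issue (shortened prefixes).
SAMPLE_LINES = [
    "java.net.SocketTimeoutException: 60000 millis timeout while waiting for "
    "channel to be ready for read. ch : java.nio.channels.SocketChannel"
    "[connected local=/1.2.3.4:50010 remote=/1.2.3.4:32844]",
    "java.net.SocketTimeoutException: 60000 millis timeout while waiting for "
    "channel to be ready for read. ch : java.nio.channels.SocketChannel"
    "[connected local=/5.6.7.8:50010 remote=/1.2.3.4:48858]",
]


def upstream_suspects(log_lines):
    """Map each upstream host to the set of hosts that timed out reading from it."""
    waiters = defaultdict(set)
    for line in log_lines:
        match = TIMEOUT_RE.search(line)
        if match:
            waiters[match.group("remote")].add(match.group("local"))
    return dict(waiters)


def most_blamed(log_lines):
    """Return the upstream host that the most distinct receivers timed out on.

    Hypothetical heuristic for illustration; returns None if no line matches.
    """
    suspects = upstream_suspects(log_lines)
    if not suspects:
        return None
    return max(suspects, key=lambda host: len(suspects[host]))


if __name__ == "__main__":
    print(most_blamed(SAMPLE_LINES))
```

      On the two sample lines, both receivers (1.2.3.4 and 5.6.7.8) are grouped under the upstream host 1.2.3.4. A real detector would need more than log matching, for example per-node timing collected over many blocks, since a single timeout can also be caused by the network or by the reader.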
      

          People

            Assignee: Unassigned
            Reporter: Ming Ma (mingma)
            Votes: 1
            Watchers: 12
