Hadoop HDFS / HDFS-15407

Hedged read will not work if a datanode is slow for a long time

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.1
    • Fix Version/s: None
    • Component/s: 3.1.1, datanode
    • Labels: None

    Description

      I used cgroups to limit the IO of one datanode to 1024 bytes/s and then read a file with hedged reads enabled (dfs.client.hedged.read.threadpool.size set to 5, dfs.client.hedged.read.threshold.millis set to 500). The first 5 buffer reads timed out on the slow datanode after 500 ms each, and the hedged reads switched to other datanodes and succeeded. After that, the read was stuck for a long time (about 10 minutes in the log below, matching the 600000 ms socket timeout) until a SocketTimeoutException was thrown. The log is as follows:

      2020-06-11 16:40:07,832 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
      2020-06-11 16:40:08,562 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
      2020-06-11 16:40:09,102 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
      2020-06-11 16:40:09,642 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
      2020-06-11 16:40:10,182 | INFO | main | Waited 500ms to read from DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK]; spawning hedged read | DFSInputStream.java:1188
      2020-06-11 16:40:10,182 | INFO | main | Execution rejected, Executing in current thread | DFSClient.java:3049
      2020-06-11 16:40:10,219 | INFO | main | Execution rejected, Executing in current thread | DFSClient.java:3049
      2020-06-11 16:50:07,638 | WARN | hedgedRead-0 | I/O error constructing remote block reader. | BlockReaderFactory.java:764
      java.net.SocketTimeoutException: 600000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xx.xx.xx.113:62750 remote=/xx.xx.xx.28:25009]
      at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
      at java.io.FilterInputStream.read(FilterInputStream.java:83)
      at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:551)
      at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:418)
      at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
      at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
      at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
      at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:661)
      at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1063)
      at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1035)
      at org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1031)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      2020-06-11 16:50:07,638 | WARN | hedgedRead-0 | Connection failure: Failed to connect to /xx.xx.xx.28:25009 for file /testhdfs/test2.jar for block BP-1820384660-xx.xx.xx.74-1585533043013:blk_1082582662_8861386:java.net.SocketTimeoutException: 600000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xx.xx.xx.113:62750 remote=/xx.xx.xx.28:25009] | DFSInputStream.java:1118
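
      The two "Execution rejected, Executing in current thread" lines appear right after five hedged reads were spawned with a pool size of 5. A minimal sketch of that pattern follows, using only java.util.concurrent. It assumes the hedged read pool behaves like a bounded ThreadPoolExecutor with a caller-runs rejection policy, as the log message suggests; the class name HedgedPoolSketch, the method countCallerRuns, and the latch that stands in for a slow datanode are hypothetical and are not taken from DFSClient.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of a bounded hedged-read pool; not the DFSClient implementation.
class HedgedPoolSketch {

    /**
     * Fills a pool of poolSize threads with tasks that block (standing in for
     * reads from a slow datanode), then submits extraTasks more and returns
     * how many of those ended up running on the submitting thread.
     */
    public static int countCallerRuns(int poolSize, int extraTasks) {
        AtomicInteger ranInCaller = new AtomicInteger();
        CountDownLatch release = new CountDownLatch(1);
        Thread caller = Thread.currentThread();

        // No queueing: a task either gets a thread or is rejected.
        // CallerRunsPolicy then runs the rejected task on the submitting thread.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, poolSize, 60, TimeUnit.SECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.CallerRunsPolicy());

        Runnable slowRead = () -> {
            if (Thread.currentThread() == caller) {
                // A real read would block the reader here; the sketch only counts it.
                ranInCaller.incrementAndGet();
                return;
            }
            try {
                release.await(); // worker thread stuck on the "slow datanode"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        for (int i = 0; i < poolSize + extraTasks; i++) {
            pool.execute(slowRead);
        }

        // Unblock the workers and shut the pool down cleanly.
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ranInCaller.get();
    }

    public static void main(String[] args) {
        // Pool size 5 as in the report, plus 2 further reads.
        System.out.println("tasks run in caller thread: " + countCallerRuns(5, 2));
    }
}
```

      Run as written, the sketch prints "tasks run in caller thread: 2". Under the stated assumption, once every pool thread is occupied by a read that does not return, any further read runs on the calling thread, so the caller itself may wait on the slow datanode until the socket timeout expires.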
       

          People

            Assignee: rain_lyy liuyanyu
            Reporter: rain_lyy liuyanyu
            Votes: 0
            Watchers: 5
