Hadoop HDFS / HDFS-15413

DFSStripedInputStream throws exception when datanodes close idle connections


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.1.3
    • Fix Version/s: None
    • Component/s: ec, erasure-coding, hdfs-client
    • Labels: None
    • Environment:
      • Hadoop 3.1.3
      • erasure coding with ISA-L and the RS-3-2-1024k scheme
      • running in Kubernetes
      • dfs.client.socket-timeout = 10000
      • dfs.datanode.socket.write.timeout = 10000
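
      The two timeout settings above can be expressed in hdfs-site.xml form as follows (a configuration sketch using the values reported in this environment):

      ```xml
      <configuration>
        <!-- Read timeout for client connections; the datanode drops a
             reader that stays idle longer than this (10000 ms here). -->
        <property>
          <name>dfs.client.socket-timeout</name>
          <value>10000</value>
        </property>
        <!-- Datanode-side socket write timeout toward the client,
             also 10000 ms in this environment. -->
        <property>
          <name>dfs.datanode.socket.write.timeout</name>
          <value>10000</value>
        </property>
      </configuration>
      ```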

    Description

      We've run into an issue with compactions failing in HBase when erasure coding is enabled on a table directory. After digging further, I was able to narrow it down to the seek + read logic and to reproduce the issue with the HDFS client alone:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.FSDataInputStream;
      
      public class ReaderRaw {
          public static void main(final String[] args) throws Exception {
              Path p = new Path(args[0]);
              int bufLen = Integer.parseInt(args[1]);
              int sleepDuration = Integer.parseInt(args[2]);
              int countBeforeSleep = Integer.parseInt(args[3]);
              int countAfterSleep = Integer.parseInt(args[4]);
              Configuration conf = new Configuration();
      
              FSDataInputStream istream = FileSystem.get(conf).open(p);
      
              byte[] buf = new byte[bufLen];
              int readTotal = 0;
              int count = 0;
              try {
                while (true) {
                  istream.seek(readTotal);
      
                  int bytesRemaining = bufLen;
                  int bufOffset = 0;
                  while (bytesRemaining > 0) {
                    // read into the unfilled portion of the buffer
                    int nread = istream.read(buf, bufOffset, bytesRemaining);
                    if (nread < 0) {
                        throw new Exception("nread is less than zero");
                    }
                    readTotal += nread;
                    bufOffset += nread;
                    bytesRemaining -= nread;
                  }
      
                  count++;
                  if (count == countBeforeSleep) {
                      System.out.println("sleeping for " + sleepDuration + " milliseconds");
                      Thread.sleep(sleepDuration);
                      System.out.println("resuming");
                  }
                  if (count == countBeforeSleep + countAfterSleep) {
                      System.out.println("done");
                      break;
                  }
                }
              } catch (Exception e) {
                  System.out.println("exception on read " + count + " read total " + readTotal);
                  throw e;
              }
          }
      }
      

      The issue appears to be that datanodes close the connection to the EC client if it doesn't fetch the next packet for longer than dfs.client.socket-timeout. The EC client doesn't retry; instead it assumes those datanodes went away, which results in a "missing blocks" exception.

      I was able to consistently reproduce with the following arguments:

      bufLen = 1000000 (just below 1 MB, which is the stripe cell size)
      sleepDuration = dfs.client.socket-timeout + 1 second (in our case 11000 ms)
      countBeforeSleep = 1
      countAfterSleep = 7
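
      The parameter arithmetic above can be checked with a few lines of plain Java (values are taken from this report; nothing Hadoop-specific is involved):

      ```java
      public class ReproParams {
          public static void main(String[] args) {
              int cellSize = 1024 * 1024;   // RS-3-2-1024k uses a 1 MiB cell
              int bufLen = 1000000;         // each read stays below one cell
              int socketTimeoutMs = 10000;  // dfs.client.socket-timeout
              // Sleep one second past the timeout so the datanodes are
              // guaranteed to drop the idle reader.
              int sleepDurationMs = socketTimeoutMs + 1000;

              System.out.println(bufLen < cellSize);
              System.out.println(sleepDurationMs);
          }
      }
      ```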
      

      I've attached the full log output from running the snippet above against an erasure-coded file with the RS-3-2-1024k policy. Here are the datanode logs showing the client being disconnected:

      datanode 1:

      2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation  src: /10.128.23.40:53748 dst: /10.128.14.46:9866); java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.14.46:9866 remote=/10.128.23.40:53748]
      

      datanode 2:

      2020-06-15 19:06:20,341 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-1-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation  src: /10.128.23.40:48772 dst: /10.128.9.42:9866); java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.9.42:9866 remote=/10.128.23.40:48772]
      

      datanode 3:

      2020-06-15 19:06:20,467 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-3-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation  src: /10.128.23.40:57184 dst: /10.128.16.13:9866); java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.16.13:9866 remote=/10.128.23.40:57184]
      

      I've tried running the same code against non-EC files with a replication factor of 3 and was not able to reproduce the issue with any parameters. Looking through the code, it's clear that the non-EC DFSInputStream retries reads after an exception: https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L844
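
      The retry-on-timeout pattern the non-EC path performs can be sketched in plain Java. BlockReader and readWithRetry here are hypothetical stand-ins used only to illustrate the pattern; they are not Hadoop's actual classes or logic:

      ```java
      import java.net.SocketTimeoutException;

      public class RetrySketch {
          // Hypothetical stand-in for a block reader.
          interface BlockReader {
              int read(byte[] buf, int off, int len) throws SocketTimeoutException;
          }

          // Retry once on a socket timeout instead of failing outright
          // (a sketch of the behavior, not the real implementation).
          static int readWithRetry(BlockReader reader, byte[] buf)
                  throws SocketTimeoutException {
              for (int attempt = 0; ; attempt++) {
                  try {
                      return reader.read(buf, 0, buf.length);
                  } catch (SocketTimeoutException e) {
                      if (attempt >= 1) {
                          throw e; // give up after one retry
                      }
                      // The real client would re-open the block reader here.
                  }
              }
          }

          public static void main(String[] args) throws Exception {
              // Simulated reader: the first call fails the way a
              // datanode-closed connection does, the second succeeds.
              BlockReader flaky = new BlockReader() {
                  int calls = 0;
                  public int read(byte[] buf, int off, int len)
                          throws SocketTimeoutException {
                      if (calls++ == 0) {
                          throw new SocketTimeoutException("10000 millis timeout");
                      }
                      return len;
                  }
              };
              System.out.println(readWithRetry(flaky, new byte[16]));
          }
      }
      ```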

      Let me know if you need any more information to help address this issue.

      Attachments

        1. out.log (94 kB, Andrey Elenskiy)

        People

            Assignee: Unassigned
            Reporter: Andrey Elenskiy (timoha)
            Votes: 3
            Watchers: 11