Hadoop HDFS / HDFS-15413

DFSStripedInputStream throws exception when datanodes close idle connections


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.1.3
    • Fix Version/s: None
    • Component/s: ec, erasure-coding, hdfs-client
    • Environment:
      • Hadoop 3.1.3
      • erasure coding with ISA-L and the RS-3-2-1024k scheme
      • running in Kubernetes
      • dfs.client.socket-timeout = 10000
      • dfs.datanode.socket.write.timeout = 10000

    Description

      We've run into an issue with compactions failing in HBase when erasure coding is enabled on a table directory. After digging further, I was able to narrow it down to the seek + read logic and to reproduce the issue with the HDFS client alone:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ReaderRaw {
          public static void main(final String[] args) throws Exception {
              Path p = new Path(args[0]);
              int bufLen = Integer.parseInt(args[1]);
              int sleepDuration = Integer.parseInt(args[2]);
              int countBeforeSleep = Integer.parseInt(args[3]);
              int countAfterSleep = Integer.parseInt(args[4]);
              Configuration conf = new Configuration();

              FSDataInputStream istream = FileSystem.get(conf).open(p);

              byte[] buf = new byte[bufLen];
              int readTotal = 0;
              int count = 0;
              try {
                  while (true) {
                      // seek back to where the previous buffer ended before reading the next one
                      istream.seek(readTotal);

                      // fill the buffer completely, looping over short reads
                      int bytesRemaining = bufLen;
                      int bufOffset = 0;
                      while (bytesRemaining > 0) {
                          int nread = istream.read(buf, bufOffset, bytesRemaining);
                          if (nread < 0) {
                              throw new Exception("nread is less than zero");
                          }
                          readTotal += nread;
                          bufOffset += nread;
                          bytesRemaining -= nread;
                      }

                      count++;
                      if (count == countBeforeSleep) {
                          // stay idle long enough for the datanodes to time out the connections
                          System.out.println("sleeping for " + sleepDuration + " milliseconds");
                          Thread.sleep(sleepDuration);
                          System.out.println("resuming");
                      }
                      if (count == countBeforeSleep + countAfterSleep) {
                          System.out.println("done");
                          break;
                      }
                  }
              } catch (Exception e) {
                  System.out.println("exception on read " + count + " read total " + readTotal);
                  throw e;
              }
          }
      }
      

      The issue appears to be due to the fact that the datanodes close the connection to the EC client if it doesn't fetch the next packet for longer than dfs.client.socket-timeout. The EC client doesn't retry; instead it assumes those datanodes went away, resulting in a "missing blocks" exception.
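
      For what it's worth, one possible application-level workaround is to catch the failure, reopen the file, and retry the positioned read on a fresh stream. A minimal sketch of that idea (the class and helper names are made up for illustration; this is not part of the repro above):

      import java.io.IOException;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ReopenAndRetry {
          // Hypothetical helper: retry a positioned read on a freshly opened stream,
          // so a connection dropped by the datanodes doesn't surface as a hard failure.
          static int readWithReopen(FileSystem fs, Path p, long pos, byte[] buf) throws IOException {
              final int maxAttempts = 2; // illustrative retry budget
              IOException last = null;
              for (int attempt = 0; attempt < maxAttempts; attempt++) {
                  try (FSDataInputStream in = fs.open(p)) {
                      in.seek(pos);
                      return in.read(buf, 0, buf.length);
                  } catch (IOException e) {
                      last = e; // e.g. the "missing blocks" exception described above
                  }
              }
              throw last;
          }
      }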

      I was able to consistently reproduce with the following arguments:

      bufLen = 1000000 (just below 1 MB, which is the stripe cell size)
      sleepDuration = dfs.client.socket-timeout + 1000 ms (in our case 11000)
      countBeforeSleep = 1
      countAfterSleep = 7
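
      For context on those numbers: RS-3-2-1024k means 3 data units plus 2 parity units with a 1024 KB cell, so bufLen sits just below one cell and sleepDuration just above the configured timeout. A quick sanity check of that arithmetic (values hard-coded from the policy name and the config above):

      public class ReproGeometry {
          public static void main(String[] args) {
              // RS-3-2-1024k: 3 data units, 2 parity units, 1024 KB cells
              final int dataUnits = 3;
              final int parityUnits = 2;
              final int cellSize = 1024 * 1024;        // 1,048,576 bytes per cell
              final int bufLen = 1000000;              // repro buffer, just under one cell
              final int clientTimeoutMs = 10000;       // dfs.client.socket-timeout
              final int sleepDurationMs = 11000;       // sleep used in the repro

              System.out.println("block group = " + dataUnits + " data + " + parityUnits + " parity blocks");
              System.out.println("data stripe width = " + (dataUnits * cellSize) + " bytes");
              System.out.println("bufLen < cellSize:      " + (bufLen < cellSize));
              System.out.println("sleep > client timeout: " + (sleepDurationMs > clientTimeoutMs));
          }
      }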
      

      I've attached the entire log output of running the ReaderRaw snippet above against an erasure-coded file with the RS-3-2-1024k policy. Here are the logs from the datanodes disconnecting the client:

      datanode 1:

      2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation  src: /10.128.23.40:53748 dst: /10.128.14.46:9866); java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.14.46:9866 remote=/10.128.23.40:53748]
      

      datanode 2:

      2020-06-15 19:06:20,341 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-1-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation  src: /10.128.23.40:48772 dst: /10.128.9.42:9866); java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.9.42:9866 remote=/10.128.23.40:48772]
      

      datanode 3:

      2020-06-15 19:06:20,467 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-3-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation  src: /10.128.23.40:57184 dst: /10.128.16.13:9866); java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.16.13:9866 remote=/10.128.23.40:57184]
      

      I've tried running the same code against non-EC files with a replication factor of 3 and was not able to reproduce the issue with any parameters. Looking through the code, it's pretty clear that the non-EC DFSInputStream retries reads after an exception: https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L844
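
      To illustrate the difference in behavior, here is a rough sketch of the shape of that retry loop. This is a simplified illustration, not the actual DFSInputStream code (the real logic also marks the failed datanode dead and refetches block locations), whereas the striped path currently surfaces the failure on the first attempt:

      import java.io.IOException;

      public class RetryShape {
          // Hypothetical stand-in for one attempt to fill the read buffer from a datanode.
          interface BufferRead {
              int read() throws IOException;
          }

          // Simplified illustration of the retry loop in the non-striped read path:
          // a connection the datanode has already closed costs one retry, not a failed read.
          static int readWithRetries(BufferRead attempt) throws IOException {
              int retries = 2;
              while (true) {
                  try {
                      return attempt.read();
                  } catch (IOException e) {
                      // the real code marks the datanode dead and refetches block locations here
                      if (--retries == 0) {
                          throw e;
                      }
                  }
              }
          }
      }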

      Let me know if you need any more information to help with addressing this issue.

People

    • Assignee: Unassigned
    • Reporter: Andrey Elenskiy (timoha)
