  Hadoop HDFS / HDFS-15779

EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.1, 3.4.0, 3.2.3
    • Component/s: None
    • Labels: None

      Description

      The DataNode (DN) log shows the following NullPointerException:

      2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY
      //...
      2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out
      2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695
      java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299)
              at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139)
              at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115)
              at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50010
      

      The NPE occurs at `writer.getTargetBuffer()` in the following code:

      // StripedWriter#clearBuffers
      void clearBuffers() {
        for (StripedBlockWriter writer : writers) {
          ByteBuffer targetBuffer = writer.getTargetBuffer();
          if (targetBuffer != null) {
            targetBuffer.clear();
          }
        }
      }
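
      The loop above only checks `targetBuffer` for null, not `writer` itself. The following standalone sketch reproduces that failure shape outside Hadoop. `FakeWriter` and `NullWriterDemo` are hypothetical stand-ins introduced here for illustration and are not part of the HDFS code base.

```java
import java.nio.ByteBuffer;

public class NullWriterDemo {
    // Hypothetical stand-in for StripedBlockWriter; only the buffer accessor is modeled.
    static class FakeWriter {
        private final ByteBuffer targetBuffer = ByteBuffer.allocate(16);

        ByteBuffer getTargetBuffer() {
            return targetBuffer;
        }
    }

    // Same loop shape as StripedWriter#clearBuffers in the report.
    static void clearBuffers(FakeWriter[] writers) {
        for (FakeWriter writer : writers) {
            // Throws NullPointerException when the array slot is null.
            ByteBuffer targetBuffer = writer.getTargetBuffer();
            if (targetBuffer != null) {
                targetBuffer.clear();
            }
        }
    }

    // Returns true if clearBuffers fails with an NPE for the given array.
    static boolean throwsNpe(FakeWriter[] writers) {
        try {
            clearBuffers(writers);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        FakeWriter[] writers = new FakeWriter[3];
        writers[0] = new FakeWriter();
        writers[2] = new FakeWriter(); // writers[1] is left null on purpose
        System.out.println("NPE thrown: " + throwsNpe(writers));
    }
}
```

      In this sketch a fully populated array passes through the loop without error, while a single null slot is enough to trigger the exception.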
      

      So why is the writer null? Let's trace where the writers are initialized and where reconstruct() is called:

      // StripedBlockReconstructor#run
      public void run() {
        try {
          initDecoderIfNecessary();
      
          getStripedReader().init();
      
          stripedWriter.init();  //①
      
          reconstruct();  //②
      
          stripedWriter.endTargetBlocks();
        } catch (Throwable e) {
          LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
          // ...

      The writers are initialized at ① and reconstruct() is called at ②. `stripedWriter.init()` calls `initTargetStreams()`, shown here:

      // StripedWriter#initTargetStreams
      int initTargetStreams() {
        int nSuccess = 0;
        for (short i = 0; i < targets.length; i++) {
          try {
            writers[i] = createWriter(i);
            nSuccess++;
            targetsStatus[i] = true;
          } catch (Throwable e) {
            LOG.warn(e.getMessage());
          }
        }
        return nSuccess;
      }
      

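      The NPE occurs when createWriter() throws for some, but not all, targets, that is, when 0 < nSuccess < targets.length. Each failed index leaves writers[i] unassigned (null), and clearBuffers() later calls getTargetBuffer() on that null entry.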
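
      The partial-initialization path can be simulated in isolation. The sketch below assumes a hypothetical `createWriter` that fails for odd indices, and shows one possible guard: skipping null writers in `clearBuffers`. It is an illustration only; the attached patches may take a different approach, and `PartialInitDemo` and `FakeWriter` are names invented for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

public class PartialInitDemo {
    // Hypothetical stand-in for StripedBlockWriter.
    static class FakeWriter {
        private final ByteBuffer targetBuffer = ByteBuffer.allocate(8);

        ByteBuffer getTargetBuffer() {
            return targetBuffer;
        }
    }

    final FakeWriter[] writers;
    final boolean[] targetsStatus;

    PartialInitDemo(int numTargets) {
        writers = new FakeWriter[numTargets];
        targetsStatus = new boolean[numTargets];
    }

    // Assumption for the demo: odd-indexed targets fail, as a connection timeout might.
    FakeWriter createWriter(int i) throws IOException {
        if (i % 2 == 1) {
            throw new IOException("Connection timed out");
        }
        return new FakeWriter();
    }

    // Mirrors the shape of StripedWriter#initTargetStreams from the report.
    int initTargetStreams() {
        int nSuccess = 0;
        for (int i = 0; i < writers.length; i++) {
            try {
                writers[i] = createWriter(i);
                nSuccess++;
                targetsStatus[i] = true;
            } catch (Throwable e) {
                // The failure is swallowed, so writers[i] stays null.
            }
        }
        return nSuccess;
    }

    // Null-guarded variant: skips writers that were never created.
    // Returns the number of buffers cleared so the effect is observable.
    int clearBuffers() {
        int cleared = 0;
        for (FakeWriter writer : writers) {
            ByteBuffer targetBuffer = writer != null ? writer.getTargetBuffer() : null;
            if (targetBuffer != null) {
                targetBuffer.clear();
                cleared++;
            }
        }
        return cleared;
    }

    public static void main(String[] args) {
        PartialInitDemo demo = new PartialInitDemo(3);
        int nSuccess = demo.initTargetStreams();
        System.out.println("nSuccess = " + nSuccess + " of " + demo.writers.length);
        System.out.println("buffers cleared = " + demo.clearBuffers());
    }
}
```

      With three targets, the simulated failure at index 1 gives 0 < nSuccess < targets.length, which is the condition described above, and the guarded loop completes without an exception.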

        Attachments

        1. HDFS-15779.001.patch
          1 kB
          Hongbing Wang
        2. HDFS-15779.002.patch
          1 kB
          Hongbing Wang

            People

            • Assignee: Hongbing Wang (wanghongbing)
            • Reporter: Hongbing Wang (wanghongbing)
            • Votes: 0
            • Watchers: 4
