[HDFS-14946] Erasure Coding: Block recovery failed during decommissioning


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.3, 3.2.1, 3.1.3
    • Fix Version/s: 3.3.0, 3.1.4, 3.2.2
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The DataNode logs the following:

      org.apache.hadoop.HadoopIllegalArgumentException: No enough valid inputs are provided, not recoverable
      at org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.checkInputBuffers(ByteBufferDecodingState.java:119)
      at org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.<init>(ByteBufferDecodingState.java:47)
      at org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:86)
      at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstructTargets(StripedBlockReconstructor.java:126)
      at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:97)
      at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:748)

      Block recovery always fails because the srcNodes are passed in the wrong order.

      Steps to reproduce:

      1. An EC block group (b0, b1, b2, b3, b4, b5, b6, b7, b8); b[0-8] are on dn[0-8], and dn[0-3] are decommissioning.
      2. dn[1-3] finish decommissioning while dn0 is still decommissioning; the block group's replicas are [b0(decommissioning), b[1-3](decommissioned), b[4-8](live), b[0-3](live)].
      3. dn4 crashes, so b4 must be recovered; the replicas are now [b0(decommissioning), b[1-3](decommissioned), null, b[5-8](live), b[0-3](live)].

      We then see the error log above, and b4 is not recovered successfully. The srcNodes transferred to the recovering DataNode contain the blocks [b0, b[5-8], b[0-3]], and the DataNode opens only the first minRequiredSources readers, [b0, b[5-8], b0], to reconstruct the missing block (minRequiredSources = Math.min(cellsNum, dataBlkNum)). Because b0 appears twice among those sources, only five distinct internal blocks are actually read, which is fewer than the number of data blocks the decoder needs.
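The counting argument above can be sketched as follows. This is a minimal, self-contained illustration (not Hadoop code): it assumes an RS(6,3) policy matching the nine internal blocks b0..b8 in the repro, and hard-codes the srcNodes index ordering described above, with the decommissioning replica of b0 first and the re-replicated copies of b[0-3] at the tail.

```java
import java.util.HashSet;
import java.util.Set;

public class SrcNodesOrderingDemo {
    // RS(6,3): 6 data + 3 parity internal blocks, indices 0..8.
    static final int DATA_BLK_NUM = 6;

    public static void main(String[] args) {
        // Internal-block indices of srcNodes as handed to the reconstructing
        // DataNode after the steps above (b4 is the missing block):
        // b0(decommissioning), b[5-8](live), then the re-replicated b[0-3].
        int[] srcBlockIndices = {0, 5, 6, 7, 8, 0, 1, 2, 3};

        // The DataNode opens only minRequiredSources readers,
        // minRequiredSources = Math.min(cellsNum, dataBlkNum) = 6 here.
        int minRequiredSources = DATA_BLK_NUM;

        Set<Integer> distinct = new HashSet<>();
        for (int i = 0; i < minRequiredSources; i++) {
            distinct.add(srcBlockIndices[i]);
        }

        // Both copies of b0 fall inside the first 6 sources, so only 5
        // distinct internal blocks are read -- fewer than the 6 the RS
        // decoder needs, hence "No enough valid inputs ... not recoverable".
        System.out.println("distinct inputs = " + distinct.size());
        System.out.println("recoverable = " + (distinct.size() >= DATA_BLK_NUM));
    }
}
```

With the decommissioning node sorted after the live nodes (as the fix does), the first six readers would cover six distinct internal blocks and the reconstruction would succeed.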

        Attachments

        1. HDFS-14946-branch-3.2.001.patch
          8 kB
          Fei Hui
        2. HDFS-14946-branch-3.1.001.patch
          8 kB
          Fei Hui
        3. HDFS-14946.004.patch
          8 kB
          Fei Hui
        4. HDFS-14946.003.patch
          8 kB
          Fei Hui
        5. HDFS-14946.002.patch
          8 kB
          Fei Hui
        6. HDFS-14946.001.patch
          8 kB
          Fei Hui


            People

            • Assignee:
              ferhui Fei Hui
              Reporter:
              ferhui Fei Hui
            • Votes: 0
              Watchers: 4
