Hadoop HDFS / HDFS-14946

Erasure Coding: Block recovery failed during decommissioning


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.3, 3.2.1, 3.1.3
    • Fix Version/s: 3.3.0, 3.1.4, 3.2.2
    • None
    • None
    • Reviewed

    Description

      The DataNode logs the following exception:

      org.apache.hadoop.HadoopIllegalArgumentException: No enough valid inputs are provided, not recoverable
      at org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.checkInputBuffers(ByteBufferDecodingState.java:119)
      at org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.<init>(ByteBufferDecodingState.java:47)
      at org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:86)
      at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstructTargets(StripedBlockReconstructor.java:126)
      at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:97)
      at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:748)
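
The exception is raised by the decoder's input check. A minimal sketch of that kind of check follows, assuming the decoder receives one buffer slot per block index and needs at least as many non-null slots as there are data units. The class and method names are hypothetical, the 6+3 layout is inferred from the nine blocks b0..b8 in this report, and a plain IllegalArgumentException stands in for HadoopIllegalArgumentException so the example stays self-contained. It is not the Hadoop source.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the decoder's input validation; not Hadoop code.
final class InputCheckSketch {

    /** Counts the non-null input buffers; each slot corresponds to one block index. */
    static int countValidInputs(ByteBuffer[] inputs) {
        int valid = 0;
        for (ByteBuffer buffer : inputs) {
            if (buffer != null) {
                valid++;
            }
        }
        return valid;
    }

    /** Rejects the input set when fewer than numDataUnits slots are filled. */
    static void checkInputs(ByteBuffer[] inputs, int numDataUnits) {
        if (countValidInputs(inputs) < numDataUnits) {
            throw new IllegalArgumentException(
                "No enough valid inputs are provided, not recoverable");
        }
    }
}
```

Under these assumptions, two readers that both deliver b0 fill the same slot, so six readers can yield only five valid inputs and the check fails.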

      Block recovery always fails because the srcNodes are passed in the wrong order.

      Steps to reproduce:

      1. An EC block group (b0, b1, b2, b3, b4, b5, b6, b7, b8) has b[0-8] on dn[0-8]; dn[0-3] are decommissioning.
      2. dn[1-3] finish decommissioning while dn0 is still decommissioning; the block group is now [b0(decommissioning), b[1-3](decommissioned), b[4-8](live), b[0-3](live)].
      3. dn4 crashes, so b4 has to be recovered; the block group is now [b0(decommissioning), b[1-3](decommissioned), null, b[5-8](live), b[0-3](live)].

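      As the log above shows, b4 is never recovered. The srcNodes passed to the recovering DataNode contain blocks [b0, b[5-8], b[0-3]], but the DataNode reads only the first minRequiredSources of them, where minRequiredSources = Math.min(cellsNum, dataBlkNum). Those readers are [b0, b[5-8], b0]: b0 appears twice (once from the decommissioning dn0 and once from its live copy), so the decoder receives fewer distinct inputs than it needs and cannot recover the missing block.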
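
The selection problem described above can be sketched in isolation. The example below is an illustration under stated assumptions, not the committed patch: the class and method names are hypothetical, blocks are represented only by their index in the group, and the source order [b0, b5, b6, b7, b8, b0, b1, b2, b3] is taken from the description. It contrasts taking the first minRequiredSources entries as they arrive with skipping an index that has already been picked.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of source selection; not the HDFS-14946 patch.
final class SourceSelectionSketch {

    /** Naive selection: take the first minRequired entries in arrival order. */
    static List<Integer> firstN(int[] srcBlockIndices, int minRequired) {
        List<Integer> picked = new ArrayList<>();
        for (int i = 0; i < srcBlockIndices.length && picked.size() < minRequired; i++) {
            picked.add(srcBlockIndices[i]);
        }
        return picked;
    }

    /** Duplicate-aware selection: ignore a block index that was already picked. */
    static List<Integer> firstNDistinct(int[] srcBlockIndices, int minRequired) {
        Set<Integer> picked = new LinkedHashSet<>();
        for (int index : srcBlockIndices) {
            if (picked.size() == minRequired) {
                break;
            }
            picked.add(index); // no effect when the index is already present
        }
        return new ArrayList<>(picked);
    }

    /** Number of distinct block indices in a selection. */
    static int distinctCount(List<Integer> picked) {
        return new HashSet<>(picked).size();
    }
}
```

With the order from the description and minRequired = 6, firstN picks indices 0, 5, 6, 7, 8, 0, which is only five distinct blocks, while firstNDistinct picks 0, 5, 6, 7, 8, 1. Reordering the sources so that duplicates come last would have the same effect; which approach the actual fix takes is not stated in this description.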

      Attachments

        1. HDFS-14946-branch-3.2.001.patch
          8 kB
          Hui Fei
        2. HDFS-14946-branch-3.1.001.patch
          8 kB
          Hui Fei
        3. HDFS-14946.004.patch
          8 kB
          Hui Fei
        4. HDFS-14946.003.patch
          8 kB
          Hui Fei
        5. HDFS-14946.002.patch
          8 kB
          Hui Fei
        6. HDFS-14946.001.patch
          8 kB
          Hui Fei

          People

            Assignee: Hui Fei (ferhui)
            Reporter: Hui Fei (ferhui)
            Votes: 0
            Watchers: 4
