Description
The DataNode logs the following:
org.apache.hadoop.HadoopIllegalArgumentException: No enough valid inputs are provided, not recoverable
at org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.checkInputBuffers(ByteBufferDecodingState.java:119)
at org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.<init>(ByteBufferDecodingState.java:47)
at org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:86)
at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstructTargets(StripedBlockReconstructor.java:126)
at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:97)
at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
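The exception comes from the input-validity check that runs before decoding: reconstructing one block of an RS-6-3 group needs at least dataBlkNum (6) distinct internal blocks among the inputs. Below is a minimal, standalone sketch of that kind of check (illustrative only, not the actual ByteBufferDecodingState code); the indices filled in main() match the scenario described further down.

// Illustrative sketch only; the real check is ByteBufferDecodingState#checkInputBuffers.
public class InputCheckSketch {

  // Require at least dataBlkNum non-null inputs, otherwise decoding cannot proceed.
  static void checkInputs(byte[][] inputs, int dataBlkNum) {
    int valid = 0;
    for (byte[] in : inputs) {
      if (in != null) {
        valid++;
      }
    }
    if (valid < dataBlkNum) {
      throw new IllegalArgumentException(
          "No enough valid inputs are provided, not recoverable");
    }
  }

  public static void main(String[] args) {
    // RS-6-3: indices 0-5 are data blocks, 6-8 are parity blocks.
    byte[][] inputs = new byte[9][];
    // Only indices 0, 5, 6, 7 and 8 receive data in the scenario below -> 5 < 6, so the check throws.
    for (int i : new int[] {0, 5, 6, 7, 8}) {
      inputs[i] = new byte[0];
    }
    checkInputs(inputs, 6);
  }
}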
Block reconstruction always fails because the srcNodes are in the wrong order.
Steps to reproduce:
- EC block group (b0, b1, b2, b3, b4, b5, b6, b7, b8); b[0-8] are on dn[0-8], and dn[0-3] are decommissioning.
- dn[1-3] become decommissioned while dn0 is still decommissioning; the block group's storages are [b0(decommissioning), b[1-3](decommissioned), b[4-8](live), b[0-3](live)].
- dn4 crashes, so b4 must be reconstructed; the block group's storages are [b0(decommissioning), b[1-3](decommissioned), null, b[5-8](live), b[0-3](live)].
We then see the error log above and b4 is not reconstructed successfully: the srcNodes passed to the reconstructing DataNode contain blocks [b0, b[5-8], b[0-3]], and the DataNode uses only the first minRequiredSources readers, i.e. [b0, b[5-8], b0] (minRequiredSources = Math.min(cellsNum, dataBlkNum)), to reconstruct the missing block. That selection contains b0 twice and therefore covers only five distinct internal blocks, which is not enough to decode.
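The following standalone sketch (not the actual StripedReader/DataNode code) illustrates the effect of the ordering, under the assumption that the reconstructing DataNode takes its readers from the head of the srcNodes list and that cellsNum >= dataBlkNum, so minRequiredSources = 6:

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the source selection, not the actual DataNode code.
public class SourceSelectionSketch {
  public static void main(String[] args) {
    int dataBlkNum = 6;                                   // RS-6-3
    int cellsNum = 9;                                     // assumed >= dataBlkNum here
    int minRequiredSources = Math.min(cellsNum, dataBlkNum);

    // srcNodes handed to the reconstructing DataNode, identified by internal block index:
    // decommissioning b0 first, then live b5-b8, then the re-replicated live b0-b3.
    List<Integer> srcBlockIndices = Arrays.asList(0, 5, 6, 7, 8, 0, 1, 2, 3);

    // Only the first minRequiredSources entries become readers.
    Set<Integer> chosen =
        new LinkedHashSet<>(srcBlockIndices.subList(0, minRequiredSources));
    System.out.println("chosen block indices = " + chosen);        // [0, 5, 6, 7, 8]
    System.out.println("distinct sources     = " + chosen.size()); // 5, but 6 are needed
  }
}

If the live replicas were ordered ahead of the decommissioning/decommissioned ones, the first six readers would cover six distinct block indices, which is why this report attributes the failure to srcNodes being in the wrong order.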