We tried the EC feature on an 80-node cluster running Hadoop 3.1.1 and hit the same scenario you described in https://issues.apache.org/jira/browse/HDFS-8881. Our test steps are below; we hope they are helpful. (The DNs mentioned below are the ones holding the internal blocks under test.)
- We defined a custom 10-2-1024k policy and set it on a path. At this point we have 12 internal blocks (12 live blocks).
- Decommission one DN. After decommissioning completes, we have 13 internal blocks (12 live blocks and 1 decommissioned block).
- Shut down another DN, one that does not hold the same block ID as the decommissioned block. Now we have 12 internal blocks (11 live blocks and 1 decommissioned block).
- Wait about 600 s (before the heartbeat arrives), then recommission the decommissioned DN. Now we have 12 internal blocks (11 live blocks and 1 duplicate block).
- After that, EC does not reconstruct the missing block.
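To make the counts in the steps above easier to follow, here is a minimal sketch that replays them on a toy model of one block group. It is not Hadoop code: `summarize`, `run_scenario`, and the choice of index 0 for the decommissioned DN and index 5 for the DN that was shut down are assumptions made only for illustration.

```python
from collections import Counter

DATA_UNITS, PARITY_UNITS = 10, 2          # the custom 10-2-1024k policy
GROUP_SIZE = DATA_UNITS + PARITY_UNITS    # 12 internal blocks per block group


def summarize(replicas):
    """Summarize a list of (block_index, state) pairs for one block group."""
    live = Counter(index for index, state in replicas if state == "live")
    return {
        "total": len(replicas),
        "live": sum(live.values()),
        "decommissioned": sum(1 for _, s in replicas if s == "decommissioned"),
        "distinct_live": len(live),
        "missing": sorted(set(range(GROUP_SIZE)) - set(live)),
        "duplicated": sorted(i for i, n in live.items() if n > 1),
    }


def run_scenario(decommissioned_index=0, shutdown_index=5):
    """Replay the four test steps; both index arguments are hypothetical."""
    steps = []

    # Step 1: policy set on the path, every internal block is live.
    replicas = [(i, "live") for i in range(GROUP_SIZE)]
    steps.append(summarize(replicas))

    # Step 2: one DN is decommissioned. Its block now has a live copy on
    # another DN (the existing "live" entry) plus the decommissioned copy.
    replicas.append((decommissioned_index, "decommissioned"))
    steps.append(summarize(replicas))

    # Step 3: a DN holding a different internal block is shut down.
    replicas.remove((shutdown_index, "live"))
    steps.append(summarize(replicas))

    # Step 4: the decommissioned DN is recommissioned, so its copy is live again.
    replicas.remove((decommissioned_index, "decommissioned"))
    replicas.append((decommissioned_index, "live"))
    steps.append(summarize(replicas))

    return steps


if __name__ == "__main__":
    for number, summary in enumerate(run_scenario(), start=1):
        print(number, summary)
```

In the last step the toy model has 12 live replicas, but they cover only 11 distinct indices: one index is duplicated and one is missing at the same time. The sketch only restates the counts we observed; it does not establish how the NameNode actually does its accounting.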
We think this is a critical issue for using the EC feature in a production environment. Could you help? Thanks a lot!