Description
We ran into the following problem.
- Job A sends a truncate request to the NameNode, which starts block recovery.
- DataNode D times out (60 s) while connecting to another DataNode, so block recovery takes more than 60 s.
- Job A fails, and job B starts and sends a truncate request to the NameNode. A new recoveryId is generated during lease recovery.
- DataNode D calls commitBlockSynchronization and gets the error "does not match current recovery id".
As a result, the truncate never completes. DataNode D holds a replica with the new length, while the other two DataNodes hold replicas with the old length.
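The sequence above can be replayed with a toy model. This is only a sketch under stated assumptions: `ToyNameNode`, its fields, the starting generation stamp, and the full wording of the exception message are hypothetical stand-ins rather than the real HDFS classes; only the phrase "does not match current recovery id" comes from the report.

```java
import java.io.IOException;

/** Toy replay of the stale recovery ID scenario; not the real HDFS code. */
final class RecoveryIdRace {

    /** Hypothetical stand-in for the NameNode's per-block recovery state. */
    static final class ToyNameNode {
        private long generationStamp = 1000; // arbitrary starting value
        private long currentRecoveryId = 0;

        /** Starting a recovery (here: for a truncate) issues a fresh recovery ID. */
        long startRecovery() {
            currentRecoveryId = ++generationStamp;
            return currentRecoveryId;
        }

        /** Rejects a commit that carries a recovery ID which has been superseded. */
        void commitBlockSynchronization(long recoveryId) throws IOException {
            if (recoveryId != currentRecoveryId) {
                throw new IOException("The recovery id " + recoveryId
                        + " does not match current recovery id " + currentRecoveryId);
            }
        }
    }

    /** Replays the reported sequence; returns the commit error, or null on success. */
    static String replay() {
        ToyNameNode nn = new ToyNameNode();
        long firstId = nn.startRecovery(); // job A's truncate starts block recovery
        // DataNode D is still busy (60+ s timeout) when job B's truncate arrives.
        nn.startRecovery();                // job B's truncate issues a new recovery ID
        try {
            nn.commitBlockSynchronization(firstId); // D finally commits with the old ID
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(replay());
    }
}
```

In this model the late commit is always rejected, because the second truncate replaces the recovery ID before DataNode D reports back.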
The DataNode then logs the error "Inconsistent size of finalized replicas".
The related code is in BlockRecoveryWorker.java:
    for (BlockRecord r : syncList) {
      assert r.rInfo.getNumBytes() > 0 : "zero length replica";
      ReplicaState rState = r.rInfo.getOriginalReplicaState();
      if (rState.getValue() < bestState.getValue()) {
        bestState = rState;
      }
      if (rState == ReplicaState.FINALIZED) {
        if (finalizedLength > 0 && finalizedLength != r.rInfo.getNumBytes()) {
          throw new IOException("Inconsistent size of finalized replicas. "
              + "Replica " + r.rInfo + " expected size: " + finalizedLength);
        }
        finalizedLength = r.rInfo.getNumBytes();
      }
    }
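To illustrate why later recovery attempts keep failing, here is a minimal, self-contained sketch of the same length check applied to the replica layout described above. The `Replica` class, the DataNode names, the byte counts (512 and 1024), and the initial value of `finalizedLength` are assumptions made for illustration; only the comparison logic and the message prefix follow the quoted snippet.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Simplified model of the finalized-length check; not the real HDFS classes. */
final class FinalizedLengthCheck {

    /** Hypothetical stand-in for a FINALIZED replica reported by one DataNode. */
    static final class Replica {
        final String dataNode;
        final long numBytes;

        Replica(String dataNode, long numBytes) {
            this.dataNode = dataNode;
            this.numBytes = numBytes;
        }

        @Override
        public String toString() {
            return dataNode + ":" + numBytes;
        }
    }

    /** Returns the common length, or throws if finalized replicas disagree. */
    static long agreeOnFinalizedLength(List<Replica> syncList) throws IOException {
        long finalizedLength = -1; // assumed initial value: no finalized replica seen
        for (Replica r : syncList) {
            if (finalizedLength > 0 && finalizedLength != r.numBytes) {
                throw new IOException("Inconsistent size of finalized replicas. "
                        + "Replica " + r + " expected size: " + finalizedLength);
            }
            finalizedLength = r.numBytes;
        }
        return finalizedLength;
    }

    /** Replica layout after the failed truncate: D is truncated, the others are not. */
    static List<Replica> afterFailedTruncate() {
        return Arrays.asList(
                new Replica("D", 512),    // new (truncated) length, made-up value
                new Replica("E", 1024),   // old length, made-up value
                new Replica("F", 1024));
    }

    public static void main(String[] args) {
        try {
            agreeOnFinalizedLength(afterFailedTruncate());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, any recovery that includes DataNode D together with one of the other replicas hits the exception, which matches the state described in this report.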