[HDFS-3982] report failed replications in DN heartbeat - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.0.2-alpha
Fix Version/s: None
Component/s: datanode
Labels: None

Description

From HDFS-3931:

The test corrupts 2 of the 3 replicas.
The client reports a bad block.
The NN asks a DN to re-replicate, and randomly picks the other corrupt replica as the source.
The DN notices the incoming replica is corrupt and reports it as a bad block, but does not inform the NN that re-replication failed.
The NN keeps the block on pendingReplications.
The BP scanner wakes up on both DNs with corrupt blocks, and both report corruption. The NN reports both as duplicates, one from the client and one from the DN report above.
Since the block is still on pendingReplications, the NN does not schedule another replication.

Todd wrote:
I can think of a few ways to fix this:
...
2) Add a field to the DN heartbeat which reports back a failed replication for a given block. The NN would use this to decrement the pendingReplication count, which would cause a new replication attempt to be made if it was still under-replicated.

This jira tracks implementing the DN heartbeat replication failure report.
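
A minimal sketch of the bookkeeping that option 2 describes may help make the proposal concrete. Everything here is hypothetical and for illustration only: `HeartbeatSketch`, `PendingReplicationSketch`, the `failedReplicationBlockIds` field, and the simplified rule "live + pending < expected" are assumptions, not the actual HDFS classes, heartbeat protocol, or scheduling logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical heartbeat payload carrying the proposed new field. */
final class HeartbeatSketch {
    // Assumed field: IDs of blocks whose replication failed on this DN.
    final List<Long> failedReplicationBlockIds = new ArrayList<>();
}

/** Hypothetical NN-side bookkeeping, loosely modeled on pendingReplications. */
final class PendingReplicationSketch {
    // blockId -> number of replications the NN believes are in flight.
    private final Map<Long, Integer> pending = new HashMap<>();

    /** The NN asks a DN to replicate the block. */
    void scheduleReplication(long blockId) {
        pending.merge(blockId, 1, Integer::sum);
    }

    /** Apply the failed-replication reports carried by one heartbeat. */
    void processHeartbeat(HeartbeatSketch hb) {
        for (long blockId : hb.failedReplicationBlockIds) {
            decrement(blockId);
        }
    }

    /** Drop one in-flight replication; remove the entry when it reaches zero. */
    private void decrement(long blockId) {
        Integer count = pending.get(blockId);
        if (count == null) {
            return; // nothing pending for this block, ignore the report
        }
        if (count <= 1) {
            pending.remove(blockId);
        } else {
            pending.put(blockId, count - 1);
        }
    }

    int pendingCount(long blockId) {
        return pending.getOrDefault(blockId, 0);
    }

    /** Simplified assumption: replicate when live + pending < expected. */
    boolean needsReplication(long blockId, int liveReplicas, int expectedReplicas) {
        return liveReplicas + pendingCount(blockId) < expectedReplicas;
    }
}
```

Under these assumptions, a block with 2 live replicas, 3 expected, and one scheduled replication is not rescheduled while the entry stays pending. Once a heartbeat lists the block in `failedReplicationBlockIds`, `processHeartbeat` clears the pending entry and `needsReplication` returns true again, so a new attempt can be made.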

People

Assignee: Andy Isaacson

Reporter: Andy Isaacson

Votes: 0

Watchers: 10

Dates

Created: 27/Sep/12 02:24

Updated: 27/Sep/12 17:56