Hadoop HDFS / HDFS-900

Corrupt replicas are not tracked correctly through block report from DN

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      This one is tough to describe, but essentially the following order of events occurs:

      1. A client marks one replica of a block as corrupt by reporting it to the NN
      2. Replication is then scheduled to make a new replica of this block
      3. The replication completes, so that there are now 3 good replicas and 1 corrupt replica
      4. The DN holding the corrupt replica sends a block report. Rather than telling this DN to delete the replica, the NN instead marks it as a new good replica of the block, and schedules deletion on one of the good replicas.

      I don't know if this is a data-loss bug in the case of 1 corrupt replica with dfs.replication=2, but it seems feasible. I will attach a debug log with some commentary marked by '============>', plus a unit test patch which I can use to reproduce this behavior reliably. (It's not a proper unit test, just some edits to an existing one to show it.)
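The sequence above can be sketched with a toy NameNode model. This is purely illustrative: `replicasOnDn` and `pendingInvalidates` are invented names standing in for the NN's real block map and invalidation queue, and `processReportBuggy` mimics only the flawed behavior described here, not actual HDFS code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the race; all names here are illustrative, not real HDFS code.
class BlockReportRace {
    // Replicas the NN currently counts as good, per DN.
    static Map<String, Set<Long>> replicasOnDn = new HashMap<>();
    // Replicas queued for deletion, per DN (stand-in for the invalidation queue).
    static Map<String, Set<Long>> pendingInvalidates = new HashMap<>();

    // Flawed report handling: any reported block not already tracked for this
    // DN is accepted as a new good replica, even one queued for deletion there.
    static void processReportBuggy(String dn, long blockId) {
        replicasOnDn.computeIfAbsent(dn, d -> new HashSet<>()).add(blockId);
    }

    public static void main(String[] args) {
        long block = 900L;
        // Steps 1-3: dn1's replica is marked corrupt and queued for
        // invalidation; re-replication puts a good copy on dn2.
        pendingInvalidates.computeIfAbsent("dn1", d -> new HashSet<>()).add(block);
        processReportBuggy("dn2", block);
        // Step 4: dn1 sends a block report before deleting the replica;
        // the NN now counts the corrupt replica as good again.
        processReportBuggy("dn1", block);
        System.out.println(replicasOnDn.get("dn1").contains(block)); // prints true
    }
}
```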

      1. reportCorruptBlock.patch
        2 kB
        Konstantin Shvachko
      2. to-reproduce.patch
        6 kB
        Todd Lipcon
      3. log-commented
        34 kB
        Todd Lipcon

          Activity

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-22-branch #35 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/35/)
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #539 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/539/)
          Todd Lipcon added a comment -

          I'm a little concerned that this wasn't committed with a test. The fix looks good, but manual testing won't prevent a regression.
          Konstantin Shvachko added a comment -

          I just committed this.
          Konstantin Shvachko added a comment -

          test failures:
          TestFileConcurrentReader - HDFS-1401
          TestStorageRestore - HDFS-1496

          test-patch results:

               [exec] -1 overall.  
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
               [exec]                         Please justify why no new tests are needed for this patch.
               [exec]                         Also please list what manual steps were performed to verify this patch.
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec]     +1 system test framework.  The patch passed system test framework compile.
               [exec] ======================================================================
          

          Testing of this patch was done manually and using Todd's utility attached above.

          Jakob Homan added a comment -

          +1
          Konstantin Shvachko added a comment -

          Yes, this is indeed a bug in block report processing. After step 3 in Todd's description the NN has 3 good replicas and one corrupt. The corrupt replica is in recentInvalidatesSet, but not in the DatanodeDescriptor. That is, the replica is scheduled for deletion from the DN. See blockReceived().
          But before it is deleted from the DN, that same DN sends a block report, which contains the replica. DatanodeDescriptor.processReport() treats it as a new replica, because it is not in the DatanodeDescriptor, and as a good one, since its blockId, generationStamp, and length are in order.
          The fix is to ignore replicas that are scheduled for deletion from this DN.
          I tested this patch with the test case attached by Todd, thanks. The test passes with the fix and fails without it.
          The test case is not exactly a unit test, as it introduces changes to the FSNamesystem class for testing. So I did not include it in the patch.
          Todd, is it possible to convert your case into a real unit test?
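The fix described in this comment — skip any reported replica that is already queued for deletion on the reporting DN — can be sketched as follows. The names (`replicasOnDn`, `pendingInvalidates`, `processReport`) are invented for illustration; this is not the committed patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the fix; names are placeholders, not the real patch.
class BlockReportFix {
    // Replicas the NN currently counts as good, per DN.
    static Map<String, Set<Long>> replicasOnDn = new HashMap<>();
    // Stand-in for recentInvalidatesSet: replicas queued for deletion, per DN.
    static Map<String, Set<Long>> pendingInvalidates = new HashMap<>();

    // Fixed report handling: a reported replica that is already scheduled for
    // deletion on this DN is ignored instead of being re-added as good.
    static void processReport(String dn, long blockId) {
        Set<Long> toDelete = pendingInvalidates.get(dn);
        if (toDelete != null && toDelete.contains(blockId)) {
            return; // queued for invalidation; do not count it as a good replica
        }
        replicasOnDn.computeIfAbsent(dn, d -> new HashSet<>()).add(blockId);
    }

    public static void main(String[] args) {
        long block = 900L;
        pendingInvalidates.computeIfAbsent("dn1", d -> new HashSet<>()).add(block);
        processReport("dn1", block); // report arrives before the deletion happens
        System.out.println(replicasOnDn.containsKey("dn1")); // prints false
    }
}
```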

          Todd Lipcon added a comment -

          In discussion with Nigel, we'd like to mark this as a blocker pending further investigation. If we determine it's not a regression since 0.20, we'll downgrade the priority.
          Todd Lipcon added a comment -

          Yes, I believe this reproduces in 0.21 and 0.22. (The to-reproduce.patch was against trunk at the time, though it is likely out of date now.)
          Thanh Do added a comment -

          Todd, do you see this on 0.21.0?
          Is this a bug in the NN's handling of corrupt replicas?

            People

            • Assignee: Konstantin Shvachko
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 5
