Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- Fix Version: 0.22.0
- Component: None
- Labels: None
- Hadoop Flags: Reviewed
Description
This one is tough to describe, but essentially the following order of events is seen to occur:
- A client marks one replica of a block to be corrupt by telling the NN about it
- Replication is then scheduled to make a new replica of this block
- The replication completes, such that there are now 3 good replicas and 1 corrupt replica
- The DN holding the corrupt replica sends a block report. Rather than telling this DN to delete the replica, the NN instead marks it as a new good replica of the block, and schedules deletion on one of the good replicas.
I don't know whether this is a data-loss bug in the case of 1 corrupt replica with dfs.replication=2, but it seems feasible. I will attach a debug log with some commentary marked by '============>', plus a unit test patch which I can use to reproduce this behavior reliably. (It's not a proper unit test, just some edits to an existing one to show the behavior.)
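The sequence above can be sketched as a small state-machine test. This is a hypothetical, simplified model of the NN-side bookkeeping, not the actual HDFS code: the class and method names (`BlockState`, `markCorrupt`, `processBlockReport`, `goodReplicas`, `corruptReplicas`) are illustrative only. The point it demonstrates is that a block report from a DN whose replica was previously marked corrupt must result in an invalidation, not in the replica being counted as good again.

```java
import java.util.*;

// Hypothetical, simplified sketch of NN-side replica bookkeeping.
// Not the real HDFS BlockManager; names are illustrative.
class BlockState {
    final Set<String> goodReplicas = new HashSet<>();
    final Set<String> corruptReplicas = new HashSet<>();

    // A client reports the replica on datanode `dn` as corrupt.
    void markCorrupt(String dn) {
        goodReplicas.remove(dn);
        corruptReplicas.add(dn);
    }

    // A DN sends a block report claiming it holds this block.
    // The buggy behavior described above effectively treats the replica
    // as good unconditionally; the check against corruptReplicas here is
    // what prevents the corrupt copy from being resurrected.
    String processBlockReport(String dn) {
        if (corruptReplicas.contains(dn)) {
            return "invalidate"; // tell the DN to delete its copy
        }
        goodReplicas.add(dn);
        return "accepted";
    }
}

public class CorruptReplicaSketch {
    public static void main(String[] args) {
        BlockState b = new BlockState();
        b.goodReplicas.addAll(Arrays.asList("dn1", "dn2", "dn3"));

        b.markCorrupt("dn3");                            // client flags dn3's replica
        b.processBlockReport("dn4");                     // re-replication lands on dn4
        // dn3's later block report must not make its replica good again:
        System.out.println(b.processBlockReport("dn3")); // prints "invalidate"
        System.out.println(b.goodReplicas.size());       // prints 3
    }
}
```

With the corrupt-replica check in place, the final state is three good replicas (dn1, dn2, dn4) and a pending invalidation for dn3; without it, dn3 would rejoin the good set and the NN would delete one of the genuinely good replicas, as described above.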
Attachments
Issue Links
- relates to HDFS-875: NameNode incorrectly handles corrupt replicas (Resolved)