
[HDFS-387] Corrupted blocks leading to job failures


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      On one of our clusters we ended up with 11 singly-replicated corrupted blocks (checksum errors), so jobs were failing because no live replicas of those blocks were available.

      fsck reports the system as healthy, although it is not.

      I argue that fsck should have an option to verify that under-replicated blocks are actually readable, i.e., that their remaining replicas pass checksum verification, as sketched below.
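
      As a concrete illustration of the kind of check I mean, here is a minimal client-side sketch (the class name is made up; this is not existing fsck functionality). It reads each candidate file end-to-end through the public FileSystem API, which forces client-side checksum verification of every block; the file list is assumed to come from fsck's under-replicated output.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.ChecksumException;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class BlockReadCheck {
          public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            byte[] buf = new byte[64 * 1024];
            for (String name : args) {      // candidate files, e.g. from fsck output
              Path path = new Path(name);
              try (FSDataInputStream in = fs.open(path)) {
                while (in.read(buf) != -1) {
                  // discard the data; reading it is enough to trigger
                  // client-side checksum verification of each block
                }
                System.out.println("OK      " + path);
              } catch (ChecksumException e) {
                System.out.println("CORRUPT " + path + " near offset " + e.getPos());
              }
            }
          }
        }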

      Even better, the namenode should automatically check under-replicated blocks with repeated replication failures for corruption and list them somewhere on the GUI. And for checksum errors, there should be an option to accept the block contents as-is and recompute the checksums.
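
      The "recompute the checksums" part can at least be approximated from the client side today, assuming the FileSystem implementation honors setVerifyChecksum(). The sketch below (class name and argument handling are hypothetical) re-reads a corrupt file with checksum verification switched off and copies the salvaged bytes to a new file, which gets fresh checksums computed on write:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class SalvageCorruptFile {
          public static void main(String[] args) throws IOException {
            Path src = new Path(args[0]);   // corrupt source file
            Path dst = new Path(args[1]);   // salvage destination
            FileSystem fs = FileSystem.get(new Configuration());
            fs.setVerifyChecksum(false);    // accept possibly damaged bytes on read
            try (FSDataInputStream in = fs.open(src);
                 FSDataOutputStream out = fs.create(dst)) {
              // the write path computes new checksums over whatever we read
              IOUtils.copyBytes(in, out, 64 * 1024, false);
            }
          }
        }

      Note that this accepts the damaged bytes as-is: it makes the file readable again, it does not repair its contents.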

      Question: Is it at all probable that two or more replicas of a block have checksum errors? If not, then we could restrict the checking to singly-replicated blocks.
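
      As a back-of-envelope answer (the per-replica corruption probability p is an assumed figure, not a measurement): if replicas become corrupt independently with probability p, then

        P(all r replicas corrupt) = p^r
        e.g. p = 10^-6, r = 3  =>  p^3 = 10^-18

      so with the default replication factor the all-replicas-corrupt case should be vanishingly rare, and restricting the check to singly- (or at most doubly-) replicated blocks looks reasonable.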

    Attachments

    Issue Links

    Activity

    People

      Assignee: Unassigned
      Reporter: Christian Kunz (ckunz)
      Votes: 0
      Watchers: 2

    Dates

      Created:
      Updated:
      Resolved: