[HDFS-1382] A transient failure with edits log and a corrupted fstime together could lead to a data loss - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: namenode
Labels:
None

Description

We experienced data loss caused by two failures occurring together.
One is a transient disk failure while writing the edits log; the other is a corrupted fstime.

Here are the details:

1. The NameNode has two edits directories (say edit0 and edit1).

2. During an update to edit0, there is a transient disk failure,
which makes the NameNode bump the fstime, mark edit0 as stale,
and continue working with edit1 only.

3. The NameNode is shut down. Unluckily, the fstime in edit0
is now corrupted. Hence, during NameNode startup, the stale log in edit0
is replayed, and the edits recorded only in edit1 are lost.
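
The sequence above can be sketched as a small simulation. This is only a toy model under stated assumptions, not the actual NameNode code: the names EditsDir, log_edit, recover and simulate are hypothetical, fstime is modeled as a plain integer, and startup is assumed to replay the log from whichever directory reports the largest fstime.

```python
from dataclasses import dataclass, field


@dataclass
class EditsDir:
    """Hypothetical model of one edits directory: a log plus its fstime stamp."""
    name: str
    fstime: int = 0
    log: list = field(default_factory=list)
    stale: bool = False  # in-memory flag only; it does not survive a restart


def log_edit(dirs, op, failing=()):
    """Append op to every healthy directory.

    Directories named in `failing` hit a transient write error: they are
    marked stale and the fstime of the remaining directories is bumped.
    """
    failed = [d for d in dirs if not d.stale and d.name in failing]
    for d in failed:
        d.stale = True
    healthy = [d for d in dirs if not d.stale]
    for d in healthy:
        d.log.append(op)
    if failed:
        new_time = max(d.fstime for d in dirs) + 1
        for d in healthy:
            d.fstime = new_time


def recover(dirs):
    """Assumed startup rule: replay the log of the directory with the largest fstime."""
    return max(dirs, key=lambda d: d.fstime).log


def simulate():
    """Run steps 1-3 and return the recovered log without and with corruption."""
    edit0, edit1 = EditsDir("edit0"), EditsDir("edit1")
    dirs = [edit0, edit1]
    log_edit(dirs, "mkdir /a")                         # written to both
    log_edit(dirs, "create /a/f", failing={"edit0"})   # transient failure on edit0
    log_edit(dirs, "create /a/g")                      # written to edit1 only
    clean = list(recover(dirs))                        # restart with intact fstime
    edit0.fstime = 2**31                               # corrupted fstime in edit0
    corrupted = list(recover(dirs))                    # restart with corrupted fstime
    return clean, corrupted
```

In this model, a restart with an intact fstime picks edit1 and recovers all three operations, while a restart after the corruption picks edit0 and silently drops the two operations that were logged only to edit1. The corrupted value is chosen to be larger than edit1's fstime purely for illustration; the report does not say what value the corrupted fstime took.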

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and
Haryadi Gunawi (haryadi@eecs.berkeley.edu)