HDFS-903: NN should verify images and edit logs on startup

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Store fsimage MD5 checksum in VERSION file. Validate checksum when loading a fsimage. Layout version bumped.

      Description

      I was playing around with corrupting fsimage and edits logs when there are multiple dfs.name.dirs specified. I noticed that:

      • As long as your corruption does not make the image invalid (e.g. changing an opcode to an invalid opcode), HDFS doesn't notice and happily uses the corrupt image or applies the corrupt edit.
      • If the first image in dfs.name.dir is "valid", it replaces the copies in the other name.dirs with this first image, even if they differ. So if the first image actually contains invalid/old/corrupt metadata, you've lost your valid metadata, which can result in data loss if the namenode garbage collects blocks that it thinks are no longer used.

      How about we maintain a checksum as part of the image and edit log, check it on startup, and refuse to start if it doesn't match? Or at least provide a configuration option to do so, if people are worried about the overhead of maintaining checksums for these files. Even if we assume dfs.name.dir is reliable storage, this guards against operator error.
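
      The proposal can be sketched end-to-end as below. This is an illustrative standalone sketch, not HDFS code: the `imageMD5Digest` property name, the file layout, and the method names are assumptions for the example.

      ```java
      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.security.DigestOutputStream;
      import java.security.MessageDigest;

      public class ImageChecksum {
          static String hex(byte[] d) {
              StringBuilder sb = new StringBuilder();
              for (byte b : d) sb.append(String.format("%02x", b));
              return sb.toString();
          }

          // Compute the MD5 while writing the image, then persist the hex
          // digest in a VERSION-style sidecar file next to it.
          static void saveImage(Path image, Path version, byte[] contents) throws Exception {
              MessageDigest md5 = MessageDigest.getInstance("MD5");
              try (DigestOutputStream out =
                       new DigestOutputStream(Files.newOutputStream(image), md5)) {
                  out.write(contents);
              }
              Files.write(version, ("imageMD5Digest=" + hex(md5.digest())).getBytes());
          }

          // On startup, recompute the digest and refuse to load on mismatch.
          static byte[] loadImage(Path image, Path version) throws Exception {
              String stored = new String(Files.readAllBytes(version)).trim()
                                  .replace("imageMD5Digest=", "");
              byte[] contents = Files.readAllBytes(image);
              String actual = hex(MessageDigest.getInstance("MD5").digest(contents));
              if (!actual.equals(stored))
                  throw new IOException("fsimage checksum mismatch: " + actual + " != " + stored);
              return contents;
          }

          public static void main(String[] args) throws Exception {
              Path dir = Files.createTempDirectory("nn");
              Path img = dir.resolve("fsimage"), ver = dir.resolve("VERSION");
              saveImage(img, ver, "abc".getBytes());
              // An intact image round-trips; a flipped byte would throw above.
              System.out.println(new String(loadImage(img, ver)));
          }
      }
      ```

      A corrupted image (or a stale copy in another name.dir) would fail the digest comparison instead of being silently loaded.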

      1. trunkChecksumImage.patch
        19 kB
        Hairong Kuang
      2. trunkChecksumImage1.patch
        23 kB
        Hairong Kuang
      3. trunkChecksumImage2.patch
        22 kB
        Hairong Kuang
      4. trunkChecksumImage3.patch
        24 kB
        Hairong Kuang
      5. trunkChecksumImage4.patch
        24 kB
        Hairong Kuang

        Issue Links

          Activity

          Eli Collins created issue -
          Allen Wittenauer added a comment -

          One of the discussions we had about 2 years ago was keeping minimum 3 versions of the fsimage and edits file so that we could do parity checking.

          Jeff Hammerbacher made changes -
          Field Original Value New Value
          Link This issue relates to HDFS-1382 [ HDFS-1382 ]
          Hairong Kuang made changes -
          Link This issue relates to HDFS-1458 [ HDFS-1458 ]
          Hairong Kuang added a comment -

          Any progress on this?

          If this is done, HDFS-1458 could check whether the primary NN has changed its image by simply fetching the primary NN's image checksum and comparing it to the 2nd NN's image checksum.

          Eli Collins added a comment -

          Hey Hairong,

          I haven't had a chance to work on this yet, feel free to grab it. Agree this would work well with HDFS-1458.

          Thanks,
          Eli

          Hairong Kuang made changes -
          Assignee Eli Collins [ eli ] Hairong Kuang [ hairong ]
          Hairong Kuang made changes -
          Fix Version/s 0.22.0 [ 12314241 ]
          Hairong Kuang added a comment -

          Here is the plan:
          1. Generate a MD5 Digest when saving an image;
          2. Store MD5 Digest hash in Version file;
          3. When loading an image, generate an MD5 Digest as well and then compare it to the one stored in the VERSION file.

          Hairong Kuang added a comment -

          Do we want to provide a configuration option for checksumming the image or not? I tend to say no. But does anybody have a use case where you would not want the image to be checksummed, other than the performance concern?

          Hairong Kuang added a comment -

          Thought more about this. Instead of storing the image's MD5 digest in the VERSION file, I am thinking of storing it at the end of the image file. The advantage is that the image is self-contained. We would not need to worry about atomicity when rolling images.

          Todd Lipcon added a comment -

          I think it's better to put it in the VERSION file, since then you can use the command-line "md5sum" utility to check for corruption.

          The atomicity issue is a little annoying, I agree. I can't think of a good short-term solution... HDFS-1073 will help with that, at least.

          Konstantin Shvachko added a comment -

          Hairong, we should also add the image MD5 digest into CheckpointSignature. I believe it will be easier to do that if the digest is in VERSION. Also until VERSION is written it does not matter what the image state is. If VERSION is written then the MD5 is there so everything is consistent.
          Would you consider adding MD5 into CheckpointSignature?

          Hairong Kuang added a comment -

          @Todd, you made a valid point.

          @Konstantin, sooo good to see that you are back! Yes, I will add MD5 into CheckpointSignature so that HDFS-1458 can happen.

          dhruba borthakur added a comment -

          I agree with Konstantin/Hairong that the MD5 signature should be part of the CheckpointSignature.

          It would have been nice if the contents of the VERSION file were stored as a header record at the beginning of the fsimage file itself (I now remember the initial reason why the VERSION file exists separate from the fsimage: the datanode needs the VERSION file too for its block directories, and the datanode does not have an fsimage file). Given that, it should be fine to store the checksum in the VERSION file. Also, the algorithm to compute the checksum need not be configurable; it could be hardcoded to generate an MD5 checksum.

          Allen Wittenauer added a comment -

          > I think it's better to put in VERSION file since then you can use a command line "md5sum" utility to check for corruption.

          +1

          This is much more operations friendly. If an alternative is picked (which is fine), just keep in mind we'll need a tool built to go with this change.

          Suresh Srinivas added a comment -

          > It would have been nice if the contents of the VERSION file was stored as a header record in the beginning of the fsimage file itself
          Currently VERSION creation signals the end of snapshot creation, independent of fsimage and edits creation. Moving VERSION into fsimage would complicate the current design.

          Hairong Kuang made changes -
          Link This issue is blocked by HADOOP-7009 [ HADOOP-7009 ]
          Hairong Kuang added a comment -

          Here comes the patch.

          1. compute the MD5 checksum when saving an image;
          2. an image's checksum is stored in memory and persisted on disk in VERSION;
          3. loading an image does checksum verification;
          4. between the primary & secondary NN, the checksum is transferred as part of CheckpointSignature.

          I also added a unit test for this feature.

          Hairong Kuang made changes -
          Attachment trunkChecksumImage.patch [ 12458117 ]
          Konstantin Shvachko added a comment -

          The image part of the patch looks good. I liked the seamless calculation of the checksum with DigestInputStream.
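
          The single-pass load-and-digest pattern praised here can be illustrated roughly as follows. This is a standalone sketch, not the patch's actual code; the class and method names are made up for the example.

          ```java
          import java.io.ByteArrayInputStream;
          import java.io.InputStream;
          import java.security.DigestInputStream;
          import java.security.MessageDigest;

          public class Md5OnTheFly {
              // Wrap the image stream in a DigestInputStream so the MD5
              // accumulates as the loader consumes the bytes: one pass over
              // the file, no separate checksumming read.
              static String readAndDigest(InputStream raw) throws Exception {
                  MessageDigest md5 = MessageDigest.getInstance("MD5");
                  try (DigestInputStream in = new DigestInputStream(raw, md5)) {
                      byte[] buf = new byte[8192];
                      while (in.read(buf) != -1) { /* the loader would parse image records here */ }
                  }
                  StringBuilder sb = new StringBuilder();
                  for (byte b : md5.digest()) sb.append(String.format("%02x", b));
                  return sb.toString();
              }

              public static void main(String[] args) throws Exception {
                  // MD5("abc") is the RFC 1321 test vector
                  System.out.println(readAndDigest(new ByteArrayInputStream("abc".getBytes("UTF-8"))));
                  // prints 900150983cd24fb0d6963f7d28e17f72
              }
          }
          ```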

          The checkpoint part is a bit inconsistent.
          The CheckpointSignature is an invariant of the current checkpoint process. This is the way for the NN to recognize that it is still talking with the same checkpointer that started the process. Therefore, we should never alter the signature within the same checkpoint process.
          We should probably make the JavaDoc for CheckpointSignature more explicit on that.

          Current implementation is inconsistent in two ways.
          On the one hand, rollImage for SNN sends back a new checkpoint signature with the digest of the new image. This value is recorded by the NN as the new digest. Instead

          1. The SNN should send back the digest of the original (old) image, and the NN should verify it against the original signature by calling validateStorageInfo(), as it is done in endCheckpoint().
          2. The NN should calculate the digest by itself not relying on the value passed by SNN.
            It could probably be done while uploading the image from SNN.

          On the other hand, the BN/CN sends back the original checkpoint signature with the old image digest. And the NN records it as the new digest, which should lead to an error during reloading. Again the NN should always calculate the digest by itself, it should be the only authority for its own image.

          A couple of nits:

          • At the end of FSImage.loadFSImage(File) the same LOG message is printed twice.
          • Empty line added in FSImage.resetVersion()
          Hairong Kuang added a comment -

          Konstantin, thank you so much for your review comments.

          > The NN should calculate the digest by itself not relying on the value passed by SNN. It could probably be done while uploading the image from SNN.
          As I pointed out in HDFS-1382, an image on disk or in transmission might get corrupted. So calculating the checksum while uploading the image from the SNN is not reliable at all. The NN should depend on the SNN to get the new image's checksum.

          > the NN records it as the new digest, which should lead to an error during reloading.
          I did not get this point. Why?

          Hairong Kuang added a comment -

          Oops, I got the wrong jira number. It should be "as I pointed out in HDFS-1481".

          Hairong Kuang made changes -
          Link This issue blocks HDFS-1481 [ HDFS-1481 ]
          Hairong Kuang added a comment -

          I forgot to mention that I agree with comment 1. I will change rollFsImage to have two parameters, one representing the old checkpoint signature and the other representing the new image signature.

          Hairong Kuang added a comment -

          This patch added the old checkpoint signature as a parameter to rollFsImage. Similarly I added the new image signature as a parameter to endCheckpoint. Konstantin, does it make sense?

          I also removed the redundant logging and the unnecessary blank line.

          Hairong Kuang made changes -
          Attachment trunkChecksumImage1.patch [ 12458579 ]
          Konstantin Shvachko added a comment -

          I didn't know you had already discussed it. I agree the image can be corrupted during transmission.
          It seems logical to include the verification logic in the transmission process. That is, the SNN sends the checksum via the servlet, then the NN uploads the image, calculates the checksum of the downloaded bytes on the fly, and matches it against the one sent by the SNN. The checksum verification can be done by validateCheckpointUpload(), which is already there and just needs to be extended.
          I don't think it would be a good idea to separate the upload and the verification, which is unavoidable if you first upload, then send the checksum via rollFSImage() and verify inside.
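
          The transfer-time verification described here might look roughly like the sketch below. The class, method name, and parameter handling are assumptions for illustration, not the actual servlet code: the sender supplies the expected checksum up front, and the receiver digests the stream while downloading and rejects the image on mismatch.

          ```java
          import java.io.ByteArrayInputStream;
          import java.io.IOException;
          import java.io.InputStream;
          import java.security.DigestInputStream;
          import java.security.MessageDigest;

          public class TransferVerify {
              static String hex(byte[] d) {
                  StringBuilder sb = new StringBuilder();
                  for (byte b : d) sb.append(String.format("%02x", b));
                  return sb.toString();
              }

              // Digest the uploaded bytes as they stream in, then compare
              // against the checksum the sender passed alongside the upload.
              static void receiveImage(InputStream upload, String expectedMd5) throws Exception {
                  MessageDigest md5 = MessageDigest.getInstance("MD5");
                  try (DigestInputStream in = new DigestInputStream(upload, md5)) {
                      byte[] buf = new byte[8192];
                      while (in.read(buf) != -1) { /* the NN would write bytes to fsimage.ckpt here */ }
                  }
                  String actual = hex(md5.digest());
                  if (!actual.equals(expectedMd5))
                      throw new IOException("image corrupted in transit: " + actual);
              }

              public static void main(String[] args) throws Exception {
                  byte[] image = "abc".getBytes("UTF-8");
                  // MD5("abc"), the RFC 1321 test vector
                  receiveImage(new ByteArrayInputStream(image),
                               "900150983cd24fb0d6963f7d28e17f72");
                  System.out.println("verified");
              }
          }
          ```

          Doing the check inside the transfer keeps upload and verification in one step, which is the point being made above.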

          Hairong Kuang added a comment -

          > It seems logical if we include the verification logic in the transmission process.
          This seems good to me. This was one of the options that I discussed with Dhruba for HDFS-1481.

          Probably in this jira, I will send the checksum together while uploading image. Then in HDFS-1481, I will add the verification part.

          Konstantin Shvachko added a comment -

          Sounds like a plan.

          Hairong Kuang added a comment -

          This patch addressed Konstantin's review comment.

          Hairong Kuang made changes -
          Attachment trunkChecksumImage2.patch [ 12458800 ]
          Konstantin Shvachko added a comment -

          Do you also need to set "&newChecksum=" in Checkpointer.uploadCheckpoint()? Should be the same as in SNN.putFSImage().

          Hairong Kuang added a comment -

          This patch made the change in Checkpointer as Konstantin suggested. Actually TestBackupNode caught this.

          The patch also fixed a subtle bug in TestSaveNameSpace caused by using spy. A spied object does only a shallow copy of the original object. So when a new checksum is generated while saving the image to disk, the new value is set in spyImage, but when saving the signature into the VERSION file using StorageDirectory, it uses the value set in originalImage, so reloading the image would fail. I fixed it by explicitly setting the storage directories in spyImage.

          Hairong Kuang made changes -
          Attachment trunkChecksumImage3.patch [ 12458933 ]
          Hairong Kuang added a comment -

          antPatch.sh passed except for:
          [exec] -1 release audit. The applied patch generated 97 release audit warnings (more than the trunk's current 1 warning).
          The release warnings are all about license headers, but my patch does not add any new files. I do think there is a bug in the script.

          All unit tests passed except for the known failures.

          Hairong Kuang added a comment -

          Fixed one more failed unit test.

          Hairong Kuang made changes -
          Attachment trunkChecksumImage4.patch [ 12459026 ]
          Konstantin Shvachko added a comment -

          +1 Looks good.

          Hairong Kuang added a comment -

          I've just committed this!

          Hairong Kuang made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags [Incompatible change, Reviewed]
          Release Note Store fsimage MD5 checksum in VERSION file. Validate checksum when loading a fsimage. Layout version bumped.
          Resolution Fixed [ 1 ]
          Konstantin Boudnik made changes -
          Link This issue is related to HDFS-1496 [ HDFS-1496 ]
          Todd Lipcon made changes -
          Link This issue breaks HDFS-1500 [ HDFS-1500 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to HDFS-1602 [ HDFS-1602 ]
          SreeHari added a comment -

          With this change, the Backupnode downloads the image & edits files from the namenode every time, since a difference in checkpoint time is always maintained between the Namenode and the Backupnode. This happens because the Namenode resets its checkpoint time every time: renewCheckpointTime is ignored and true is passed explicitly to rollFsImage during endCheckpoint. Isn't this a problem, or am I missing something?

          Konstantin Shvachko made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Konstantin Shvachko made changes -
          Link This issue contains HDFS-52 [ HDFS-52 ]
          Transitions:
          • Open → Resolved: 296d 7h 25m in source status, 1 execution, Hairong Kuang, 08/Nov/10 06:52
          • Resolved → Closed: 398d 23h 26m in source status, 1 execution, Konstantin Shvachko, 12/Dec/11 06:19

            People

            • Assignee:
              Hairong Kuang
              Reporter:
              Eli Collins
            • Votes:
              0
              Watchers:
              16
