[HDFS-4811] race condition between 2 namenodes in standby that are trying to checkpoint with one another can delete or corrupt a good fsimage - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.1.0-beta, 3.0.0-alpha1
Fix Version/s: None
Component/s: ha
Labels: None
Target Version/s: 2.1.0-beta

Description

The problem occurs when a standby namenode runs its own checkpoint in StandbyCheckpointer (thread 1) while concurrently receiving a checkpoint from a different namenode through GetImageServlet (thread 2). Thread 2 can finish writing the checkpoint to the directory but get suspended before it renames the file to its final destination as an fsimage file. Thread 1 then wakes up and starts writing its own data to the same checkpoint file. When thread 2 resumes, it tries to rename the file that thread 1 still holds open for writing. Depending on the OS, this either moves thread 1's incomplete checkpoint to fsimage, or outright deletes the existing good fsimage, leaving no good image in place until thread 1 finishes writing and renames.
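The interleaving described above can be replayed deterministically in a small sketch. This is not Hadoop code and not the change made under HDFS-3519: the class name, the file names `fsimage.ckpt` and `fsimage`, and the method names are all hypothetical. The first method replays the three steps in order on a single thread and models only the outcome where the incomplete checkpoint ends up as fsimage. The second method shows one possible mitigation, assuming both writers live in the same JVM: give each writer its own temporary file and serialize the write-and-rename step under a lock.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; none of these names come from Hadoop.
final class CheckpointRaceSketch {
    static final String TMP_NAME = "fsimage.ckpt";  // assumed shared temp name
    static final String FINAL_NAME = "fsimage";     // assumed final image name

    private static final Object CHECKPOINT_LOCK = new Object();

    /**
     * Replays the problematic interleaving sequentially and returns the
     * content that ends up in the final image file.
     */
    static String replayUnsafeInterleaving(Path dir) throws IOException {
        Path tmp = dir.resolve(TMP_NAME);
        Path fsimage = dir.resolve(FINAL_NAME);

        // Thread 2 (upload) finishes writing a complete checkpoint.
        Files.write(tmp, "COMPLETE-UPLOADED-IMAGE".getBytes(StandardCharsets.UTF_8));

        // Thread 2 is suspended. Thread 1 (local checkpoint) opens the same
        // path, which truncates it, and has written only part of its image.
        Files.write(tmp, "PARTIAL".getBytes(StandardCharsets.UTF_8));

        // Thread 2 resumes and renames whatever is now at the shared path.
        Files.move(tmp, fsimage, StandardCopyOption.REPLACE_EXISTING);

        return new String(Files.readAllBytes(fsimage), StandardCharsets.UTF_8);
    }

    /**
     * One possible mitigation: a unique temp file per writer, with the
     * write and rename performed while holding a single lock.
     */
    static void saveCheckpoint(Path dir, byte[] image) throws IOException {
        synchronized (CHECKPOINT_LOCK) {
            Path tmp = Files.createTempFile(dir, "fsimage.ckpt_", ".tmp");
            Files.write(tmp, image);
            Files.move(tmp, dir.resolve(FINAL_NAME), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ckpt-sketch");
        System.out.println("unsafe result: " + replayUnsafeInterleaving(dir));
        saveCheckpoint(dir, "GOOD-IMAGE".getBytes(StandardCharsets.UTF_8));
        System.out.println("safe result: "
            + new String(Files.readAllBytes(dir.resolve(FINAL_NAME)), StandardCharsets.UTF_8));
    }
}
```

In the replay, the final image holds thread 1's partial data rather than the complete uploaded checkpoint. The lock in `saveCheckpoint` only helps when both writers share a process; the unique temp file is what keeps one writer from truncating another's output.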

Issue Links

duplicates

HDFS-3519 Checkpoint upload may interfere with a concurrent saveNamespace

Closed

relates to

HDFS-3602 Enhancements to HDFS for Windows Server and Windows Azure development and runtime environments

Resolved

People

Assignee: Unassigned

Reporter: Chris Nauroth

Votes: 0

Watchers: 6

Dates

Created: 09/May/13 21:39

Updated: 12/May/16 18:12

Resolved: 09/May/13 22:43