[HDFS-3955] QJM: Make acceptRecovery() atomic - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: QuorumJournalManager (HDFS-3077)
Fix Version/s: QuorumJournalManager (HDFS-3077)
Component/s: ha
Labels: None
Target Version/s: QuorumJournalManager (HDFS-3077)
Hadoop Flags: Reviewed

Description

Per one of the TODOs in Journal.java, the acceptRecovery() code path is currently not atomic. In particular, the following two actions are executed non-atomically:

1. Download a new edits_inprogress_N from some other node.
2. Persist the Paxos recovery file to disk.

If the JN crashes between these two steps, it may be left in a state in which the edits_inprogress file holds different data than the Paxos data left on disk by a previous recovery attempt. The next prepareRecovery() then fails with an AssertionError.
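
To make the crash window concrete, here is a minimal sketch that models the two on-disk files with plain text files. It is not the code in Journal.java and not necessarily what the attached patch does. The class name, file names, and the idea of storing the segment contents in the Paxos record are all assumptions made for illustration. It contrasts the two-step order described above with one common way to close such a window: stage the download under a temporary name, persist the Paxos record, then rename the staged file into place, and finish or discard the staged file on restart.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch only; names and file layout are assumptions.
class AcceptRecoverySketch {
    /** Stands in for the JN process dying at the injected point. */
    static class SimulatedCrash extends RuntimeException {}

    static final String SEGMENT = "edits_inprogress_1";
    static final String STAGED = "edits_inprogress_1.tmp";
    static final String PAXOS = "paxos_1";

    // Individual file writes are treated as atomic here for simplicity.
    static void write(Path p, String s) throws IOException {
        Files.write(p, s.getBytes(StandardCharsets.UTF_8));
    }

    static String read(Path p) throws IOException {
        return Files.exists(p)
            ? new String(Files.readAllBytes(p), StandardCharsets.UTF_8)
            : "";
    }

    /** The order described in the issue: replace the segment, then persist. */
    static void acceptNonAtomic(Path dir, String data, boolean crash) throws IOException {
        write(dir.resolve(SEGMENT), data);          // step 1: download
        if (crash) throw new SimulatedCrash();      // crash window
        write(dir.resolve(PAXOS), data);            // step 2: persist Paxos record
    }

    /** Staged variant: the live segment changes only via a final rename. */
    static void acceptStaged(Path dir, String data, boolean crashBeforeRename) throws IOException {
        write(dir.resolve(STAGED), data);
        write(dir.resolve(PAXOS), data);
        if (crashBeforeRename) throw new SimulatedCrash();
        // Replacing an existing target with ATOMIC_MOVE is implementation
        // specific; it holds on POSIX file systems, where this is rename(2).
        Files.move(dir.resolve(STAGED), dir.resolve(SEGMENT), StandardCopyOption.ATOMIC_MOVE);
    }

    /** On restart, finish a rename that the Paxos record vouches for, else discard. */
    static void recoverOnStartup(Path dir) throws IOException {
        Path staged = dir.resolve(STAGED);
        if (!Files.exists(staged)) return;
        if (read(staged).equals(read(dir.resolve(PAXOS)))) {
            Files.move(staged, dir.resolve(SEGMENT), StandardCopyOption.ATOMIC_MOVE);
        } else {
            Files.delete(staged);
        }
    }

    /** The invariant the next recovery attempt relies on. */
    static boolean isConsistent(Path dir) throws IOException {
        return read(dir.resolve(SEGMENT)).equals(read(dir.resolve(PAXOS)));
    }

    /** Runs one earlier recovery, crashes during a second, restarts, and checks. */
    static boolean survivesCrash(boolean staged) {
        try {
            Path dir = Files.createTempDirectory("jn-sketch");
            acceptNonAtomic(dir, "old", false);     // earlier, completed recovery
            try {
                if (staged) acceptStaged(dir, "new", true);
                else acceptNonAtomic(dir, "new", true);
            } catch (SimulatedCrash expected) {
                // the process "restarts" below
            }
            recoverOnStartup(dir);
            return isConsistent(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("non-atomic order consistent after crash: " + survivesCrash(false));
        System.out.println("staged order consistent after crash: " + survivesCrash(true));
    }
}
```

In this toy model the staged order stays consistent because the live segment is never touched until the Paxos record already describes the new data, so a restart can always either complete or discard the pending rename.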

I discovered this by randomly injecting a fault between the two steps, and then running the randomized fault test on a cluster. This resulted in some AssertionErrors in the test logs.
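
The fault-injection approach can be sketched roughly as follows. This is an in-memory toy, not the actual QJM test code: the `FaultInjector` interface, the `beforePersistPaxosData` hook name, and the `ToyJournal` class are hypothetical stand-ins. The injector throws at a hook placed between the two steps, and a loop counts how often the journal is left in an inconsistent state.

```java
import java.util.Random;

// Hypothetical sketch only; none of these names come from the Hadoop code base.
class FaultInjectionSketch {
    /** Hook called between the two steps; a no-op outside of tests. */
    interface FaultInjector {
        void beforePersistPaxosData();
    }

    /** Returns an injector that throws at the hook with the given probability. */
    static FaultInjector randomFaults(long seed, double probability) {
        Random rng = new Random(seed);
        return () -> {
            if (rng.nextDouble() < probability) {
                throw new IllegalStateException("injected fault");
            }
        };
    }

    /** In-memory stand-in for a JournalNode's on-disk state. */
    static class ToyJournal {
        String segment = "";      // stands in for edits_inprogress_N
        String paxosRecord = "";  // stands in for the persisted Paxos data

        void acceptRecovery(String newSegment, FaultInjector injector) {
            segment = newSegment;               // step 1: download
            injector.beforePersistPaxosData();  // fault window
            paxosRecord = newSegment;           // step 2: persist
        }

        /** The condition the next recovery attempt depends on. */
        boolean isConsistent() {
            return segment.equals(paxosRecord);
        }
    }

    /** Runs many recoveries under the injector and counts inconsistent outcomes. */
    static int countInconsistentRuns(int runs, FaultInjector injector) {
        int inconsistent = 0;
        for (int i = 0; i < runs; i++) {
            ToyJournal jn = new ToyJournal();
            jn.acceptRecovery("old", () -> {});  // earlier recovery, no faults
            try {
                jn.acceptRecovery("new", injector);
            } catch (IllegalStateException injected) {
                // simulated crash between the two steps
            }
            if (!jn.isConsistent()) {
                inconsistent++;
            }
        }
        return inconsistent;
    }

    public static void main(String[] args) {
        int bad = countInconsistentRuns(1000, randomFaults(42L, 0.1));
        System.out.println("inconsistent runs out of 1000: " + bad);
    }
}
```

Seeding the random generator keeps a failing run reproducible, which helps when an inconsistency only shows up occasionally.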

Attachments

hdfs-3955.txt, 19/Sep/12 03:13, 25 kB, Todd Lipcon

People

Assignee: Todd Lipcon

Reporter: Todd Lipcon

Votes: 0

Watchers: 3

Dates

Created: 19/Sep/12 02:48

Updated: 19/Sep/12 18:57

Resolved: 19/Sep/12 18:57