[HDFS-3906] QJM: quorum timeout on failover with large log segment - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: QuorumJournalManager (HDFS-3077)
Fix Version/s: QuorumJournalManager (HDFS-3077)
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

While running stress tests, I ran into a failover problem that occurs when the current edit log segment written by the old active is large. With a 327MB log segment containing 6.4M transactions, the JournalNode (JN) took ~11 seconds to read and validate it during the recovery step. This exceeded the 10-second timeout for createNewEpoch, so the recovery failed.
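
As a rough illustration of the numbers above, the following sketch estimates how large a segment could be validated inside the timeout. It is back-of-the-envelope arithmetic only, not code from the patch: the function names are hypothetical, and it assumes validation time scales linearly with the transaction count, which the report does not state.

```python
# Figures taken from the description above.
TXN_COUNT = 6_400_000        # transactions in the 327MB segment
VALIDATE_SECONDS = 11.0      # observed read-and-validate time on the JN
TIMEOUT_SECONDS = 10.0       # timeout that the recovery step ran into


def validation_rate(txns: int, seconds: float) -> float:
    """Transactions validated per second (hypothetical helper)."""
    return txns / seconds


def max_txns_within(timeout: float, rate: float) -> int:
    """Largest transaction count that fits in the timeout,
    ASSUMING validation cost is linear in the number of transactions."""
    return int(timeout * rate)


rate = validation_rate(TXN_COUNT, VALIDATE_SECONDS)
limit = max_txns_within(TIMEOUT_SECONDS, rate)
exceeds_timeout = VALIDATE_SECONDS > TIMEOUT_SECONDS

print(f"approx. rate: {rate:,.0f} txns/s")
print(f"approx. largest segment within timeout: {limit:,} txns")
print(f"observed segment exceeds timeout: {exceeds_timeout}")
```

Under that linear assumption, any segment much beyond roughly 5.8M transactions would miss a 10-second deadline on comparable hardware, which is consistent with the failure reported here.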

Attachments

hdfs-3906.txt (24 kB), attached by Todd Lipcon, 10/Sep/12 23:13

People

Assignee: Todd Lipcon

Reporter: Todd Lipcon

Votes: 0

Watchers: 3

Dates

Created: 07/Sep/12 21:23

Updated: 11/Sep/12 06:32

Resolved: 11/Sep/12 06:32