Hadoop HDFS

QJM should validate startLogSegment() more strictly

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0, 2.1.0-beta
    • Fix Version/s: None
    • Component/s: qjm
    • Labels: None

      Description

      On a small handful of occasions we've seen a case where one of the NNs in an HA cluster ends up with an fsimage checkpoint that falls in the middle of an edit segment. We're not sure yet how this happens, but one issue can happen as a result:

      • Node has fsimage_500. Cluster has edits_1-1000, edits_1001_inprogress
      • Node restarts, loads fsimage_500
      • Node wants to become active. It calls selectInputStreams(500). Currently, this API logs a WARN that 500 falls in the middle of the 1-1000 segment, but continues and returns no results.
      • Node calls startLogSegment(501).

      Currently, the QJM will accept this (incorrectly). The node then crashes when it first tries to journal a real transaction, but it ends up leaving the edits_501_inprogress lying around, potentially causing more issues later.
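The stricter validation the title asks for can be sketched as a simple precondition check before opening a segment. This is an illustrative outline, not the actual HDFS-5058 patch; the field names (curSegmentTxId, highestWrittenTxId) are hypothetical stand-ins for the journal's real bookkeeping.

```java
// Minimal sketch of strict startLogSegment() validation on the journal side.
public class StartSegmentCheck {

    static class JournalState {
        long curSegmentTxId = -1;    // txid that opened the current segment; -1 if none open
        long highestWrittenTxId = 0; // last txid durably written to the journal

        void startLogSegment(long txid) {
            if (curSegmentTxId != -1) {
                throw new IllegalStateException("Can't start segment at txid " + txid
                        + ": segment starting at txid " + curSegmentTxId + " is still open");
            }
            if (txid != highestWrittenTxId + 1) {
                // In the scenario above this check fires: the NN asks for 501
                // while the journal has already written edits up to 1000.
                throw new IllegalStateException("Can't start segment at txid " + txid
                        + ": expected " + (highestWrittenTxId + 1));
            }
            curSegmentTxId = txid;
        }
    }

    public static void main(String[] args) {
        JournalState js = new JournalState();
        js.highestWrittenTxId = 1000;
        try {
            js.startLogSegment(501); // the buggy request from the scenario above
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        js.startLogSegment(1001);    // valid: immediately follows txid 1000
    }
}
```

With a check like this, the node fails fast at startLogSegment(501) instead of crashing later and leaving edits_501_inprogress behind.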

      1. hdfs-5058.txt (6 kB, Todd Lipcon)

        Activity

        Todd Lipcon added a comment -

        Fixing this at the QJM side is pretty easy - just need to add a few more checks.

        We should also re-evaluate the selectInputStreams() API when called in the middle of a segment. Perhaps it should return the full segment, and fast-forward into it to the correct transaction? That would have also helped this. But, either way, the extra sanity checks are valuable.
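The fast-forward idea above can be sketched with plain lists of txids standing in for real edit-log streams. This is a toy model, not the HDFS API; selectInputStream and segmentTxIds here are hypothetical names for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class FastForwardSketch {
    // Illustrative finalized segment: txids first..last inclusive.
    static List<Long> segmentTxIds(long first, long last) {
        List<Long> txids = new ArrayList<>();
        for (long t = first; t <= last; t++) txids.add(t);
        return txids;
    }

    // Instead of refusing a mid-segment read, return the whole segment
    // and fast-forward past the records before fromTxId.
    static List<Long> selectInputStream(List<Long> segment, long fromTxId) {
        List<Long> result = new ArrayList<>();
        for (long t : segment) {
            if (t >= fromTxId) result.add(t); // skip txids already in the fsimage
        }
        return result;
    }

    public static void main(String[] args) {
        // The scenario from the description: fsimage_500 plus segment 1-1000.
        List<Long> stream = selectInputStream(segmentTxIds(1, 1000), 501);
        System.out.println(stream.get(0) + ".." + stream.get(stream.size() - 1));
        // prints "501..1000"
    }
}
```

Under this behavior, selectInputStreams(500) would hand back txids 501-1000 rather than an empty result, so the node would replay the tail of the segment instead of trying to start a new one mid-stream.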

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12595650/hdfs-5058.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4759//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4759//console

        This message is automatically generated.

        Fengdong Yu added a comment -

        I suppose you restarted HDFS on the standby NN, right?

        Todd Lipcon added a comment -

        Yep, the problem occurs if you restart the SBN and then try to transition it to active after you've loaded an fsimage that fell in the middle of a log segment.

        Fengdong Yu added a comment -

        Todd,
        I looked at the patch; it adds more condition checks and throws exceptions. I'd just like to offer some advice:

        Even if we add more checks and throw the related exceptions, a new administrator who restarts the SBN and runs into these exception messages won't know what to do; he/she will only know that the SBN cannot start normally, or cannot transition to active.

        So could you set a boolean exception flag, initially false, and on each failed check just set the flag to true instead of throwing? Then throw a single exception after all the checks have finished, and add an additional explanation to the message: "please copy {namenode.name.dir}/* from the active NN to the standby NN to solve this problem."

        Todd Lipcon added a comment -

        Hi Fengdong. I think improving the user experience of broken setups is a different task than this JIRA, which is just a bug fix. I don't want to scope-creep this, since it's an important fix for data safety.

        Additionally, always telling the admin to copy the data dir between nodes is dangerous – once we're in an inconsistent state, an expert should really look at it to determine the correct recovery. Giving resolution advice in an error message is risky: since we're already in a bad state, we may end up giving the wrong advice.

        Fengdong Yu added a comment -

        Yes, Todd, I absolutely agree with you.

        To solve this problem for good, we should sync all transactions from the active NN to the SBN while shutting down HDFS, right?
        If so, can we open another JIRA for it?

        Todd Lipcon added a comment -

        HDFS-5074 explains the way in which we end up with a mid-segment checkpoint and should also solve this issue – it will return the 1-1000 segment from selectInputStreams and properly read it at startup. But, this fix is still good to add as an extra safety guard.


          People

          • Assignee: Todd Lipcon
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 8
