[HADOOP-8217] Edge case split-brain race in ZK-based auto-failover - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.24.0
Fix Version/s: None
Component/s: auto-failover, ha
Labels: None

Target Version/s:

Auto Failover (HDFS-3042)

Description

As discussed in HADOOP-8206, the current design for automatic failover has the following race:

1. ZKFC1 gets the active lock.
2. ZKFC1 is about to send transitionToActive() when its machine freezes (e.g., GC pause plus swapping).
3. ZKFC1 loses its ZK lock, and ZKFC2 gets the ZK lock.
4. ZKFC2 calls transitionToStandby() on NN1 and transitions NN2 to active.
5. ZKFC1 wakes up from the pause and calls transitionToActive() on NN1, so NN1 and NN2 are now both active (split-brain).

This is rare, since it requires ZKFC1 to freeze for longer than its ZK session timeout, but it is worth fixing because the results can be disastrous.
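
The five steps above can be replayed deterministically in a small sketch. This is only an illustration of the ordering problem, not the failover controller's actual code: `FakeNameNode`, `FakeLock`, and `replayRace` are hypothetical stand-ins invented for this example, and the "freeze" is modeled simply by letting ZKFC1 act on a lock check it made before the lock changed hands.

```java
public class SplitBrainRaceSketch {

    /** Hypothetical stand-in for a NameNode's HA state (not a real Hadoop class). */
    static final class FakeNameNode {
        boolean active = false;
        void transitionToActive()  { active = true; }
        void transitionToStandby() { active = false; }
    }

    /** Hypothetical stand-in for the ZK active lock: records only the current holder. */
    static final class FakeLock {
        String holder;
    }

    /** Replays the five steps in order and returns how many NameNodes end up active. */
    static int replayRace() {
        FakeNameNode nn1 = new FakeNameNode();
        FakeNameNode nn2 = new FakeNameNode();
        FakeLock lock = new FakeLock();

        // Step 1: ZKFC1 gets the active lock.
        lock.holder = "ZKFC1";

        // Step 2: ZKFC1 decides to activate NN1 because it holds the lock,
        // then freezes before the call is actually sent.
        boolean zkfc1DecidedToActivate = "ZKFC1".equals(lock.holder);

        // Step 3: ZKFC1's session expires during the freeze; ZKFC2 gets the lock.
        lock.holder = "ZKFC2";

        // Step 4: ZKFC2 moves NN1 to standby and NN2 to active.
        nn1.transitionToStandby();
        nn2.transitionToActive();

        // Step 5: ZKFC1 wakes up and acts on its stale decision
        // without re-checking who holds the lock.
        if (zkfc1DecidedToActivate) {
            nn1.transitionToActive();
        }

        int activeCount = 0;
        if (nn1.active) activeCount++;
        if (nn2.active) activeCount++;
        return activeCount;
    }

    public static void main(String[] args) {
        // prints "active NameNodes after race: 2"
        System.out.println("active NameNodes after race: " + replayRace());
    }
}
```

The sketch ends with two active NameNodes because nothing ties the transitionToActive() call in step 5 to the lock state at the moment the call takes effect; the lock check and the call are separated by the freeze.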

Attachments

hadoop-8217-testcase.txt (10 kB), attached by Todd Lipcon, 30/Mar/12 05:44

Issue Links

is related to

HADOOP-13515 Redundant transitionToActive call can cause a NameNode to crash (Open)

relates to

HADOOP-8206 Common portion of ZK-based failover controller (Closed)

HDFS-3042 Automatic failover support for NN HA (Closed)

People

Assignee: Todd Lipcon

Reporter: Todd Lipcon

Votes: 0

Watchers: 10

Dates

Created: 26/Mar/12 22:23

Updated: 18/Aug/16 10:23