[ZOOKEEPER-1865] Fix retry logic in Learner.connectToLeader() - ASF JIRA

Details

Type: Bug
Status: Reopened
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: server
Labels: None

Description

We observed an unusually long leader election today in one of our production ensembles.

Here is the description of the event.

Before the old leader went down, it was able to send out its notification message, so 3 out of 5 servers (including the old leader itself) elected the old leader as the leader for the next epoch. While the old leader was being rebooted, the 2 other machines kept trying to connect to it, so a quorum could not form until those 2 machines gave up and moved on to the next round of leader election.

This is because Learner.connectToLeader() uses simple retry logic. The contract for this method is that it should never spend longer than initLimit trying to connect to the leader. In our outage, each sock.connect() call probably blocked for initLimit, and it is called 5 times, so the method as a whole could have taken roughly five times initLimit.
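
One way to honor that contract is to compute a single deadline up front and give each attempt only the time that remains. The following is a minimal sketch under stated assumptions, not the code in the attached patches: the class `BoundedConnect`, the methods `remainingMs` and `connectWithBudget`, and the parameters `totalBudgetMs`, `maxAttempts`, and `pauseMs` are hypothetical names chosen for illustration, with `totalBudgetMs` standing in for the initLimit budget expressed in milliseconds.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of ZooKeeper.
final class BoundedConnect {

    /** Milliseconds left until deadlineNanos, clamped to [0, Integer.MAX_VALUE]. */
    static int remainingMs(long deadlineNanos, long nowNanos) {
        long leftMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - nowNanos);
        return (int) Math.max(0L, Math.min((long) Integer.MAX_VALUE, leftMs));
    }

    /**
     * Tries to connect up to maxAttempts times, but never spends more than
     * totalBudgetMs overall (assumed to correspond to initLimit in ms).
     */
    static Socket connectWithBudget(InetSocketAddress addr, int totalBudgetMs,
                                    int maxAttempts, long pauseMs)
            throws IOException, InterruptedException {
        // One deadline shared by all attempts; nanoTime is monotonic.
        long deadline = System.nanoTime()
                + TimeUnit.MILLISECONDS.toNanos(totalBudgetMs);
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int timeout = remainingMs(deadline, System.nanoTime());
            if (timeout == 0) {
                // Budget exhausted. Also avoids connect(addr, 0),
                // where a timeout of 0 means "wait indefinitely".
                break;
            }
            Socket sock = new Socket();
            try {
                sock.connect(addr, timeout); // bounded by the remaining budget
                return sock;
            } catch (IOException e) {
                last = e;
                try {
                    sock.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if closing a failed socket fails.
                }
            }
            // Pause between attempts without sleeping past the deadline.
            long sleepMs = Math.min(pauseMs,
                    (long) remainingMs(deadline, System.nanoTime()));
            if (sleepMs > 0) {
                Thread.sleep(sleepMs);
            }
        }
        throw new IOException("Could not connect to " + addr
                + " within " + totalBudgetMs + " ms", last);
    }
}
```

With this shape, a leader that silently drops connection attempts costs the caller about `totalBudgetMs` in total rather than `maxAttempts` times the per-attempt timeout, because a slow first attempt leaves less time for the later ones.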

Attachments

- ZOOKEEPER-1865-testfix.patch, 7 kB, 14/Mar/15 23:42, Camille Fournier
- ZOOKEEPER-1865-nanoTime.patch, 6 kB, 31/Jan/15 00:13, Jared Cantwell
- ZOOKEEPER-1865.patch, 4 kB, 08/Jul/14 20:25, Edward Carter

People

Assignee: Edward Carter

Reporter: Thawan Kooburat

Votes: 0

Watchers: 10

Dates

Created: 22/Jan/14 00:51

Updated: 03/Feb/22 08:50