[ZOOKEEPER-4220] Potential redundant connection attempts during leader election - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.5.9, 3.6.2
Fix Version/s: 3.5.10, 3.6.3, 3.7.0, 3.8.0
Component/s: server
Labels:
- pull-request-available

Description

We've seen a few failures or long delays in electing a new leader after the previous leader's host went through a hard reset (as opposed to only the service process going down, in which case connection attempts fail immediately and don't have to wait for a timeout). Symptoms are similar to https://issues.apache.org/jira/browse/ZOOKEEPER-2164. Reducing cnxTimeout from 5 to 1.5 seconds makes the problem much less frequent, but doesn't fix it completely. We are still using an old ZooKeeper version (3.5.5), and the new async connect feature will presumably avoid it.

But we noticed a pattern of twice the expected number of connection attempts to the same downed instance in the log, and it appears to be due to a code glitch in QuorumCnxManager.java:

synchronized void connectOne(long sid) {
    ...
    if (lastCommittedView.containsKey(sid)) {
        knownId = true;
        if (connectOne(sid, lastCommittedView.get(sid).electionAddr))
            return;
    }
    if (lastSeenQV != null && lastProposedView.containsKey(sid)
            && (!knownId || (lastProposedView.get(sid).electionAddr !=   <----
            lastCommittedView.get(sid).electionAddr))) {
        knownId = true;
        if (connectOne(sid, lastProposedView.get(sid).electionAddr))
            return;
    }

The electionAddr values should presumably be compared with !equals(...) rather than !=. As written, the check compares object references, so connectOne is invoked an extra time even in the common case where the two addresses match.
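To illustrate the difference, here is a minimal, self-contained sketch of the check. It is not the ZooKeeper code: the class ElectionAddrCheck, the tryConnect helper, the useEquals flag and the host name zk3.example are all hypothetical, and the two views are modeled as plain maps from server id to InetSocketAddress. It counts connection attempts to a downed peer when the committed and proposed views hold equal addresses in distinct objects.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the two-view check; not the actual QuorumCnxManager code.
final class ElectionAddrCheck {
    private static int attempts;

    // Stand-in for a connect to a downed peer: always fails, only counts the attempt.
    private static boolean tryConnect(InetSocketAddress addr) {
        attempts++;
        return false;
    }

    // Returns how many connection attempts are made for server id `sid`.
    // useEquals=false mimics the reported reference comparison (!=),
    // useEquals=true mimics the suggested value comparison (!equals).
    static int connectOne(long sid,
                          Map<Long, InetSocketAddress> committedView,
                          Map<Long, InetSocketAddress> proposedView,
                          boolean useEquals) {
        attempts = 0;
        boolean knownId = false;
        if (committedView.containsKey(sid)) {
            knownId = true;
            if (tryConnect(committedView.get(sid))) {
                return attempts;
            }
        }
        if (proposedView.containsKey(sid)) {
            InetSocketAddress proposed = proposedView.get(sid);
            InetSocketAddress committed = committedView.get(sid);
            if (!knownId || (useEquals ? !proposed.equals(committed)
                                       : proposed != committed)) {
                tryConnect(proposed);
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        Map<Long, InetSocketAddress> committedView = new HashMap<>();
        Map<Long, InetSocketAddress> proposedView = new HashMap<>();
        // Same host and port, but two distinct objects (createUnresolved avoids DNS lookups).
        committedView.put(3L, InetSocketAddress.createUnresolved("zk3.example", 3888));
        proposedView.put(3L, InetSocketAddress.createUnresolved("zk3.example", 3888));

        System.out.println("attempts with != : "
                + connectOne(3L, committedView, proposedView, false));
        System.out.println("attempts with !equals: "
                + connectOne(3L, committedView, proposedView, true));
    }
}
```

In this model the reference comparison makes two attempts against the same address and the value comparison makes one. Against a host that is down, each extra attempt can cost up to a full connection timeout.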

The code around it has changed recently, but the check itself still exists at the top of master. It might not matter as much with the async connects, but perhaps it helps even then.

Issue Links

links to

GitHub Pull Request #1615

GitHub Pull Request #1630

GitHub Pull Request #1631

People

Assignee: Mate Szalay-Beko

Reporter: Alex Mirgorodskiy

Votes: 1

Watchers: 3

Dates

Created: 23/Feb/21 18:26

Updated: 28/Mar/21 08:54

Resolved: 06/Mar/21 20:54

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3h 50m