[ZOOKEEPER-2678] Large databases take a long time to regain a quorum - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.9, 3.5.2
Fix Version/s: 3.4.10, 3.5.3, 3.6.0
Component/s: server
Labels: None

Description

I know this is long, but please hear me out.

I recently inherited a massive ZooKeeper ensemble. The snapshot is 3.4 GB on disk. Because of its size we have been running into a number of issues. There are lots of problems that we hope to fix by tuning GC and so on, but the big one right now, which is blocking progress on the rest, is that when we lose a quorum because the leader left, for whatever reason, it can take well over 5 minutes for a new quorum to be established. So we cannot tune the leader without risking downtime.

We tracked down where the time was being spent and found that each server was clearing its database, so that it had to be read back in again, before leader election even started. Then, as part of the sync phase, each server writes out a snapshot to checkpoint the progress it made during the sync.
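To make the two costs described above concrete, here is a toy model of a single server going through one election. It is only an illustrative sketch: `ToyServer`, `run_election`, and the `retain_db` and `snapshot_after_sync` flags are hypothetical names, not ZooKeeper classes or settings, and the model ignores the consistency questions that retaining state across an election raises (see the linked issues below).

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Hypothetical stand-in for a quorum member; not a ZooKeeper class."""
    snapshot_on_disk: dict
    db: dict = field(default_factory=dict)
    loads_from_disk: int = 0
    snapshots_written: int = 0

    def load_db(self):
        # For a multi-GB snapshot this is the slow step: everything is
        # deserialized from disk into memory.
        self.db = dict(self.snapshot_on_disk)
        self.loads_from_disk += 1

    def start_election(self, retain_db):
        # Reported behaviour (retain_db=False): clear the in-memory
        # database and read it back in before the election starts.
        if not retain_db or not self.db:
            self.db = {}
            self.load_db()

    def finish_sync(self, snapshot_after_sync):
        # Reported behaviour (snapshot_after_sync=True): write a full
        # snapshot to checkpoint the sync.
        if snapshot_after_sync:
            self.snapshot_on_disk = dict(self.db)
            self.snapshots_written += 1


def run_election(retain_db, snapshot_after_sync):
    """Return (full loads from disk, snapshots written) for one election."""
    server = ToyServer(snapshot_on_disk={"/a": b"1", "/b": b"2"})
    server.load_db()  # initial load at startup
    server.start_election(retain_db)
    server.finish_sync(snapshot_after_sync)
    return server.loads_from_disk, server.snapshots_written
```

In this model, `run_election(False, True)` counts one extra full load and one full snapshot write per election compared with `run_election(True, False)`. With a 3.4 GB snapshot, those two steps are where the minutes would go.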

I will be putting up a patch shortly with some proposed changes in it.

Issue Links

breaks

ZOOKEEPER-2845 Data inconsistency issue due to retain database in leader election (Resolved)

causes

ZOOKEEPER-3023 Flaky test: org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalFollowerRunWithDiff (Resolved)

ZOOKEEPER-3911 Data inconsistency caused by DIFF sync uncommitted log (Closed)

relates to

ZOOKEEPER-1674 There is no need to clear & load the database across leader election (Open)

links to

GitHub Pull Request #157

GitHub Pull Request #158

GitHub Pull Request #159

People

Assignee: Robert Joseph Evans

Reporter: Robert Joseph Evans

Votes: 1

Watchers: 12

Dates

Created: 26/Jan/17 15:16

Updated: 19/Sep/24 06:29

Resolved: 14/Feb/17 18:06

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

0.5h