Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-2678

Large databases take a long time to regain a quorum

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 3.4.9, 3.5.2
    • 3.4.10, 3.5.3, 3.6.0
    • server
    • None

    Description

      I know this is long but please here me out.

      I recently inherited a massive zookeeper ensemble. The snapshot is 3.4 GB on disk. Because of its massive size we have been running into a number of issues. There are lots of problems that we hope to fix with tuning GC etc, but the big one right now that is blocking us making a lot of progress on the rest of them is that when we lose a quorum because the leader left, for what ever reason, it can take well over 5 mins for a new quorum to be established. So we cannot tune the leader without risking downtime.

      We traced down where the time was being spent and found that each server was clearing the database so it would be read back in again before leader election even started. Then as part of the sync phase each server will write out a snapshot to checkpoint the progress it made as part of the sync.

      I will be putting up a patch shortly with some proposed changes in it.

      Attachments

        Issue Links

          Activity

            People

              revans2 Robert Joseph Evans
              revans2 Robert Joseph Evans
              Votes:
              1 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 0.5h
                  0.5h