ZooKeeper / ZOOKEEPER-3757

Transaction log sync can take 20+ seconds after leader election when there is a large snapCount


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.5.6
    • Fix Version/s: None
    • Component/s: leaderElection
    • Labels: None

    Description

      Short overview:

      If you have a large snapCount (we are using 10,000,000) you can end up with a very large transaction log (ours are between 1 GB and 1.5 GB), which can cause the sync between a newly elected leader and its followers to take 20+ seconds.  This stems from FileTxnIterator.getStorageSize() returning 0 in most cases, even when the transaction log is 1 GB.

       

      Long Explanation:

      A few years ago we had trouble with our ZooKeeper cluster having many short (100-500 ms) pauses during our peak traffic times.  These turned out to be caused by the leader taking a snapshot.  To solve this we upped the snapCount to 10,000,000 so that we weren't taking snapshots nearly as often.  We also made changes to reduce the size of our snapshots (from around 2.5 GB to ~500 MB).

      I don't remember what version of ZooKeeper we were using originally, but this was all working fine on 3.4.10; we started to have problems when we upgraded to 3.5.6 around 3 months ago.  We have a fairly high transaction rate and thus hit the zxid overflow about once a month, which forces a leader election.  On 3.4.10 this was fine, because leader election and syncing would finish within 2-4 seconds, which was low enough for us to basically ignore it.  However, after we upgraded to 3.5.6, the pauses we saw took between 15 and 30 seconds, which was unacceptable for us.

      For now, to work around this, I set zookeeper.forceSnapshotSync=true (yes, I know the comments say this is only supposed to be used for testing), which makes the leader sync followers from snapshots (only 10-50 MB) instead of the transaction log (1-1.5 GB).
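
      In case it helps anyone else hitting this, the workaround is just the system property above passed to the server JVM; for example (assuming the stock zkServer.sh/zkEnv.sh scripts, where SERVER_JVMFLAGS carries extra JVM options):

          # Added to the ZooKeeper server's JVM flags, e.g. via SERVER_JVMFLAGS in conf/zookeeper-env.sh.
          # The code comments describe this flag as intended for testing only.
          -Dzookeeper.forceSnapshotSync=true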

       

      Technical details:

      I tried taking a look at the code and I think I know why this happens.  From what I can tell, when a follower needs to sync with a leader, LearnerHandler.syncFollower() gets called on the leader.  It goes through a big if statement, but at one point it will call db.getProposalsFromTxnLog(peerLastZxid, sizeLimit).  That peerLastZxid could be a very old zxid if the follower hasn't taken a snapshot in a long time (i.e., has a large snapCount), and the sizeLimit will generally be 0.33 * the snapshot size (in my case around 10 MB).
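
      If it helps, this is roughly how I read that leader-side path in 3.5.6 (a condensed sketch, not the exact source; the method names are real, the bodies are simplified):

          // LearnerHandler.syncFollower() (simplified): decide how to bring the follower up to date.
          // calculateTxnLogSizeLimit() is roughly snapshotSizeFactor (default 0.33) times the size
          // of the newest snapshot, i.e. around 10 MB for us.
          long sizeLimit = db.calculateTxnLogSizeLimit();
          // peerLastZxid can be very old when the follower has not snapshotted in a long time.
          Iterator<Proposal> itr = db.getProposalsFromTxnLog(peerLastZxid, sizeLimit);

          // ZKDatabase.getProposalsFromTxnLog() (simplified): it is supposed to give up and force a
          // snapshot-based sync when the transaction log is too big...
          TxnIterator txnItr = snapLog.readTxnLog(peerLastZxid);
          if (sizeLimit > 0 && txnItr.getStorageSize() > sizeLimit) {
              return TxnLogProposalIterator.EMPTY_ITERATOR;
          }
          // ...but because getStorageSize() usually returns 0 (see below), this check passes even
          // when the current log is 1+ GB, and the whole log ends up being replayed to the follower.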

      Inside of that getProposalsFromTxnLog it will create a TxnIterator and then call getStorageSize() on it.  The problem is that this call to getStorageSize() will usually return 0.  The reason is that the FileTxnIterator class has a "current" log file that it is reading, this.logFile, and a list of files that it still has to iterate through, this.storedFiles.  The getStorageSize() function, however, only looks at the storedFiles list, so if the iterator has one large transaction log as the "current" log file and nothing in the storedFiles list, this method will return 0 even though there is a huge transaction log to sync.
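
      For reference, the size accounting is roughly this (simplified from FileTxnLog.FileTxnIterator; field and method names as I read them in 3.5.6):

          // FileTxnLog.FileTxnIterator (simplified)
          File logFile;              // the log file currently being read
          List<File> storedFiles;    // the remaining log files still to be iterated

          public long getStorageSize() {
              long sum = 0;
              // Only the not-yet-opened files are counted; this.logFile is ignored, so a single
              // huge "current" log contributes nothing and the estimate comes out as 0.
              for (File f : storedFiles) {
                  sum += f.length();
              }
              return sum;
          }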

      One other side effect of this problem is that even bouncing a follower can cause long (5-10 second) pauses, since the leader holds a read lock on the transaction log while it syncs up with the follower.
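
      The lock I'm referring to is the ZKDatabase log lock; as far as I can tell the sync decision runs under its read lock, roughly like this (again simplified):

          // LearnerHandler.syncFollower() (simplified): the txn log is read while holding the
          // database's read lock, so a slow 1+ GB read can stall anything that needs the
          // corresponding write lock.
          ReentrantReadWriteLock lock = db.getLogLock();
          ReadLock rl = lock.readLock();
          try {
              rl.lock();
              // ... decide on DIFF/TRUNC/SNAP and queue packets, possibly reading the whole
              // current transaction log via getProposalsFromTxnLog() ...
          } finally {
              rl.unlock();
          }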

      While I know what the problem is, I don't know what the best solution is.  I'm willing to work on it, but I would appreciate suggestions.  One possible solution would be to include this.logFile in the getStorageSize() calculation; however, that could make the estimate overstate the amount of data in the iterator (possibly by a lot), and I don't know what the consequences of that are.  I'm not quite sure what a good way to get an accurate estimate would be.
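
      As a starting point for discussion, the naive version of that idea would look something like this (untested sketch; it deliberately over-counts, since the iterator may already be partway through this.logFile):

          // Possible (untested) change to FileTxnIterator.getStorageSize(): also count the log
          // file currently being read, so one huge "current" log no longer yields an estimate of 0.
          public long getStorageSize() {
              long sum = (logFile != null) ? logFile.length() : 0;
              for (File f : storedFiles) {
                  sum += f.length();
              }
              return sum;   // may overestimate: part of logFile may already have been consumed
          }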

       

          People

            Assignee: Unassigned
            Reporter: Alex Kaiser (alextkaiser)
            Votes: 0
            Watchers: 2
