  1. ZooKeeper
  2. ZOOKEEPER-4766

Ensure leader election time does not unnecessarily scale with tree size due to snapshotting



    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.9, 3.8.3
    • Fix Version/s: 3.5.9, 3.8.3
    • Component/s: leaderElection
    • Environment: General behavior, should occur in all environments


      Hi ZK community, this issue proposes a fix for a behavior that causes leader election time to scale unnecessarily with the amount of data in the ZK data tree.

      tl;dr: During leader election, the leader always saves a snapshot when loading its data tree. This snapshot seems unnecessary, even in the case where the leader needs to send an updated SNAP to a learner, since it serializes the tree before sending anyway. Snapshotting slows down leader election and increases ZK downtime significantly as more data is stored in the tree. This improvement is to avoid taking a snapshot so that this unnecessary downtime is avoided.

      During leader election, when the data is loaded by the tentatively elected (i.e. pre-finalized quorum) leader server, a snapshot of the tree is always taken. The loadData method is called from multiple places, but specifically in the context of leader election, it seems like the snapshotting step is unnecessary for the leader when loading data:

      • Because the leader has already loaded the tree at this point, we know that if it were to go down again, it could still recover to the current state from the same on-disk data, without using the snapshot taken in loadData()
      • There are no ongoing transactions until leader election is completed and the ZK ensemble is back up, so no data would be lost after the point at which the data tree is loaded
      • Once the ensemble is healthy and the leader is handling transactions again, new transactions are logged and the log is rolled over when needed anyway. So if the leader later recovers from a failure, the snapshot taken during loadData() offers no additional benefit over the initial snapshot (if one existed) and the transaction log that the leader originally loaded its data from in loadData()
      • When the leader is deciding to send a SNAP or a DIFF to a learner, a SNAP is serialized and sent if and only if it is needed. The snapshot taken in loadData() again does not seem to be beneficial here.
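The recovery argument in the bullets above can be illustrated with a toy model. This is a minimal sketch, not ZooKeeper code: `RecoverySketch`, `Txn`, `Snapshot`, and `recover` are hypothetical names, the tree is reduced to a flat key/value map, and only writes are modeled. It assumes recovery means "load the latest snapshot, then replay newer log entries", and shows that a snapshot written right after loading leads to the same recovered tree as the data the leader loaded from.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of snapshot-plus-log recovery. Not ZooKeeper code; all names are hypothetical. */
final class RecoverySketch {

    /** A logged write: set key to value at transaction id zxid. */
    static final class Txn {
        final long zxid;
        final String key;
        final String value;

        Txn(long zxid, String key, String value) {
            this.zxid = zxid;
            this.key = key;
            this.value = value;
        }
    }

    /** A point-in-time copy of the tree and the last zxid it reflects. */
    static final class Snapshot {
        final long lastZxid;
        final Map<String, String> data;

        Snapshot(long lastZxid, Map<String, String> data) {
            this.lastZxid = lastZxid;
            this.data = new TreeMap<>(data);
        }
    }

    /** Rebuilds the tree: start from the snapshot, then replay newer log entries in order. */
    static Map<String, String> recover(Snapshot snapshot, List<Txn> log) {
        Map<String, String> tree = new TreeMap<>(snapshot.data);
        for (Txn txn : log) {
            if (txn.zxid > snapshot.lastZxid) {
                tree.put(txn.key, txn.value);
            }
        }
        return tree;
    }

    /** Returns true if recovery gives the same tree with and without the load-time snapshot. */
    static boolean loadTimeSnapshotIsRedundant() {
        Map<String, String> atZxid2 = new TreeMap<>();
        atZxid2.put("/a", "1");
        atZxid2.put("/b", "2");
        Snapshot onDisk = new Snapshot(2L, atZxid2);
        List<Txn> log = Arrays.asList(
                new Txn(1L, "/a", "1"), new Txn(2L, "/b", "2"),
                new Txn(3L, "/a", "3"), new Txn(4L, "/c", "4"));

        // What the leader holds in memory after loading its data.
        Map<String, String> loaded = recover(onDisk, log);
        // The extra snapshot that would be written at load time.
        Snapshot loadTime = new Snapshot(4L, loaded);

        // A later crash and restart reaches the same tree either way.
        return recover(onDisk, log).equals(recover(loadTime, log));
    }

    public static void main(String[] args) {
        System.out.println(loadTimeSnapshotIsRedundant());
    }
}
```

In this model the load-time snapshot only shortens a later replay; it does not change what can be recovered, which is the point the bullets make.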

      The PR for this fix only skips this snapshotting step in loadData() during leader election. The behavior of the function remains the same for other usages. With this change, during leader election the data tree would only be serialized when sending a SNAP to a learner. In other scenarios, no data tree serialization would be needed at all. In both cases, there is a significant reduction in the time spent in leader election.
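To make the effect of the change concrete, here is a small counting model. It is a sketch under stated assumptions, not the code in the PR: `LeaderLoadSketch`, `SyncMode`, `syncLearner`, and `simulate` are hypothetical names, loading from disk is omitted, and each SNAP sent to a learner is assumed to cost one tree serialization.

```java
/** Counts tree serializations during a simulated election. Hypothetical names throughout. */
final class LeaderLoadSketch {

    /** How a learner is brought up to date in this model. */
    enum SyncMode { DIFF, SNAP }

    private final boolean snapshotOnLoad;
    private int serializations = 0;

    LeaderLoadSketch(boolean snapshotOnLoad) {
        this.snapshotOnLoad = snapshotOnLoad;
    }

    /** Stand-in for the leader loading its data; restoring from disk is omitted. */
    void loadData() {
        if (snapshotOnLoad) {
            serializeTree(); // the snapshot the issue proposes to skip
        }
    }

    /** The tree is serialized for a learner only when a SNAP is required. */
    void syncLearner(SyncMode mode) {
        if (mode == SyncMode.SNAP) {
            serializeTree();
        }
    }

    private void serializeTree() {
        serializations++; // assumed cost grows with tree size in a real server
    }

    /** Runs one election and returns how many times the tree was serialized. */
    static int simulate(boolean snapshotOnLoad, SyncMode... learners) {
        LeaderLoadSketch leader = new LeaderLoadSketch(snapshotOnLoad);
        leader.loadData();
        for (SyncMode mode : learners) {
            leader.syncLearner(mode);
        }
        return leader.serializations;
    }

    public static void main(String[] args) {
        System.out.println(simulate(true, SyncMode.DIFF, SyncMode.DIFF));
        System.out.println(simulate(false, SyncMode.DIFF, SyncMode.DIFF));
    }
}
```

With all learners on DIFF, the model serializes once when snapshotting on load and not at all when it is skipped, matching the two scenarios described above.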

      If my understanding of any of this is incorrect, or if I'm failing to consider some other aspect of the process, please let me know. The PR can also be updated to enable or disable this behavior via a Java system property.
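A property-based toggle of the kind mentioned above could look roughly like this. The property name `zookeeper.example.snapshotOnLeaderLoad` is made up for illustration and is not an existing ZooKeeper option; the sketch assumes the current behavior (snapshot on load) stays the default.

```java
/** Reads a hypothetical on/off switch for the load-time snapshot from a system property. */
final class SnapshotToggleSketch {

    /** Hypothetical property name; not an existing ZooKeeper setting. */
    static final String PROPERTY = "zookeeper.example.snapshotOnLeaderLoad";

    /** Returns true unless the property is set to something other than "true". */
    static boolean snapshotOnLeaderLoad() {
        // Defaulting to "true" keeps today's behavior when the property is absent.
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "true"));
    }

    public static void main(String[] args) {
        System.out.println(PROPERTY + " = " + snapshotOnLeaderLoad());
    }
}
```

Under this assumption, starting the JVM with `-Dzookeeper.example.snapshotOnLeaderLoad=false` would opt in to skipping the snapshot, and leaving the property unset would change nothing.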





              Assignee: Unassigned
              Reporter: Rishabh Rai (rishabhr)



                Time Tracking

                  Original Estimate - 24h
                  Remaining Estimate - 23h 50m