ZooKeeper / ZOOKEEPER-3972

Convergence failure when a follower tries to resync with a leader that has an incomplete commitlog

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.8
    • Fix Version/s: None
    • Component/s: server
    • Labels: None

      Description

      A leader may have an incomplete commitlog because it resynced with the old leader via SNAPSHOT replication.

      A follower may then try to resync with this leader after having missed some transactions that the leader does not have in its commitlog either.

      The two nodes then resync via txnlog + commitlog. This leads to convergence failure, because the leader never sends the missing transactions that are not in its commitlog.

      Here are the abstract steps to reproduce the bug. I attached a patch with a test case that reproduces it.

      Initially, nodes A, B, and C are all in sync.
      1. Node A crashes; setData 0x11 on B and C
      2. Node B and C crash
      3. Node A and B restart
      4. Node A crashes; setData 0x21 on B
      5. Node B crashes
      6. Node B and C restart
      7. Node C crashes; setData 0x32 on B
      8. Node A and C restart
      9. Node B restarts

      At step 6, C is a follower that gets a snapshot from B, so C does not have transaction 0x21 in its commitlog (only in the snapshot).

      At step 8, C is the leader but does not have 0x21 in its commitlog, so A never receives it.

      In the end, 0x21 exists only on B and C, not on A.
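
      The steps above can be replayed with a small toy model. This is only a Python sketch, not ZooKeeper code: the Node class, sync_snapshot, sync_from_log, and run_scenario are hypothetical names introduced for illustration. It assumes B leads at step 3, uses the sync mode the report describes at each step, and does not model B rejoining at step 9.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy replica. `applied` models the data tree; `log` models txnlog + commitlog."""
    name: str
    applied: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def write(self, zxid):
        # A regular write is both logged and applied.
        self.log.append(zxid)
        self.applied.add(zxid)

    def last_zxid(self):
        return max(self.applied, default=0)


def sync_snapshot(leader, follower):
    # Snapshot replication copies the state, but the follower's log
    # gains no entries for the transactions contained in the snapshot.
    follower.applied = set(leader.applied)


def sync_from_log(leader, follower):
    # Log-based replication sends only the logged transactions that are
    # newer than the follower's last zxid.
    start = follower.last_zxid()
    for zxid in leader.log:
        if zxid > start:
            follower.write(zxid)


def run_scenario():
    a, b, c = Node("A"), Node("B"), Node("C")
    # Step 1: A is down; 0x11 is written on B and C.
    b.write(0x11)
    c.write(0x11)
    # Step 3: A and B restart; A catches up from B's log (assumes B leads).
    sync_from_log(b, a)
    # Step 4: A is down; 0x21 is written on B.
    b.write(0x21)
    # Step 6: B and C restart; C receives a snapshot from B.
    sync_snapshot(b, c)
    # Step 7: C is down; 0x32 is written on B.
    b.write(0x32)
    # Step 8: A and C restart; C leads and syncs A from its log.
    sync_from_log(c, a)
    return a, b, c


a, b, c = run_scenario()
for node in (a, b, c):
    print(node.name, "has 0x21:", 0x21 in node.applied)
```

      In this model C's log still ends at 0x11 after step 6, so the log-based sync at step 8 has nothing to send to A.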

      I think the fix belongs in LearnerHandler's syncFollower method, as follows:
      1. Check the last transaction the leader has in its txnlog + commitlog
      2. If the leader's state is more recent than that transaction, then it should use SNAPSHOT
      3. Otherwise, continue with txnlog + commitlog replication
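
      One possible reading of the three steps above, as a toy Python sketch. It is not the attached patch and not the real syncFollower code: choose_sync_mode, its parameters, and the returned labels are hypothetical. It assumes the comparison is between the newest zxid reflected in the leader's state and the newest zxid in the leader's txnlog + commitlog.

```python
def choose_sync_mode(leader_last_applied, leader_logged_zxids):
    """Pick a replication mode for a resyncing follower (toy model).

    leader_last_applied: newest zxid reflected in the leader's state.
    leader_logged_zxids: zxids present in the leader's txnlog + commitlog.
    """
    # Step 1: find the last transaction in txnlog + commitlog.
    last_logged = max(leader_logged_zxids, default=0)
    # Step 2: the state is ahead of the logs, so the logs are incomplete.
    if leader_last_applied > last_logged:
        return "SNAPSHOT"
    # Step 3: the logs reach the leader's latest state.
    return "TXNLOG+COMMITLOG"


# Leader C at step 8 of the reproduction: state at 0x21, logs end at 0x11.
print(choose_sync_mode(0x21, [0x11]))
# A leader whose logs reach its latest state.
print(choose_sync_mode(0x32, [0x11, 0x21, 0x32]))
```

      This sketch compares only the newest zxids, so it would not notice a gap that sits earlier in the leader's logs.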

      I attached a patch containing the proposed fix.

            People

            • Assignee: Unassigned
            • Reporter: anaud anaud