Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4444

Follower doesn't get synchronized after process restart

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.6.3
    • None
    • None
    • None

    Description

      Hi folks, I've got an issue with 3.6.3.
      I can't provide a simple test, because it seems to depend on timing in a cluster environment, but I tried to reduce the scenario as far as possible:

      • zookeeper cluster with 5 nodes, all of them Followers (no Observers)
      • start some parallel threads which do some writes in the cluster (e.g. create/delete znodes)
      • kill one of the zookeeper processes, let's say on node X (where node X is not the Leader)
      • restart zookeeper process on node X
      • wait a few seconds
      • kill zookeeper process on node X again
      • restart zookeeper process on node X again

      In some cases (every 3-4 runs) I see the following in the log of node X:

      After first restart of node X:

       WARN  persistence.FileTxnLog           - Current zxid 4294968525 is <= 4294969524 for 15
       WARN  persistence.FileTxnLog           - Current zxid 4294968526 is <= 4294969524 for 15
       WARN  persistence.FileTxnLog           - Current zxid 4294968527 is <= 4294969524 for 15
       ... (this kind of WARN is repeated some hundred times)
       WARN  quorum.SendAckRequestProcessor   - Closing connection to leader, exception during packet send java.net.SocketException: Socket closed ...
       ... (this kind of WARN is repeated some hundred times)
      

      After second restart of node X:

       ERROR persistence.FileTxnSnapLog       - 4294970146(highestZxid) > 4294969147(next log) for type 2
       WARN  server.DataTree                  - Message:Digests are not matching. Value is Zxid. Value:4294969147
       ERROR server.DataTree                  - First digest mismatch on txn: 360466402305310720,3870,4294969147,1639258399998,2
      , ...
      , expected digest is 2,1365261838770
      , actual digest is 1098406565142, 
       ERROR persistence.FileTxnSnapLog       - 4294970146(highestZxid) > 4294969148(next log) for type 2
       ERROR persistence.FileTxnSnapLog       - 4294970146(highestZxid) > 4294969149(next log) for type 5
       ERROR persistence.FileTxnSnapLog       - 4294970146(highestZxid) > 4294969150(next log) for type 2
       ... (this kind of ERROR is repeated some hundred times)
      

      And afterwards (in the actual application), zookepeer on node X seems to have a different view of the cluster state and doesn't get synchronized, at least for a few hours.
      This e.g. leads to phantom reads of znodes that were deleted a long time ago.
      (The resulting behaviour looks a little bit similar as described in ZOOKEEPER-3911.)

      This does not happen with zookeeper 3.6.2 !
      (at least I can't reproduce it with this version)

      Attachments

        Activity

          People

            Unassigned Unassigned
            andreasweber Andreas Weber
            Votes:
            3 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: