Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-2412

leader zk out of memory, and leader db lastZxid is not update when process set data.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.4.6
    • Fix Version/s: 3.4.8
    • Component/s: None
    • Labels:
      None
    • Environment:

      3 zk, the follower (myid=6) server is injected into the network delay fault for test, so this follower(id=6) data is behind leader, leader need send SNAP to this follower every time.

      Description

      the time begin to have problem is 20:59:47

      (1) out of memory log:

      2016-03-24 23:01:21,355 [myid:4] - INFO [LearnerHandler-/192.168.50.26:35112:LearnerHandler@330] - Follower sid: 6 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@5cf112b0
      2016-03-24 23:01:21,355 [myid:4] - INFO [LearnerHandler-/192.168.50.26:35112:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 23:01:21,355 [myid:4] - WARN [LearnerHandler-/192.168.50.26:35112:LearnerHandler@446] - Unhandled proposal scenario
      2016-03-24 23:01:21,355 [myid:4] - INFO [LearnerHandler-/192.168.50.26:35112:LearnerHandler@462] - Sending SNAP
      2016-03-24 23:01:21,893 [myid:4] - INFO [NIOServerCxn.Factory:/192.168.50.24:10540:NIOServerCnxnFactory@207] - Current connection (from /192.168.50.22 Cnxns = 4; totalCnxns = 15)
      2016-03-24 23:01:22,625 [myid:4] - WARN [NIOServerCxn.Factory:/192.168.50.24:10540:ZooKeeperServer@832] - Connection request from old client /192.168.50.22:49695; will be dropped if server is in r-o mode
      2016-03-24 23:01:23,283 [myid:4] - INFO [QuorumPeer[myid=4]/192.168.50.24:10540:Leader@493] - Shutting down
      2016-03-24 23:01:24,102 [myid:4] - INFO [QuorumPeer[myid=4]/192.168.50.24:10540:Leader@499] - Shutdown called
      2016-03-24 23:01:24,040 [myid:4] - INFO [SessionTracker:ZooKeeperServer@347] - Expiring session 0x453a6dc5b7a007e, timeout of 3500ms exceeded

      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "NIOServerCxn.Factory:/192.168.50.24:10540"
      2016-03-24 23:01:25,001 [myid:4] - WARN [QuorumPeer[myid=4]/192.168.50.24:10540:QuorumPeer@827] - QuorumPeer main thread exited

      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "LearnerHandler-/192.168.50.26:35112"

      Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "QuorumPeer[myid=4]/192.168.50.24:10540"
      2016-03-24 23:01:24,227 [myid:4] - ERROR [LearnerHandler-/192.168.50.26:35112:NIOServerCnxnFactory$1@44] - Thread LearnerHandler Socket[addr=/192.168.50.26,port=35112,localport=10550] tickOfNextAckDeadline:38310 synced?:true queuedPacketLength:7910 died
      2016-03-24 23:01:28,492 [myid:4] - INFO [main:QuorumPeerMain@93] - Exiting normally

      (2) this leader is very strange, db lastzxid not update, log as follow (after grep "Synchronizing with"), we can see max commit zxid from leader is not update any more. From the beginning of 20:59:47, leader lastZxid not update.

      2016-03-24 20:56:10,266 [myid:4] - INFO [LearnerHandler-/192.168.50.26:48439:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x90000333f minCommittedLog=0x90000314b peerLastZxid=0x90000280a
      2016-03-24 20:57:59,203 [myid:4] - INFO [LearnerHandler-/192.168.50.26:46956:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x900003398 minCommittedLog=0x9000031a4 peerLastZxid=0x90000280a
      2016-03-24 20:59:47,928 [myid:4] - INFO [LearnerHandler-/192.168.50.26:20601:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:01:26,475 [myid:4] - INFO [LearnerHandler-/192.168.50.26:29622:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:03:16,552 [myid:4] - INFO [LearnerHandler-/192.168.50.26:35717:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:03:44,427 [myid:4] - INFO [LearnerHandler-/192.168.50.26:48197:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:05:01,125 [myid:4] - INFO [LearnerHandler-/192.168.50.26:57826:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:06:54,187 [myid:4] - INFO [LearnerHandler-/192.168.50.26:30137:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:07:42,780 [myid:4] - INFO [LearnerHandler-/192.168.50.26:24255:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:08:41,279 [myid:4] - INFO [LearnerHandler-/192.168.50.26:40909:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:10:23,137 [myid:4] - INFO [LearnerHandler-/192.168.50.26:64166:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:11:50,003 [myid:4] - INFO [LearnerHandler-/192.168.50.26:56070:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:12:11,956 [myid:4] - INFO [LearnerHandler-/192.168.50.26:41423:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:13:08,286 [myid:4] - INFO [LearnerHandler-/192.168.50.26:26757:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:13:59,960 [myid:4] - INFO [LearnerHandler-/192.168.50.26:62785:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:15:41,103 [myid:4] - INFO [LearnerHandler-/192.168.50.26:53141:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:16:11,125 [myid:4] - INFO [LearnerHandler-/192.168.50.26:39551:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:17:25,541 [myid:4] - INFO [LearnerHandler-/192.168.50.26:24638:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:18:25,039 [myid:4] - INFO [LearnerHandler-/192.168.50.26:54723:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a
      2016-03-24 21:19:04,148 [myid:4] - INFO [LearnerHandler-/192.168.50.26:37450:LearnerHandler@385] - Synchronizing with Follower sid: 6 maxCommittedLog=0x9000033a1 minCommittedLog=0x9000031ad peerLastZxid=0x90000280a

      (3) we can see leader Snapshotting to the same file in different time, but receive different TxnZxid(0x900003b70 and 0x9000049dd), this show leader has not been updated lastZxid in db.

      Snapshotting 1:
      2016-03-24 21:33:27,214 [myid:4] - INFO [Snapshot Thread:FileTxnSnapLog@253] - Snapshotting: 0x9000033a1 to /opt/dsware/agent/zk/data/version-2/snapshot.9000033a1
      2016-03-24 21:33:27,333 [myid:4] - INFO [SyncThread:4:FileTxnLog@199] - Creating new log file: log.900003b70

      Snapshotting 2:
      2016-03-24 22:41:26,601 [myid:4] - INFO [Snapshot Thread:FileTxnSnapLog@253] - Snapshotting: 0x9000033a1 to /opt/dsware/agent/zk/data/version-2/snapshot.9000033a1
      2016-03-24 22:41:26,662 [myid:4] - INFO [SyncThread:4:FileTxnLog@199] - Creating new log file: log.9000049dd

      (4) finally, this node(leader server) zxid is behind zk c client, log as follow:

      2016-03-24 23:00:53,712 [myid:4] - WARN [NIOServerCxn.Factory:/192.168.50.24:10540:ZooKeeperServer@832] - Connection request from old client /192.168.50.23:35043; will be dropped if server is in r-o mode
      2016-03-24 23:00:53,713 [myid:4] - INFO [NIOServerCxn.Factory:/192.168.50.24:10540:ZooKeeperServer@851] - Refusing session request for client /192.168.50.23:35043 as it has seen zxid 0x900004e1f our last zxid is 0x9000033a1 client must try another server

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              Yongcheng Yongcheng Liu
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: