ZooKeeper
ZOOKEEPER-919

Ephemeral nodes remain on one ensemble server after deliberate SIGKILL

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 3.3.1
    • Fix Version/s: 3.3.3, 3.4.0
    • Component/s: server
    • Labels:
      None
    • Environment:

      Linux CentOS 5.3 64bit, JDK 1.6.0-22
      SLES 11

      Description

      I was testing the stability of a ZooKeeper ensemble for production deployment, using a three-node ensemble configuration.
      In a loop, I kill/restart three ZooKeeper clients that each create one ephemeral node, and at the same time
      I killed the Java process on one of the ensemble servers (I don't know whether it was the leader). Then I restarted ZooKeeper on that server.

      It turns out that on two of the ensemble servers all the ephemeral nodes are gone (as they should be), but on the newly restarted
      ZooKeeper server the two old ephemeral nodes remained. The server did not come back in standalone mode, since new ephemeral
      nodes get created on all ensemble servers.
      I captured the log.
      I captured the log.

      2010-11-04 17:48:50,201 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:NIOServerCnxn$Factory@250] - Accepted socket connection from /10.25.131.21:11191
      2010-11-04 17:48:50,202 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:NIOServerCnxn@776] - Client attempting to establish new session at /10.25.131.21:11191
      2010-11-04 17:48:50,203 - INFO [CommitProcessor:1:NIOServerCnxn@1579] - Established session 0x12c160c31fc000b with negotiated timeout 30000 for client /10.25.131.21:11191
      2010-11-04 17:48:50,206 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:NIOServerCnxn@633] - EndOfStreamException: Unable to read additional data from client sessionid 0x12c160c31fc000b, likely client has closed socket
      2010-11-04 17:48:50,207 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288:NIOServerCnxn@1434] - Closed socket connection for client /10.25.131.21:11191 which had sessionid 0x12c160c31fc000b
      2010-11-04 17:48:50,207 - ERROR [CommitProcessor:1:NIOServerCnxn@444] - Unexpected Exception:
      java.nio.channels.CancelledKeyException
      at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
      at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
      at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:417)
      at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1508)
      at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
      at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

      1. zk.patch
        2 kB
        Vishal Kher
      2. logs3.tar.gz
        62 kB
        Vishal Kher
      3. logs2.tar.gz
        128 kB
        Vishal Kher
      4. logs.tar.gz
        510 kB
        Vishal Kher

        Issue Links

          Activity

          Vishal Kher added a comment -

          1. Create 3 node cluster
          #Tue Dec 07 03:08:43 UTC 2010
          server.2=10.150.27.112\:2888\:3888
          server.1=10.150.27.111\:2888\:3888
          server.0=10.150.27.110\:2888\:3888
          initLimit=5
          syncLimit=2
          snapCount=10000
          clientPort=2181
          tickTime=2000
          dataDir=/var/zookeeper

          2. run attached test.sh

          zkmgtserver.jar in test.sh is our jar file that starts ZkServer. On
          10.150.27.112 we made zkmgtserver also start a client that creates
          the znode /Online. I was able to reproduce the problem after a few
          reboots of zkmgtserver on .112. Rebooting zkmgtserver restarts both
          zkserver and the client.

          File stat.out shows the output of stat on /Online. We can see that
          after rebooting zkmgtserver (which restarts zkserver and zkclient),
          the old /Online znode reappeared.

          log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
          log4j:WARN Please initialize the log4j system properly.
          cZxid = 0x50000054a
          ctime = Tue Dec 07 03:07:59 UTC 2010
          mZxid = 0x50000054a
          mtime = Tue Dec 07 03:07:59 UTC 2010
          pZxid = 0x50000054a
          cversion = 0
          dataVersion = 0
          aclVersion = 0
          ephemeralOwner = 0x22cbeccc7900000
          dataLength = 54
          numChildren = 0

          log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
          log4j:WARN Please initialize the log4j system properly.
          cZxid = 0x500000557
          ctime = Tue Dec 07 03:08:21 UTC 2010
          mZxid = 0x500000557
          mtime = Tue Dec 07 03:08:21 UTC 2010
          pZxid = 0x500000557
          cversion = 0
          dataVersion = 0
          aclVersion = 0
          ephemeralOwner = 0x22cbecd1bb60000
          dataLength = 54
          numChildren = 0

          log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
          log4j:WARN Please initialize the log4j system properly.
          cZxid = 0x50000054a <==================== OLD zxid
          ctime = Tue Dec 07 03:07:59 UTC 2010
          mZxid = 0x50000054a
          mtime = Tue Dec 07 03:07:59 UTC 2010
          pZxid = 0x50000054a
          cversion = 0
          dataVersion = 0
          aclVersion = 0
          ephemeralOwner = 0x22cbeccc7900000
          dataLength = 54
          numChildren = 0

          Similarly, we can see from ZK-10.150.27.112-zookeeper.log that
          10.150.27.112 reapplied an old transaction to create the znode.

          ZK-10.150.27.110-zookeeper.log:2010-12-07 03:07:59,434 - DEBUG [QuorumPeer:/0.0.0.0:2181:CommitProcessor@159] - Committing request:: sessionid:0x22cbeccc7900000 type:create cxid:0x3 zxid:0x50000054a txntype:1 reqpath:n/a
          ZK-10.150.27.110-zookeeper.log:2010-12-07 03:07:59,434 - DEBUG [CommitProcessor:0:FinalRequestProcessor@78] - Processing request:: sessionid:0x22cbeccc7900000 type:create cxid:0x3 zxid:0x50000054a txntype:1 reqpath:n/a
          ZK-10.150.27.111-zookeeper.log:2010-12-07 03:07:59,881 - DEBUG [ProcessThread:-1:CommitProcessor@169] - Processing request:: sessionid:0x22cbeccc7900000 type:create cxid:0x3 zxid:0x50000054a txntype:1 reqpath:n/a
          ZK-10.150.27.111-zookeeper.log:2010-12-07 03:07:59,881 - DEBUG [ProcessThread:-1:Leader@647] - Proposing:: sessionid:0x22cbeccc7900000 type:create cxid:0x3 zxid:0x50000054a txntype:1 reqpath:n/a
          ZK-10.150.27.111-zookeeper.log:2010-12-07 03:07:59,883 - DEBUG [LearnerHandler-/10.150.27.112:56606:CommitProcessor@159] - Committing request:: sessionid:0x22cbeccc7900000 type:create cxid:0x3 zxid:0x50000054a txntype:1 reqpath:n/a
          ZK-10.150.27.111-zookeeper.log:2010-12-07 03:07:59,883 - DEBUG [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: sessionid:0x22cbeccc7900000 type:create cxid:0x3 zxid:0x50000054a txntype:1 reqpath:n/a
          ZK-10.150.27.112-zookeeper.log:2010-12-07 03:08:00,481 - DEBUG [QuorumPeer:/0.0.0.0:2181:CommitProcessor@159] - Committing request:: sessionid:0x22cbeccc7900000 type:create cxid:0x3 zxid:0x50000054a txntype:1 reqpath:n/a
          ZK-10.150.27.112-zookeeper.log:2010-12-07 03:08:00,481 - DEBUG [CommitProcessor:2:FinalRequestProcessor@78] - Processing request:: sessionid:0x22cbeccc7900000 type:create cxid:0x3 zxid:0x50000054a txntype:1 reqpath:/Online
          ZK-10.150.27.112-zookeeper.log:2010-12-07 03:08:00,481 - DEBUG [CommitProcessor:2:FinalRequestProcessor@160] - sessionid:0x22cbeccc7900000 type:create cxid:0x3 zxid:0x50000054a txntype:1 reqpath:/Online <===================== OLD transaction

          The client used in the test was simple: on .112 it opens a connection
          to the local ZkServer and creates the /Online znode if it does not
          exist or has been deleted. It creates a new session to the local ZK
          server after the server is rebooted (the session ID is not reused).
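As an aside for readers following the identifiers in these logs: a zxid packs the 32-bit leader epoch in its high bits and a per-epoch counter in its low bits, and (in the 3.3.x sources) a session id carries the id of the server that created it in its high byte. A minimal sketch in plain Python; the helper names here are ours, for illustration only:

```python
# Decode the identifiers that appear in the logs above.
# Layout as in ZooKeeper 3.3.x: zxid = (epoch << 32) | counter,
# session id = (server id << 56) | timestamp-derived bits.
# Helper names are ours, not ZooKeeper API.

def zxid_epoch(zxid):
    return zxid >> 32

def zxid_counter(zxid):
    return zxid & 0xFFFFFFFF

def session_server_id(session_id):
    return session_id >> 56

# The create txn for /Online above: epoch 5, counter 0x54a.
assert zxid_epoch(0x50000054A) == 5
assert zxid_counter(0x50000054A) == 0x54A

# Both stale ephemeralOwner sessions were handed out by server.2,
# consistent with the client being started on 10.150.27.112.
assert session_server_id(0x22CBECCC7900000) == 2
assert session_server_id(0x22CC91CAECE0000) == 2
```

This makes it quick to tell, from a stat dump alone, which epoch a znode was created in and which server minted its owner session.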

          Vishal Kher added a comment -

          Forgot to mention that /Online is present only on .112.

          Vishal Kher added a comment -

          Please ignore my comment about "10.150.27.112 reapplied an old transaction to create znode". Looking at the code the messages are seen as expected.

          Vishal Kher added a comment -

          The default debug messages are not sufficient to debug this bug.
          Attaching more debug logs; I added extra debug messages (starting with VK DEBUG).

          What I am seeing is that after a restart, while loading the snapshot, the server still finds an old znode in the snapshot. This happens even if the snapshot was taken after the transaction that deletes the session (and its ephemeral znode) from the ZooKeeper data tree.

          For example, we can see that the ephemeral znode (/fooOnline) is deleted by transaction 0xf0000001a. See faulty-zookeeper.log.

          2010-12-09 03:11:50,609 - TRACE [CommitProcessor:2:ZooTrace@89] - :Esessionid:0x22cc91caece0000 type:closeSession cxid:0x0 zxid:0xf0000001a txntype:-11 reqpath:n/a
          2010-12-09 03:11:50,609 - DEBUG [CommitProcessor:2:FinalRequestProcessor@103] - VK DEBUG: sessionid:0x22cc91caece0000 type:closeSession cxid:0x0 zxid:0xf0000001a txntype:-11 reqpath:n/a
          2010-12-09 03:11:50,609 - DEBUG [CommitProcessor:2:DataTree@751] - VK DEBUG: header 156721212080128000,0,64424509466,1291864306011,-11
          2010-12-09 03:11:50,618 - DEBUG [CommitProcessor:2:DataTree@825] - VK DEBUG: Found ephemeral node /fooOnline for session 0x22cc91caece0000
          2010-12-09 03:11:50,619 - DEBUG [CommitProcessor:2:DataTree@839] - Deleting ephemeral node /fooOnline for session 0x22cc91caece0000

          However, after several reboots, the znode still lingers in the data tree.

          2010-12-09 03:12:11,762 - INFO [Thread-2:FileSnap@82] - Reading snapshot /var/zookeeper/version-2/snapshot.f0000001d (zxid f0000001d)
          [...]
          2010-12-09 03:12:11,768 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 22cc91caece0000 with timeout: 9000
          2010-12-09 03:12:11,771 - DEBUG [Thread-2:DataTree@1088] - VK DEBUG Adding path /fooOnline for 22cc91caece0000

          2010-12-09 03:13:22,573 - INFO [QuorumPeer:/0.0.0.0:2181:FileSnap@82] - Reading snapshot /var/zookeeper/version-2/snapshot.f0000002a (zxid f0000002a)
          [...]
          2010-12-09 03:13:22,579 - TRACE [QuorumPeer:/0.0.0.0:2181:ZooTrace@70] - loadData — session in archive: 22cc91caece0000 with timeout: 9000
          [...]
          2010-12-09 03:13:22,581 - DEBUG [QuorumPeer:/0.0.0.0:2181:DataTree@1088] - VK DEBUG Adding path /fooOnline for 22cc91caece0000
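For readers unfamiliar with the bookkeeping these logs refer to: the server's DataTree tracks, per session, the ephemeral paths that session owns, and killSession (run when the closeSession txn is applied) deletes all of them. A toy model, with our own names and greatly simplified, of the invariant the logs show being violated: a snapshot serialized after the closeSession txn should no longer contain the node.

```python
# Toy model of DataTree ephemeral bookkeeping (hypothetical names),
# illustrating why a snapshot written after closeSession must not
# contain that session's ephemeral nodes.
class ToyDataTree:
    def __init__(self):
        self.nodes = {}       # path -> owner session id (0 = persistent)
        self.ephemerals = {}  # session id -> set of owned paths

    def create(self, path, session=0):
        self.nodes[path] = session
        if session:
            self.ephemerals.setdefault(session, set()).add(path)

    def kill_session(self, session):
        # mirrors the killSession step: delete every ephemeral it owns
        for path in self.ephemerals.pop(session, set()):
            del self.nodes[path]

    def snapshot(self):
        return dict(self.nodes)  # a copy taken at this instant

t = ToyDataTree()
t.create("/fooOnline", session=0x22CC91CAECE0000)
t.kill_session(0x22CC91CAECE0000)
assert "/fooOnline" not in t.snapshot()
```

In this model the node cannot survive into any snapshot taken after the kill; the faulty server nonetheless reloads /fooOnline, which points at the snapshot/log machinery rather than the data tree itself.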

          Vishal Kher added a comment -

          Forgot to attach the output of stat. The session ids are useful for debugging.

          mZxid = 0xf0000000a
          mtime = Thu Dec 09 03:11:05 UTC 2010
          pZxid = 0xf0000000a
          cversion = 0
          dataVersion = 0
          aclVersion = 0
          ephemeralOwner = 0x22cc91c5a700000
          dataLength = 54
          numChildren = 0

          log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
          log4j:WARN Please initialize the log4j system properly.
          cZxid = 0xf00000012
          ctime = Thu Dec 09 03:11:27 UTC 2010
          mZxid = 0xf00000012
          mtime = Thu Dec 09 03:11:27 UTC 2010
          pZxid = 0xf00000012
          cversion = 0
          dataVersion = 0
          aclVersion = 0
          ephemeralOwner = 0x22cc91caece0000
          dataLength = 54
          numChildren = 0

          log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
          log4j:WARN Please initialize the log4j system properly.
          cZxid = 0xf0000001f
          ctime = Thu Dec 09 03:11:48 UTC 2010
          mZxid = 0xf0000001f
          mtime = Thu Dec 09 03:11:48 UTC 2010
          pZxid = 0xf0000001f
          cversion = 0
          dataVersion = 0
          aclVersion = 0
          ephemeralOwner = 0x22cc91d033c0000
          dataLength = 54
          numChildren = 0

          log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ZooKeeper).
          log4j:WARN Please initialize the log4j system properly.
          cZxid = 0xf00000012
          ctime = Thu Dec 09 03:11:27 UTC 2010
          mZxid = 0xf00000012
          mtime = Thu Dec 09 03:11:27 UTC 2010
          pZxid = 0xf00000012
          cversion = 0
          dataVersion = 0
          aclVersion = 0
          ephemeralOwner = 0x22cc91caece0000
          dataLength = 54
          numChildren = 0

          Vishal Kher added a comment -

          Question:

          • Does snapshot.f0000001d contain all committed changes to the data tree up to f0000001d? I know that a snapshot can contain incomplete changes, but I would expect incomplete changes only for transactions after f0000001d.

          Note that this bug also introduces a "fork" in the ZK data tree. If I re-create the ephemeral znode that is not getting deleted from the faulty ZK server, the faulty server keeps seeing the old znode even though the rest of the servers see the new one.

          Benjamin Reed added a comment -

          Yes, snapshot.f0000001d has all changes up to f0000001d, plus some subset of subsequent changes.

          The snapshots do contain exact copies of the session table.

          I think what is going on is that you have a follower with the old ephemeral node associated with session X. If the leader doesn't believe that X is a valid session, it will never be expired, and the follower will forever think that X is active. The thing I don't get is that if you create the znode again on the leader and then delete it, it should disappear from all servers.

          Vishal Kher added a comment -

          Hi Ben,

          The leader already issued a closeSession transaction (see below) for the session in question. All servers (including the faulty one) processed this transaction. The messages shown below are from the faulty node. That's why I think this is related to data tree / snapshot management. Let me know if you think otherwise. Thanks.

          2010-12-09 03:11:50,609 - DEBUG [CommitProcessor:2:FinalRequestProcessor@78] - Processing request:: sessionid:0x22cc91caece0000 type:closeSession cxid:0x0 zxid:0xf0000001a txntype:-11 reqpath:n/a
          2010-12-09 03:11:50,609 - TRACE [CommitProcessor:2:ZooTrace@89] - :Esessionid:0x22cc91caece0000 type:closeSession cxid:0x0 zxid:0xf0000001a txntype:-11 reqpath:n/a
          2010-12-09 03:11:50,609 - DEBUG [CommitProcessor:2:FinalRequestProcessor@103] - VK DEBUG: sessionid:0x22cc91caece0000 type:closeSession cxid:0x0 zxid:0xf0000001a txntype:-11 reqpath:n/a
          2010-12-09 03:11:50,609 - DEBUG [CommitProcessor:2:DataTree@751] - VK DEBUG: header 156721212080128000,0,64424509466,1291864306011,-11
          txn null zxid=f0000001a
          2010-12-09 03:11:50,610 - DEBUG [CommitProcessor:2:DataTree@1217] - VK DEBUG: Stack trace java.lang.Thread.getStackTrace(Thread.java:1436)org.apache.zookeeper.server.DataTree.logStack(DataTree.java:1212)org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:752)org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:299)org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:104)org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
          2010-12-09 03:11:50,618 - DEBUG [CommitProcessor:2:DataTree@820] - VK DEBUG: in killSession 22cc91caece0000 zxid f0000001a
          2010-12-09 03:11:50,618 - DEBUG [CommitProcessor:2:DataTree@1217] - VK DEBUG: Stack trace java.lang.Thread.getStackTrace(Thread.java:1436)org.apache.zookeeper.server.DataTree.logStack(DataTree.java:1212)org.apache.zookeeper.server.DataTree.killSession(DataTree.java:821)org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:796)org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:299)org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:104)org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
          2010-12-09 03:11:50,618 - DEBUG [CommitProcessor:2:DataTree@825] - VK DEBUG: Found ephemeral node /fooOnline for session 0x22cc91caece0000
          2010-12-09 03:11:50,619 - DEBUG [CommitProcessor:2:DataTree@839] - Deleting ephemeral node /fooOnline for session 0x22cc91caece0000

          Vishal Kher added a comment -

          Hi Ben,

          I think the ZooKeeper server is dropping multiple transactions after
          it rejoins the cluster. As a result, it drops the transaction that
          deletes the ephemeral znode (along with many other transactions).

          I have attached the logs. It would be good to track the
          createSession/closeSession/create transactions for all sessions in
          the sessions.txt file.

          Log entries showing that the server is dropping transactions:

          2010-12-14 05:59:19,144 - INFO [Thread-2:FileSnap@82] - Reading snapshot /var/zookeeper/version-2/snapshot.200000016 ===> ***** THIS DOES NOT CONTAIN DT till 200000016 ****
          2010-12-14 05:59:19,147 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 12ce374d71f0001 with timeout: 30000
          2010-12-14 05:59:19,148 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 22ce37526bd0001 with timeout: 30000
          2010-12-14 05:59:19,148 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 12ce374d71f0000 with timeout: 30000
          2010-12-14 05:59:19,148 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 22ce37526bd0002 with timeout: 30000
          2010-12-14 05:59:19,148 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 22ce37526bd0003 with timeout: 30000
          2010-12-14 05:59:19,149 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 22ce3757b780003 with timeout: 30000
          2010-12-14 05:59:19,149 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 22ce3757b780002 with timeout: 30000
          2010-12-14 05:59:19,149 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 22ce3757b780001 with timeout: 30000
          2010-12-14 05:59:19,150 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 22ce3757b780000 with timeout: 9000
          2010-12-14 05:59:19,150 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 2ce374f6b80000 with timeout: 30000
          2010-12-14 05:59:19,150 - TRACE [Thread-2:ZooTrace@70] - loadData — session in archive: 2ce374f6b80001 with timeout: 30000
          2010-12-14 05:59:19,156 - DEBUG [Thread-2:DataTree@1072] - VK DEBUG: in desrialize
          2010-12-14 05:59:19,156 - DEBUG [Thread-2:DataTree@1096] - VK DEBUG Adding path /Online for 22ce3757b780000
          2010-12-14 05:59:19,157 - DEBUG [Thread-2:DataTree@1226] - VK DEBUG: Stack trace java.lang.Thread.getStackTrace(Thread.java:1436)org.apache.zookeeper.server.DataTree.logStack(DataTree.java:1221)org.apache.zookeeper.server.DataTree.deserialize(DataTree.java:1097)org.apache.zookeeper.server.util.SerializeUtils.deserializeSnapshot(SerializeUtils.java:95)org.apache.zookeeper.server.persistence.FileSnap.deserialize(FileSnap.java:126)org.apache.zookeeper.server.persistence.FileSnap.deserialize(FileSnap.java:86)org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:124)org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:197)org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:397)org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:143)org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:103)com.vmware.sva.zkmgmtserver.ZkQuorumPeer.access$200(ZkQuorumPeer.java:26)com.vmware.sva.zkmgmtserver.ZkQuorumPeer$ZooKeeperInstance.run(ZkQuorumPeer.java:78)
          2010-12-14 05:59:19,157 - DEBUG [Thread-2:DataTree@1230] - Printing DataTree
          2010-12-14 05:59:19,158 - DEBUG [Thread-2:DataTree@1232] - Path=/ Node=org.apache.zookeeper.server.DataNode@3ba42792
          2010-12-14 05:59:19,158 - DEBUG [Thread-2:DataTree@1232] - Path=/Online Node=org.apache.zookeeper.server.DataNode@162200d5
          2010-12-14 05:59:19,158 - DEBUG [Thread-2:DataTree@1232] - Path= Node=org.apache.zookeeper.server.DataNode@3ba42792
          2010-12-14 05:59:19,158 - DEBUG [Thread-2:DataTree@1232] - Path=/zookeeper Node=org.apache.zookeeper.server.DataNode@1264ab4d
          2010-12-14 05:59:19,159 - DEBUG [Thread-2:DataTree@1232] - Path=/zookeeper/quota Node=org.apache.zookeeper.server.DataNode@f2a55aa
          2010-12-14 05:59:19,162 - DEBUG [Thread-2:FileTxnLog$FileTxnIterator@527] - Created new input stream /var/zookeeper/version-2/log.200000011 ==> ****** THIS DOES NOT CONTAIN TX FROM 200000011 - 200000016 *********
          2010-12-14 05:59:19,167 - DEBUG [Thread-2:FileTxnLog$FileTxnIterator@530] - Created new input archive /var/zookeeper/version-2/log.200000011
          2010-12-14 05:59:19,168 - DEBUG [Thread-2:FileTxnSnapLog@178] - VK debug: 156750179351724034,0,8589934614,1292306326006,-11

          The problem is that snapshot.200000016 does not actually contain all
          transactions up to zxid 200000016. Log log.200000011 contains
          transactions 200000011 - 200000015, but the server ignores them
          because it assumes snapshot.200000016 already includes them. As a
          result, those transactions are dropped.

          I think the zxid management during startup is buggy. Possibly one or
          more of the following issues is causing trouble:

          1. Race conditions / an incorrect zxid received from the leader, at two places:

          a) newLeaderZxid differs from the zxid received on the QuorumPacket in Learner.syncWithLeader():

          zk.getZKDatabase().setlastProcessedZxid(newLeaderZxid);

          The dataTree.lastProcessedZxid is probably being set incorrectly
          (higher than the correct zxid). This causes transactions between the
          correct zxid and lastProcessedZxid to be dropped during log replay in
          FileTxnSnapLog.java. I think this is the main cause.

          b) The Learner does not wait until it receives the entire snapshot from
          the leader (see snapshot.200000016).

          -Vishal
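          The skip-on-replay behavior described above can be reduced to a toy model. This is illustrative Java written for this write-up, not the actual FileTxnSnapLog code; the zxid constants echo the log excerpt (log.200000011 holding transactions through 200000015 while the snapshot is tagged 200000016):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the replay behavior described above -- not ZooKeeper
// source. Replay skips every logged transaction at or below the zxid
// the snapshot claims to cover, so an inflated snapshot zxid silently
// drops valid transactions that exist only in the transaction log.
public class ReplaySketch {
    static Set<Long> replay(long snapshotZxid, List<Long> loggedZxids) {
        Set<Long> applied = new TreeSet<>();
        for (long zxid : loggedZxids) {
            if (zxid <= snapshotZxid) {
                continue; // assumed to be in the snapshot already -- the bug
            }
            applied.add(zxid);
        }
        return applied;
    }

    public static void main(String[] args) {
        // The log holds transactions 0x11 through 0x15.
        List<Long> log = Arrays.asList(0x11L, 0x12L, 0x13L, 0x14L, 0x15L);
        // Honest snapshot taken at 0x10: all five transactions replay.
        System.out.println(replay(0x10L, log)); // [17, 18, 19, 20, 21]
        // Inflated snapshot tagged 0x16: nothing replays; all are lost.
        System.out.println(replay(0x16L, log)); // []
    }
}
```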

          Vishal Kher added a comment -

          Hi,

          I am attaching a diff that gets around this bug. I don't think this is a real fix. Hopefully, it will help to identify the final fix.

          -Vishal

          Camille Fournier added a comment -

          OK, I managed to figure out what is going on here, or at least one thing that could be wrong.
          Basically, when a follower starts up, it reads the latest snap file and sets dataTree.lastProcessedZxid to the zxid specified by that file. This is not the right thing to do. When you call snapLog.restore (before you connect to the leader), you both deserialize the latest snapshot AND play through the transaction logs. While playing through the transaction logs, you find the actual last transaction you have recorded, which can be less than the snapshot's zxid (explained below). That zxid is returned by the call to restore, but we don't actually use it; we always use the zxid from the chosen snap file (see FileSnap.deserialize).

          So, follower crashes at zxid N-5, starts back up, replays state to N-5, and connects to the leader.

          When it connects to the leader, it gets the leader's zxid of N. In Learner.syncWithLeader, it sees that it will get a diff, and sets its lastProcessedZxid to N, then acks the leader and starts up. The leader queues up the transactions N-4...N and sends those to the follower, followed finally by the UPTODATE message.

          So what does the Follower now see? Well it sees the transactions in order, PROPOSAL, then COMMIT, for all of N-4 to N, followed by the UPTODATE. However, the Follower processes the PROPOSAL and COMMIT packets in separate threads, so it pulls those messages off the wire fast, queues them to be processed by the Sync/Commit processor threads, and keeps going until it hits the UPTODATE message. At this point, it writes a snapshot with Zxid N. We have NO GUARANTEE that we have actually processed transactions N-4 to N, and as Vishal has seen, sometimes the thread that writes the snapshot gets executed before any or all of the commit processor threads have actually run.

          But generally this is OK, because you are also writing transactions N-4 to N to your transaction log: you still process them, write them to the log, and as long as you have processed all of them before your Follower goes down again, when you recover they will be applied to the snapshot and you will be fine. HOWEVER, if you kill the Follower after it has written snapshot.N but before it has processed transactions N-4 to N and written them to its log, then when you restore the Follower it will believe it is at zxid N, it won't ever see those transactions, and it will never delete those nodes.

          I believe one solution might be, when you are restoring the database on startup to set the lastProcessedZxid to the zxid of the last transaction in your actual tx log, NOT the zxid of the snapshot file. However, that might make restoring from copied snapshots break, so I think we should consider other solution options. But setting the zxid of the follower to be the server's zxid before it processes all the missing transactions is definitely the wrong thing to do.
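          The window described in this comment can be sketched as a toy model. This is illustrative Java, not the real Learner/CommitProcessor code, and it is single-threaded for clarity; method names such as drainCommits are invented for the sketch:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the race described above -- not ZooKeeper source. The
// follower adopts the leader's zxid N before the queued proposals are
// applied, so a snapshot tagged with snapshotZxid() at that moment
// claims state the data tree does not actually have yet.
public class SyncRaceSketch {
    private long lastProcessedZxid;
    private final Deque<Long> pendingCommits = new ArrayDeque<>();
    private final Set<Long> appliedToTree = new TreeSet<>();

    // Models the eager zxid update in syncWithLeader(): adopt N first,
    // hand the missing transactions to the commit processor's queue.
    void syncWithLeader(long leaderZxid, List<Long> missingTxns) {
        lastProcessedZxid = leaderZxid;   // set before anything is applied
        pendingCommits.addAll(missingTxns);
        // UPTODATE arrives here; the snapshot gets tagged with
        // lastProcessedZxid even though pendingCommits may still be full.
    }

    long snapshotZxid() { return lastProcessedZxid; }

    Set<Long> applied() { return appliedToTree; }

    // Models the commit-processor thread catching up afterwards.
    void drainCommits() {
        while (!pendingCommits.isEmpty()) {
            appliedToTree.add(pendingCommits.poll());
        }
    }
}
```

          A crash between syncWithLeader() and drainCommits() corresponds to the lost-transactions scenario: the restored follower trusts snapshot.N and never revisits N-4 to N.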

          Flavio Junqueira added a comment -

          Is ZOOKEEPER-882 related? To me it sounds like...

          Camille Fournier added a comment -

          I don't think so, correct me if I am wrong:
          Starting the log from the last txn in the snapshot (which is set in this case by the actual file name of the snapshot, not the last transaction it actually contains) will definitely not fix this problem since the snapshot does not contain the missing transactions. In fact, if you did this, you would end up even worse off since you would expand the window in which the follower could crash and lose those transactions (not only before they are logged, but any time before a snapshot that actually contains a record of the missing transactions is taken).
          In the patch submitted on 882, we still don't actually use that last transaction id from the log file anywhere, so that fix is pretty meaningless for the actual functioning of the system as far as I can tell. If we were using it, we probably wouldn't have this problem.

          Benjamin Reed added a comment -

          camille, you are brilliant! excellent catch.

          i think the main problem is zk.getZKDatabase().setlastProcessedZxid(newLeaderZxid) in syncWithLeader().

          we need to fix that code. we also have the problem that we are adding transactions to the log without having the respective snapshot on disk.

          i think we need to process the requests we get between the snapshot and UPTODATE in that method in the same thread, so that we get a consistent snapshot on disk tagged with the right transaction id.

          the tricky part is to build a test case for this.

          Vishal Kher added a comment -

          Were you guys following the comments that I posted earlier? Any comments on the attached patch?

          Vishal Kher added a comment -

          Hi,

          Just to clarify, the patch that I attached earlier (zk.patch) sets the zxid correctly. I have tested it by repeating my test (mentioned above) 100 times.
          However, I am not 100% sure that this is the final patch. I think the patch may not be optimal; it might cause some extra transactions to be reapplied. It would be nice if someone could look at it.

          Thanks

          Benjamin Reed added a comment -

          where is the patch vishal?

          Vishal Kher added a comment -

          Hi Ben,

          Since I wasn't sure, I had submitted the diff (zk.patch) as an attachment (please see my comments on 15/Dec/10 04). Thanks.

          Camille Fournier added a comment -

          Vishal, I think your patch is ok, but given that we know what is likely to cause this issue, can you think of a test to verify the fix?

          Benjamin Reed added a comment -

          vishal your patch is a start, but i think the issue identified by camille is a bit more profound.

          i'm going to work on the test case for this issue and the other that camille identified. in the meantime it would be interesting to see if just doing what you have in the patch makes your problem go away.

          Chang Song added a comment -

          Thank you very much, first of all.
          I am in the process of testing Vishal's patch.
          I will report back the result soon.

          Vishal Kher added a comment -

          Hi Camille,

          I forgot to mention about this earlier. In your description above you
          mentioned "But generally, this is ok, because you are also writing a
          transaction log of transactions N-4 to N, so you still process them,
          write them to your transaction log, and as long as you have processed
          all of them before your Follower goes down again when you recover they
          will be applied to the snapshot and you will be fine. HOWEVER, if you
          kill the Follower after it has written snapshot.N and before it has
          processed transactions N-4 to N and written them to its log, when you
          restore the Follower it will believe that it is at Zxid N, it won't
          ever see those transactions, and it will never delete those nodes."

          I don't think this is true. I don't think that we need to kill the
          follower after creating snapshot.N file and before writing N to the
          log. We have a bug even without this failure (and my test was not
          failing the Follower during this window). FileTxnSnapLog.restore()
          does not replay transactions that are < zxid of the snapshot. As a
          result, we have this bug even without the failure that you
          mentioned. Let me know if you think otherwise.

          My change takes a safe approach and ensures that the zxid of the
          snapshot file at the follower is equal to the zxid of the last
          transaction processed by the follower.

          Ideally, I think we need to change Follower.processPacket() to make
          sure that, while handling UPTODATE, all the necessary transactions
          are committed locally before taking the snapshot.

          Thanks.

          Vishal Kher added a comment -

          Ben,

          I think the patch does address my scenario.
          I have tested it by repeating my test (mentioned above) 100 times.
          You are right, this patch does not address the problem Camille pointed
          out in https://issues.apache.org/jira/browse/ZOOKEEPER-962.
          These are independent problems.

          Camille Fournier added a comment -

          Where in the code do you think we are not processing transactions that are < the zxid of the snapshot? FileTxnSnapLog.restore() will grab the transaction logs with zxids greater than or equal to the zxid of the highest snapshot, plus the last file with a zxid less than that, and process the transactions in them whether or not they are higher than highestZxid. Specifically, the if/else in restore at line 137 does not break out if hdr.getZxid() < highestZxid; it just logs the occurrence, and the transaction is still processed from those lower-zxid logs.
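          The loop behavior being described can be reduced to a toy version (illustrative Java, not the actual restore() code; applyAll is a name invented for this sketch):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy version of the loop behavior described above -- not the actual
// FileTxnSnapLog.restore() code. A zxid below highestZxid is noted but
// does not break the loop; the transaction is still applied. Any
// skipping therefore has to come from where the read cursor starts,
// not from this loop.
public class RestoreLoopSketch {
    static List<Long> applyAll(Iterator<Long> txnLog, long highestZxid) {
        List<Long> applied = new ArrayList<>();
        while (txnLog.hasNext()) {
            long zxid = txnLog.next();
            if (zxid < highestZxid) {
                // just log the occurrence; fall through and apply anyway
            }
            applied.add(zxid);
        }
        return applied;
    }
}
```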

          Vishal Kher added a comment -

          Sorry, my explanation wasn't right. I meant to say that restore() starts from txnLog.read(dt.lastProcessedZxid). So if lastProcessedZxid is set too high (as in this bug), then even if the right transactions are in the log, they won't get replayed.
          I was seeing this during debugging, and I didn't have to crash the follower in that window (after creating the snapshot.N file and before writing N to the log).

          Vishal Kher added a comment -

          Hi Chang,

          Can you please see which bug you are running into by trying out the patch? Thanks.

          Chang Song added a comment -

          Vishal,
          I have been too busy to find a chance to try out the patch, because our staging ZK ensemble has to stay up 100%.
          I will probably get a chance this week or next to try out the patch, and I will definitely report back the result.
          Thank you very, very much.

          Vishal Kher added a comment -

          Ignore the attached patch. It fixes the znode inconsistency problem, but causes a server to reject connections from clients.

          Camille Fournier added a comment -

          Can you try the patch submitted for ZOOKEEPER-962? It should fix these issues.

          Chang Song added a comment -

          We are not able to reproduce the original scenario with the patch.
          We will update this bug once we have the final result.

          Thank you very much.

          Benjamin Reed added a comment -

          fix included in ZOOKEEPER-962


            People

            • Assignee: Unassigned
            • Reporter: Chang Song
            • Votes: 0
            • Watchers: 2