ZooKeeper
ZOOKEEPER-1449

Ephemeral znode not deleted after session has expired on one follower (quorum is in an inconsistent state)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      I've been running into this situation in our labs fairly regularly: we'll get a single follower that ends up in an inconsistent state with dangling ephemeral znodes. Here is all of the information that I have right now. Please ask if there is anything else that would be useful.

      Here is a quick snapshot of the state of the ensemble where you can see it is out of sync across several znodes:

      -bash-3.2$ echo srvr | nc il23n04sa-zk001 2181
      Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
      Latency min/avg/max: 0/7/25802
      Received: 64002
      Sent: 63985
      Outstanding: 0
      Zxid: 0x500000a41
      Mode: follower
      Node count: 497

      -bash-3.2$ echo srvr | nc il23n04sa-zk002 2181
      Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
      Latency min/avg/max: 0/13/79032
      Received: 74320
      Sent: 74276
      Outstanding: 0
      Zxid: 0x500000a41
      Mode: leader
      Node count: 493

      -bash-3.2$ echo srvr | nc il23n04sa-zk003 2181
      Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
      Latency min/avg/max: 0/2/25234
      Received: 187310
      Sent: 187320
      Outstanding: 0
      Zxid: 0x500000a41
      Mode: follower
      Node count: 493
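
      A quick way to spot this kind of node-count drift is to loop the same four-letter command over every member. A minimal sketch, using the hosts and port from this report:

      # Compare node counts across the ensemble; a mismatch flags an inconsistent member.
      for h in il23n04sa-zk001 il23n04sa-zk002 il23n04sa-zk003; do
        echo -n "$h: "
        echo srvr | nc "$h" 2181 | grep 'Node count'
      done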

      All of the zxids match up just fine, but zk001 has 4 more nodes in its node count than the other two (including the leader). When I use a ZooKeeper client to connect directly to zk001 I can see the following znode that should no longer exist:

      [zk: localhost:2181(CONNECTED) 0] stat /siri/Douroucouli/clients/il23n04sa-app004.siri.apple.com:38096
      cZxid = 0x40000001a
      ctime = Mon Apr 16 11:00:47 PDT 2012
      mZxid = 0x40000001a
      mtime = Mon Apr 16 11:00:47 PDT 2012
      pZxid = 0x40000001a
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x236bc504cb50002
      dataLength = 0
      numChildren = 0

      This znode does not exist when using the client to connect to either of the other two members of the ensemble.
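
      To confirm the discrepancy server by server, the same stat can be run against each member with the stock CLI. A minimal sketch (zkCli.sh ships with ZooKeeper; the hosts and znode path are the ones above):

      # Only zk001 should return stat output; the others should report the node does not exist.
      for h in il23n04sa-zk001 il23n04sa-zk002 il23n04sa-zk003; do
        echo "== $h =="
        ./bin/zkCli.sh -server "$h":2181 stat \
          /siri/Douroucouli/clients/il23n04sa-app004.siri.apple.com:38096
      done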

      I searched through the logs for that session ID and it looks like the session was established and closed cleanly. There were several leadership/quorum problems during the course of the session, but it should have been shut down properly. Neither the session nor the znode shows up in the "dump" on the leader, but the problem znode does show up in the "dump" on zk001.

      2012-04-16 11:00:47,637 - INFO [CommitProcessor:2:NIOServerCnxn@1580] - Established session 0x236bc504cb50002 with negotiated timeout 15000 for client /17.202.71.201:38971
      2012-04-16 11:20:59,341 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew session 0x236bc504cb50002 at /17.202.71.201:50841
      2012-04-16 11:20:59,342 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1580] - Established session 0x236bc504cb50002 with negotiated timeout 15000 for client /17.202.71.201:50841
      2012-04-16 11:21:09,343 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x236bc504cb50002, likely client has closed socket
      2012-04-16 11:21:09,343 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /17.202.71.201:50841 which had sessionid 0x236bc504cb50002
      2012-04-16 11:21:20,352 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1435] - Closed socket connection for client /17.202.71.201:38971 which had sessionid 0x236bc504cb50002
      2012-04-16 11:21:22,151 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew session 0x236bc504cb50002 at /17.202.71.201:38166
      2012-04-16 11:21:22,152 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1580] - Established session 0x236bc504cb50002 with negotiated timeout 15000 for client /17.202.71.201:38166
      2012-04-16 11:27:17,902 - INFO [ProcessThread:-1:PrepRequestProcessor@387] - Processed session termination for sessionid: 0x236bc504cb50002
      2012-04-16 11:27:17,904 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /17.202.71.201:38166 which had sessionid 0x236bc504cb50002
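
      The "dump" comparison described above can be scripted the same way. A sketch using the four-letter dump command and the session ID from these logs (dump is most informative on the leader, but here it was the follower zk001 that showed the dangling ephemeral):

      # Grep each server's dump output for the suspect session.
      for h in il23n04sa-zk001 il23n04sa-zk002 il23n04sa-zk003; do
        echo "== $h =="
        echo dump | nc "$h" 2181 | grep -i 0x236bc504cb50002
      done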

      The only way I've been able to recover from this situation is to shut down the problem follower, delete its snapshots and let it resync with the leader.
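
      That recovery amounts to something like the following. A sketch only: the dataDir (/data1/zookeeper) is taken from the server logs in this ticket, and the zkServer.sh paths will vary by install:

      # Force the bad follower to resync from the leader by wiping its local state.
      ./bin/zkServer.sh stop
      mv /data1/zookeeper/version-2 /data1/zookeeper/version-2.bad   # keep a copy for diagnosis
      ./bin/zkServer.sh start   # on startup the follower re-syncs from the leader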

      I'll attach the full log4j logs, the txn logs, the snapshots and the conf files for each member of the ensemble. Please let me know what other information is useful.

      Attachments

      1. snaps.tar (50 kB, Germán Blanco)

        Issue Links

          This issue is related to ZOOKEEPER-1777

          Activity

          Daniel Lord created issue -
          Daniel Lord added a comment -

          Log4j logs, txn logs, snapshots, etc.

          Daniel Lord made changes -
          Attachment zk.zip [ 12522997 ]
          Camille Fournier added a comment -

          Can you reproduce it with a more recent release? 3.3.3 is a bit old at this point and we've fixed a few things between that and 3.3.5.

          Daniel Lord added a comment -

          It's not something I know how to reproduce at the moment. I'm going to see if we can upgrade to something newer in our labs. The prime suspects look to be ZOOKEEPER-1208 and ZOOKEEPER-1367. It doesn't look to me like it's 1208, but I'm not positive it's not another corner case of that. It seems like it could be 1367 or close to it, but I'm not sure.

          I'll try to upgrade, and if it reproduces I'll attach more logs/traces.

          Daniel Lord added a comment -

          Oh, one other thing that might be worth noting: these are virtualized machines. Each of these VMs is only responsible for hosting ZooKeeper, but is it possible that the virtualization is causing problems? I don't have many details on the virtualization layer, but if it's something worth looking into I can dig further.

          Camille Fournier added a comment -

          Virtualization shouldn't be a problem. It's probably one of those bugs listed above, but if not we'll definitely want to track it down.

          Daniel Lord made changes -
          Attachment zk.zip [ 12522997 ]
          Patrick Hunt added a comment -

          I've looked into this issue and I believe with high confidence that this is ZOOKEEPER-1156 "Log truncation truncating log too much - can cause data loss", which is fixed in 3.3.4 and beyond.

          ZOOKEEPER-1156 can lead to over-truncation of the txnlog, which is what I'm seeing in the logs provided by Danny. Here's a corrupted record; when I look at the binary directly I can see that it is likely a result of file truncation:

          Output from LogFormatter (dumps the txnlog in ASCII form):

          4/16/12 11:00:54 AM PDT session 0x236bc504cb50005 cxid 0x1e zxid 0x40000009b create
          4/16/12 11:00:54 AM PDT session 0x36bc50622f0002 cxid 0x19 zxid 0x40000009c create
          4/16/12 11:00:54 AM PDT session 0x36bc50622f0004 cxid 0x19 zxid 0x40000009d create
          CRC doesn't match 3908382071 vs 1875708430 skipping to next entry
          Skipped 181778 bytes
          4/17/12 12:03:30 AM PDT session 0x236bc633856000f cxid 0x0 zxid 0x500000330 closeSession
          4/17/12 12:03:30 AM PDT session 0x236bc6338560005 cxid 0x0 zxid 0x500000331 closeSession
          4/17/12 12:03:32 AM PDT session 0x236bc6338560004 cxid 0x0 zxid 0x500000332 closeSession

          which corresponds to this in the server log:

          2012-04-17 00:03:47,077 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@306] - Truncating log to get in sync with the leader 0x50000032f
          2012-04-17 00:03:47,166 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /data1/zookeeper/version-2/snapshot.400000156

          I also applied the code that's fixed in ZOOKEEPER-1156, in its unfixed state (i.e., what's in 3.3.3), to one of the log files from one of the other servers. When I do this, applying the same flawed truncation operation, I see that the calculated offset is very close to the location above (in terms of the zxid that has been chopped/corrupted). This further convinces me that ZOOKEEPER-1156 is at fault here.

          As a result of this over-truncation we are missing a number of records, including the session close record which would normally cause the ephemeral node identified by Danny to be removed.
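
          For reference, the LogFormatter output above can be produced along these lines. A sketch only; the jar names and the log filename are illustrative for a 3.3.x install:

          # Dump a txnlog in ASCII form; substitute your own txnlog filename.
          java -cp zookeeper-3.3.3.jar:lib/log4j-1.2.15.jar \
            org.apache.zookeeper.server.LogFormatter /data1/zookeeper/version-2/log.400000001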

          Patrick Hunt added a comment -

          That said, there is a second problem I uncovered while diagnosing this issue.

          The reason no errors appear in the log4j log (versus the LogFormatter, which does its own iterating and CRC checks) is some buggy error handling in the FileTxnIterator. The "next" method catches EOF exceptions; in some cases this means we've read all there is to read, but in this case readTxnBytes returns null because the record was truncated. So we never get to the CRC check, and there is no "next log" (given there is only one txnlog). At this point next has cleared out a number of variables, but not "logStream". As a result the server fails to notice that it has a corrupted log, and it keeps a logStream that will append at the original end of the file (see my previous comment: the corrupted record, then we pick up again after ~180k bytes with zxid 0x500000330).

          We should fix this so that corrupted logs are detected correctly; I've opened ZOOKEEPER-1453 to track that issue.

          Danny, if this fixes your problem please close this jira. Thanks for the report!

          Patrick Hunt made changes -
          Assignee Patrick Hunt [ phunt ]
          Daniel Lord added a comment -

          I'm running 3.3.5 in a few of our labs now. I haven't had time yet to check their log files to confirm that there have been similar numbers of quorum events as before and that the issue doesn't reproduce at all. I'll check the logs and let things run a few more days to see if the issue goes away.

          Daniel Lord added a comment -

          The ops team has stabilized our labs and so I haven't had any leader election events in about a week. Pat is confident that we're looking at the same issue as the log truncation jira and that sounds reasonable to me.

          Daniel Lord made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Cannot Reproduce [ 5 ]
          Patrick Hunt added a comment -

          Famous last words. Please reopen if you do see it again. Thx.

          Germán Blanco made changes -
          Resolution Cannot Reproduce [ 5 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Germán Blanco added a comment -

          I would really like to see the logs for this JIRA.
          I have run into a very similar problem recently with ZooKeeper 3.4.5, but all I have now are the snapshot files; the transaction logs and log files have been removed automatically.
          I will dump the snapshots to a text file, change a few ids, and post them here.

          Germán Blanco added a comment -

          Snapshots of the three members of the ZooKeeper ensemble.

          Germán Blanco made changes -
          Attachment snaps.tar [ 12606062 ]
          Germán Blanco added a comment -

          The 8 missing znodes in "the follower that is not ok" were created at the end of epoch 1:
          < cZxid = 0x0000010000007d
          ...
          < cZxid = 0x000001000000a9
          while the complete list is:
          ...
          cZxid = 0x0000010000007b
          cZxid = 0x0000010000007d
          ...
          cZxid = 0x000001000000a9
          cZxid = 0x00000200000004
          ...
          4 of the 6 ephemeral owners of these nodes made modifications during epoch 2, which makes me think that this problem might not be related to session expiration, but more likely to synchronization after leader election.
          Even though some of the missing znodes were modified in epoch 2, "the follower that is not ok" didn't use this event to notice that something was wrong and, e.g., restart and synchronize via snapshot.
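
          (For anyone trying the same comparison: the snapshots can be dumped and diffed roughly as follows. A sketch; SnapshotFormatter is available in 3.4.x, and the jar and snapshot filenames here are illustrative.)

          CP="zookeeper-3.4.5.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:lib/log4j-1.2.15.jar"
          # Dump each snapshot to text, then diff on cZxid to find the missing znodes.
          java -cp "$CP" org.apache.zookeeper.server.SnapshotFormatter version-2/snapshot.follower > follower.txt
          java -cp "$CP" org.apache.zookeeper.server.SnapshotFormatter version-2/snapshot.leader > leader.txt
          diff follower.txt leader.txt | grep cZxid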

          Patrick Hunt added a comment -

          Germán Blanco please close this and open a new issue. It's fine to link back to this item, but really we should use a clean slate for new investigations/fixes/etc... Thanks.

          Germán Blanco added a comment -

          Wrongly reopened; creating a new issue instead.

          Germán Blanco made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Cannot Reproduce [ 5 ]
          Germán Blanco made changes -
          Link This issue is related to ZOOKEEPER-1777 [ ZOOKEEPER-1777 ]

            People

            • Assignee: Patrick Hunt
            • Reporter: Daniel Lord
            • Votes: 0
            • Watchers: 2
