ZOOKEEPER-1367

Data inconsistencies and unexpired ephemeral nodes after cluster restart

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.4.2
    • Fix Version/s: 3.4.3, 3.3.5, 3.5.0
    • Component/s: server
    • Labels:
      None
    • Environment:

      Debian Squeeze, 64-bit

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Fixes data inconsistencies and unexpired ephemeral nodes after cluster restart.

      Description

      In one of our tests, we have a cluster of three ZooKeeper servers. We kill all three, and then restart just two of them. Sometimes we notice that on one of the restarted servers, ephemeral nodes from previous sessions do not get deleted, while on the other server they do. We are effectively running 3.4.2, though technically we are running 3.4.1 with the patch for ZOOKEEPER-1333 manually applied, and a C client for 3.4.1 with the patches for ZOOKEEPER-1163.

      I noticed that when I connected using zkCli.sh to the first node (90.0.0.221, zkid 84), I saw only one znode in a particular path:

      [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
      [nominee0000000011]
      [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee0000000011
      90.0.0.222:7777
      cZxid = 0x400000027
      ctime = Thu Jan 19 08:18:24 UTC 2012
      mZxid = 0x400000027
      mtime = Thu Jan 19 08:18:24 UTC 2012
      pZxid = 0x400000027
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0xa234f4f3bc220001
      dataLength = 16
      numChildren = 0

      However, when I connected zkCli.sh to the second server (90.0.0.222, zkid 251), I saw three znodes under that same path:

      [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
      nominee0000000006 nominee0000000010 nominee0000000011
      [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee0000000011
      90.0.0.222:7777
      cZxid = 0x400000027
      ctime = Thu Jan 19 08:18:24 UTC 2012
      mZxid = 0x400000027
      mtime = Thu Jan 19 08:18:24 UTC 2012
      pZxid = 0x400000027
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0xa234f4f3bc220001
      dataLength = 16
      numChildren = 0
      [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee0000000010
      90.0.0.221:7777
      cZxid = 0x30000014c
      ctime = Thu Jan 19 07:53:42 UTC 2012
      mZxid = 0x30000014c
      mtime = Thu Jan 19 07:53:42 UTC 2012
      pZxid = 0x30000014c
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0xa234f4f3bc220000
      dataLength = 16
      numChildren = 0
      [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee0000000006
      90.0.0.223:7777
      cZxid = 0x200000cab
      ctime = Thu Jan 19 08:00:30 UTC 2012
      mZxid = 0x200000cab
      mtime = Thu Jan 19 08:00:30 UTC 2012
      pZxid = 0x200000cab
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x5434f5074e040002
      dataLength = 16
      numChildren = 0

      These never went away for the lifetime of the server, for any clients connected directly to that server. Note that this cluster is still configured with all three servers; the third one (90.0.0.223, zkid 162) is down.

      I captured the data/snapshot directories for the two live servers. When I start single-node servers using each directory, I can briefly see that the inconsistent data is present in those logs, though the ephemeral nodes seem to get (correctly) cleaned up pretty soon after I start the server.

      I will upload a tar containing the debug logs and data directories from the failure. I think we can reproduce it regularly if you need more info.

      Attachments

      1. ZOOKEEPER-1367.tgz
        5.06 MB
        Jeremy Stribling
      2. ZOOKEEPER-1367.patch
        15 kB
        Benjamin Reed
      3. ZOOKEEPER-1367.patch
        16 kB
        Benjamin Reed
      4. ZOOKEEPER-1367-3.4.patch
        16 kB
        Benjamin Reed
      5. 1367-3.3.patch
        8 kB
        Ted Yu
      6. ZOOKEEPER-1367-3.3.patch
        12 kB
        Benjamin Reed

        Activity

        Jeremy Stribling added a comment -

        Contains logs and data/snapshot dirs for 90.0.0.221 and 90.0.0.222.

        Camille Fournier added a comment -

        I'll take a look this weekend unless someone's on it now.

        Ted Dunning added a comment -

        This sort of issue is usually a configuration bug. Can you post your configs as well?

        Jeremy Stribling added a comment -

         There is no config file – we embed ZooKeeper into our application framework and set up things through code. I'm not sure what you're looking for exactly – I've given the IPs and ZK ids of the servers above. Besides that, we set up specific client, election, and server ports (those shouldn't matter), and put the snapshot and log directories together in the same place (I know, I know – we're working on teasing them apart, but it's not currently a bottleneck for our application). Tick time and election algorithm are the default. Clients request a session timeout of 1 second, which means they get the default minimum of 2 times the default tick time. Let me know if you are asking about some other config value.

        I certainly wouldn't be surprised if it's a problem with our use of Zookeeper, since we use it in a non-traditional way, but this scenario was so simple that nothing jumps to mind immediately that could cause it.
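
         For reference, the 1-second request above gets raised because the server clamps requested timeouts to the range [2 * tickTime, 20 * tickTime] by default. A minimal sketch of that negotiation (illustrative only, not project code; it assumes the default tickTime of 2000 ms and the default min/max bounds):

          // Sketch: how a requested session timeout is clamped server-side.
          public class SessionTimeoutSketch {
              public static void main(String[] args) {
                  int tickTime = 2000;                     // default tick time (ms)
                  int minSessionTimeout = 2 * tickTime;    // default lower bound
                  int maxSessionTimeout = 20 * tickTime;   // default upper bound

                  int requested = 1000;                    // the 1-second request above
                  int negotiated = Math.min(Math.max(requested, minSessionTimeout),
                                            maxSessionTimeout);
                  // Prints 4000: the 1 s request is raised to 2 * tickTime.
                  System.out.println("negotiated timeout = " + negotiated + " ms");
              }
          }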

        Ted Dunning added a comment -

        Jeremy,

        How do you absolutely verify that all three nodes know about all three of the other nodes and that they have the correct zkid's?

        This isn't a matter of what your code looks like on the face, or what the intent is. The question is how do you ask each node's ZK instance about the configuration?

        Jeremy Stribling added a comment -

         I'd like to avoid getting into the exact algorithmic details about our code, but the nodes share their local information with a central component via RPC, which then updates each server with the full list, before any of them even start ZK in the first place. We print this info out in another log, and I've verified that it's the same on the two live nodes before ZK starts. That configuration info is:

        "90.0.0.223"
        client_port: 2888
        server_port: 2878
        election_port: 3888
        server_id: 162
        uuid: "69dfc0f4-d2b7-4ee4-9dee-cf2bde2e386f"

        "90.0.0.221"
        client_port: 2888
        server_port: 2878
        election_port: 3888
        server_id: 84
        uuid: "2df0e0f4-612b-43b9-8871-52f601468577"

        "90.0.0.222"
        client_port: 2888
        server_port: 2878
        election_port: 3888
        server_id: 251
        uuid: "0e24e45e-7939-46b5-85b8-63919fab03b8"

        Furthermore, we were running this exact same way with 3.3.3 for more than a year, and this particular test has always passed. So unless we were doing something bad before and 3.4 now enforces it, or some new requirements crept into 3.4, I'm pretty confident in saying that our setup is consistent across both live nodes.

        Ted Dunning added a comment -

         If this were my system, I would interpret what you just said as "we know what we are feeding to ZK", but I would not say "we know that ZK got the right config".

        Your point about the difference in behavior between 3.3 and 3.4 is very important, but we still need to lock down where the difference is. Camille will be looking at the logs, but we still need to verify that the configs got to the right place.

        Mahadev konar added a comment -

        @Camille,
         Great. I'll also dig in during the weekend. Hopefully you'll find it before I take a look.

        Camille Fournier added a comment -

        Jeremy pretty much always brings us good bugs, Ted, I don't think he's wasting our time.

        Jeremy, these logs are from the point at which the cluster is running with two members and 221 doesn't have the nodes, but 222 does, correct?

         I'm noticing that in the log files I don't see a close session transaction for the session that created /election/zkrsm/nominee0000000010. Just verifying: the cluster is accepting write requests and client connections successfully at the point you captured these logs, right?

        Camille Fournier added a comment -

        So I pulled up a cluster on my local machine using these logs, and the two machines in my cluster did expire correctly all the ephemeral nodes you show in the errors. I'm going to assume that when you bring up a 2 node cluster with your setup and these data directories, you see the bad ephemeral nodes, correct? If so, can you try doing it with the latest 3.4.2 jar and see if it still happens?

        Jeremy Stribling added a comment -

         That's correct, Camille: 221 has what I think is the correct state – specifically, that nominee0000000006 and nominee0000000010 have been deleted. I'm pretty sure those were ephemeral nodes that were created before the cluster was restarted – hence the sessions that created them did not connect to the cluster during the session that these logs show. Unfortunately I don't have the logs from the run of the cluster where these nodes were created.

         Another thing I should mention about the logs is that they are just the first 100 MB or so of the session. Our logs roll over after 100 MB, and in the tgz I only included the first 100 MB of the cluster's lifetime. I didn't think anything particularly interesting happened after that, but if you need the rest let me know.

        My clients were definitely successfully connected to node 222, and as far as I can tell it was accepting writes (I think nominee0000000011 is from a successful write during this cluster run). For example, /zkrsm/cpt-0000000000000000-0000000000000000/0000000000000000_record0000004054 was created successfully.

        Jeremy Stribling added a comment -

        So I pulled up a cluster on my local machine using these logs, and the two machines in my cluster did expire correctly all the ephemeral nodes you show in the errors. I'm going to assume that when you bring up a 2 node cluster with your setup and these data directories, you see the bad ephemeral nodes, correct? If so, can you try doing it with the latest 3.4.2 jar and see if it still happens?

        I haven't tried it with a 2-node cluster yet. I tried a 1-node cluster with each of the data directories, and yes I noticed that they both successfully deleted the ephemerals after a very short period of time. I saw the bad ones only very briefly before they were deleted (as I would expect). I can try a 2-node cluster with the latest jar (I will if I get some time later today), but I suspect that it will also behave correctly. I think this session covered by the logs was the exception.

        By "doing it", do you mean re-running the whole test with the latest 3.4.2 jar? Is there something fixed besides 1333 that you suspect could have been the culprit?

        Camille Fournier added a comment -

         Hmmm, I must be confused. I thought that the test you were running resulted in the cluster in this setup, with the two nodes running and a third down, with these data directories. But if I start the cluster with two nodes and these data directories, the sessions immediately expire and delete those nodes. On the other hand, in the logs I don't see any evidence of session expiration for the sessions holding the ephemerals on either machine. When you get into this situation, if you bounce the cluster again with the two nodes, does it fix the problem?

        I don't know if there's anything in 3.4.2 without checking but it seems like a worthwhile sanity check to do.

        Jeremy Stribling added a comment -

         Hmmm, I must be confused. I thought that the test you were running resulted in the cluster in this setup, with the two nodes running and a third down, with these data directories. But if I start the cluster with two nodes and these data directories, the sessions immediately expire and delete those nodes. On the other hand, in the logs I don't see any evidence of session expiration for the sessions holding the ephemerals on either machine. When you get into this situation, if you bounce the cluster again with the two nodes, does it fix the problem?

         Very sorry for not being clear. I didn't capture the data directories for the system until after it was already in this state, so these data directories probably aren't exactly the same as the ones the system started with when it got into a bad state (though they likely contain a superset of the logs/snapshots that were present at the time). Later, when I manually tried to restart the system (in a 1-node configuration) using the data directories, I saw the same thing you did – the ephemerals went away correctly.

         I didn't attempt bouncing the system live. The setup is being run by a different team in my company (QA); I will try to get them to reproduce with straight 3.4.2, but it will take a few days. I do suspect that bouncing it would fix the problem, since manually restarting it later fixed things.

        Patrick Hunt added a comment -

        Some initial thoughts, things I hadn't seen mentioned already in this thread:

         One thing to note - znodes 6, 10 and 11 are each created in a different epoch (2, 3, 4 respectively). So in each case leadership must have changed hands. Also note each is a different ephemeralOwner session id.
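
         For readers following along, the epoch here is read from the high 32 bits of the zxid, with the low 32 bits being a per-epoch counter; a small illustrative sketch using the zxids from the stat output above:

          // Sketch: a zxid packs the leader epoch in its high 32 bits and a
          // per-epoch counter in its low 32 bits.
          public class ZxidEpochSketch {
              public static void main(String[] args) {
                  // zxids of nominee0000000006, ...0010 and ...0011 from the stat output
                  long[] zxids = {0x200000cabL, 0x30000014cL, 0x400000027L};
                  for (long zxid : zxids) {
                      long epoch = zxid >>> 32;          // 2, 3 and 4 respectively
                      long counter = zxid & 0xffffffffL; // position within the epoch
                      System.out.printf("zxid 0x%x -> epoch %d, counter 0x%x%n",
                                        zxid, epoch, counter);
                  }
              }
          }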

        Also notice the ctime of each of these znodes - znode 10 was created against a leader that's 10-20 minutes behind the other server(s). That said, it's not likely this is causing the issue but it is a variable. Might be good for you to rule it out by running ntp or something to update all the hosts.

        Anything special about these server hosts? For example are they virtualized? Anything else you can think of out of the ordinary - ordinary being the servers are run on dedicated non-virtualized hosts.

        Are you using multi for this or the pre-3.4 api only? Ie did you upgrade to 3.4, start using multi, then see this issue? Or just update to 3.4 from 3.3 with essentially no changes and see it start happening?

        If you use 4letterwords to query the server (use 'dump' in this case) when you're in this bad state, what does it say about the ephemeralOwner session id of the ephemeral znode that shouldn't be there but is? Run this against each of the live servers.

        Jeremy Stribling added a comment -

        Thanks very much for your thoughts. Inline:

         One thing to note - znodes 6, 10 and 11 are each created in a different epoch (2, 3, 4 respectively). So in each case leadership must have changed hands. Also note each is a different ephemeralOwner session id.

        Yeah, that's not surprising. Due to the nature of our ZooKeeper embedding, we need to restart all the ZK servers every time a new node comes up, probably leading to new internal ZK elections.

        Also notice the ctime of each of these znodes - znode 10 was created against a leader that's 10-20 minutes behind the other server(s). That said, it's not likely this is causing the issue but it is a variable. Might be good for you to rule it out by running ntp or something to update all the hosts.

        That's a good catch, and probably something I should have mentioned earlier. I thought ZK didn't care much about wall-clock time, but obviously it would make the logs harder to read. Here are the relative times of the 3 servers during the test:

        90.0.0.221: Thu Jan 19 18:42:00 UTC 2012 (base)
        90.0.0.222: Thu Jan 19 18:26:55 UTC 2012 (-15:05)
        90.0.0.223: Thu Jan 19 18:17:50 UTC 2012 (-24:10)
        

        Could that affect ZK correctness?

        Anything special about these server hosts? For example are they virtualized? Anything else you can think of out of the ordinary - ordinary being the servers are run on dedicated non-virtualized hosts.

         As they run as part of our automated QA testing, they are virtualized, three or five to a box, on ESX. Again, this hasn't changed since we were using 3.3.3, but I understand it's not the desired setup. We run on bare metal in production; this is just for testing.

        Are you using multi for this or the pre-3.4 api only? Ie did you upgrade to 3.4, start using multi, then see this issue? Or just update to 3.4 from 3.3 with essentially no changes and see it start happening?

        Pre-3.4 api only, no multi commands. No changes to our application code after the upgrade.

        If you use 4letterwords to query the server (use 'dump' in this case) when you're in this bad state, what does it say about the ephemeralOwner session id of the ephemeral znode that shouldn't be there but is? Run this against each of the live servers.

        I have not tried this. If I can wrangle the QA team into re-running the test, I will give this a try and report back.

        Patrick Hunt added a comment -

        we need to restart all the ZK servers

         Hm..... very interesting. What exactly does this mean? You mentioned earlier that you "embed Zookeeper into our application framework and set up things through code" – how exactly are you performing this "restart"? Is ZK a separate process, are you killing processes, or are you calling some code to effect this? I ask because we really don't support this and I'm wondering if that could be related.

         If I wanted to set up a JUnit test, how might I go about doing it? If you could provide some insight it might help in reproducing.

        I thought ZK didn't care much about wall-clock time

         that's true, but the more variables we can eliminate, the easier it will be to track the real issue down.

        I have not tried this. If I can wrangle the QA team into re-running the test, I will give this a try and report back.

        I suspect that the session is not getting cleared until after restarting the cluster, but this would help to verify (I need to dig into your attachment, planning to look at that next).

        Patrick Hunt added a comment -

        btw, if you do have QA retest this please do capture all the logs (log4j). Unfortunately the two logs don't both show the znode expiring (I see the time in question in one but not the other log), that would give much more insight into what happened...

        Jeremy Stribling added a comment -

         Hm..... very interesting. What exactly does this mean? You mentioned earlier that you "embed Zookeeper into our application framework and set up things through code" – how exactly are you performing this "restart"? Is ZK a separate process, are you killing processes, or are you calling some code to effect this? I ask because we really don't support this and I'm wondering if that could be related.

         ZK is embedded into a Java process, running alongside some other Java apps we need to manage. The reason for this is that dynamic cluster membership is absolutely required for our application; we cannot know the IPs/ports/server IDs of all of the ZooKeeper servers that will exist in the cluster. So as new nodes come online, they connect to a centralized part of our service, and we distribute the new list of servers to all the existing servers, so they can restart themselves. By "restart" here, I mean we call QuorumPeer.shutdown (and FastLeaderElection.shutdown), delete the previous QuorumPeer, and construct a new one with the new configuration.

        This is the same way we ran things under 3.3.3. I understand that this is not officially supported, but in my heart of hearts I don't believe it is related to the bug at hand, so I appreciate your indulgence on the matter.
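
         For anyone trying to reproduce this embedding, here is a rough sketch of the shutdown-and-recreate cycle described above, loosely modeled on what QuorumPeerMain.runFromConfig does in 3.4. The setter values, ports and paths are illustrative, and as noted this is not an officially supported way to run a quorum:

          // Rough sketch only: embed a QuorumPeer, then tear it down and rebuild it
          // with a new membership map. Ports, ids, limits and paths are illustrative.
          import java.io.File;
          import java.net.InetSocketAddress;
          import java.util.Map;

          import org.apache.zookeeper.server.ServerCnxnFactory;
          import org.apache.zookeeper.server.ZKDatabase;
          import org.apache.zookeeper.server.persistence.FileTxnSnapLog;
          import org.apache.zookeeper.server.quorum.QuorumPeer;
          import org.apache.zookeeper.server.quorum.QuorumPeer.QuorumServer;
          import org.apache.zookeeper.server.quorum.flexible.QuorumMaj;

          public class EmbeddedQuorumSketch {

              static QuorumPeer startPeer(long myId, Map<Long, QuorumServer> servers,
                                          File dataDir, int clientPort) throws Exception {
                  QuorumPeer peer = new QuorumPeer();
                  peer.setMyid(myId);
                  peer.setQuorumPeers(servers);                         // full membership map
                  peer.setQuorumVerifier(new QuorumMaj(servers.size()));
                  peer.setTxnFactory(new FileTxnSnapLog(dataDir, dataDir));
                  peer.setZKDatabase(new ZKDatabase(peer.getTxnFactory()));
                  peer.setElectionType(3);                              // FastLeaderElection
                  peer.setTickTime(2000);
                  peer.setInitLimit(10);
                  peer.setSyncLimit(5);
                  peer.setCnxnFactory(ServerCnxnFactory.createFactory(
                          new InetSocketAddress(clientPort), 60));
                  peer.start();
                  return peer;
              }

              // The "restart": shut the old peer down, then build a brand-new one
              // from the updated server list, reusing the same data directory.
              static QuorumPeer restart(QuorumPeer oldPeer, long myId,
                                        Map<Long, QuorumServer> newServers,
                                        File dataDir, int clientPort) throws Exception {
                  oldPeer.shutdown();
                  return startPeer(myId, newServers, dataDir, clientPort);
              }
          }

         Each rebuilt peer reloads its data tree from the local snapshot and transaction log, so the new QuorumPeer starts from whatever had been persisted before the shutdown.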

         that's true, but the more variables we can eliminate, the easier it will be to track the real issue down.

        We are supposed to be running with synced clocks, but QA is trying to track down a bug with their system right now to figure out why NTP isn't working in their environment. Sorry for the extra level of confusion.

        btw, if you do have QA retest this please do capture all the logs (log4j). Unfortunately the two logs don't both show the znode expiring (I see the time in question in one but not the other log), that would give much more insight into what happened...

        I don't quite follow your question. Do you mean that my logs aren't capturing all of the output logged by ZooKeeper? Or are you asking for the logs from the previous run of the system, when the znodes were originally created? I will definitely try to get the latter, but as for the former problem – this is everything that ZooKeeper logged, so I'm not sure what else I can capture during the run.

        I think the very problem is that one of the nodes didn't expire one of the sessions, so I wouldn't expect that to be in the log for that server. But maybe I don't quite understand what you're asking.

        Benjamin Reed added a comment -

        jeremy, with this embedding, do you ever change the set of machines you are using? like do you ever go from using machines A,B,C to machines A,B,C,D,E?

        Jeremy Stribling added a comment -

        jeremy, with this embedding, do you ever change the set of machines you are using? like do you ever go from using machines A,B,C to machines A,B,C,D,E?

         Yes, though we only add or remove one node at a time. So when the system first comes up, we actually do {A} -> {A, B} -> {A, B, C}. I know the first step is potentially problematic because there's no majority from the previous cluster present in the new cluster, but we make sure A comes up first. Small comfort I know, but that's the best we've been able to do until real dynamic cluster membership is available.

        Patrick Hunt added a comment -

        I just noticed something that doesn't look good:

        I started a 2 server ensemble, connected a client, then stopped/started the leader. I noticed that:

         1) the follower does not create a snapshot after reconnecting to the new leader (the same server is re-elected leader in epoch 3 as led in epoch 2)
        2) the follower continues to write changes from epoch 3 into the log from epoch 2.

        java -cp '/home/phunt/Downloads/zk1367/jars/*' org.apache.zookeeper.server.LogFormatter log.200000001 
        ZooKeeper Transactional Log File with dbid 0 txnlog format version 2
        1/23/12 4:39:45 PM PST session 0x1350d25cc790000 cxid 0x0 zxid 0x200000001 createSession 30000
        
        1/23/12 4:41:02 PM PST session 0x1350d25cc790000 cxid 0x2 zxid 0x300000001 create '/foo,#626172,v{s{31,s{'world,'anyone}}},F,1
        
        1/23/12 4:44:10 PM PST session 0x1350d25cc790000 cxid 0x4 zxid 0x300000002 closeSession null
        1/23/12 4:44:26 PM PST session 0x1350d283e690000 cxid 0x0 zxid 0x300000003 createSession 30000
        
        1/23/12 4:44:43 PM PST session 0x1350d283e690000 cxid 0x2 zxid 0x300000004 closeSession null
        1/23/12 4:44:51 PM PST session 0x2350d283e5c0000 cxid 0x0 zxid 0x300000005 createSession 30000
        
        1/23/12 4:45:10 PM PST session 0x2350d283e5c0000 cxid 0x2 zxid 0x300000006 closeSession null
        1/23/12 4:45:17 PM PST session 0x1350d283e690001 cxid 0x0 zxid 0x300000007 createSession 30000
        
        EOF reached after 8 txns.
        
        Patrick Hunt added a comment -

         Hm, maybe not an issue - the new epoch tracking code seemed to handle this ok. I also tried shutting down, deleting the old leader datadir, and restarting, and the new leader tracked the changes properly...

        Patrick Hunt added a comment -

         fwiw I did file ZOOKEEPER-1372, which does seem like a valid issue, although not related to this afaik

        Benjamin Reed added a comment -

         so when you say that you only add or remove one node at a time, does that mean you:

        1) have a running cluster
         2) take it down
        3) change the configuration to add a node
        4) start the cluster back up
        5) wait until a leader becomes active

         you only add or remove the next node after you have completed all 5 steps for the previous configuration?

        Jeremy Stribling added a comment -

         so when you say that you only add or remove one node at a time, does that mean you:

        1) have a running cluster
        2) take it done
        3) change the configuration to add a node
        4) start the cluster back up
        5) wait until a leader becomes active

         you only add or remove the next node after you have completed all 5 steps for the previous configuration?

        That's more or less correct, except that steps 2-5 are being executed in parallel on all the nodes to speed things up, so they are not proceeding in lockstep across the cluster. We wait for all nodes to complete the 5 steps before moving on to the next node, if there is one.

        I am still working on getting this reproduced by QA. They said they've fixed the NTP issues, so hopefully the new logs will be easier to work with if we can get it to happen again.

        Patrick Hunt added a comment -

        Looks like Ben is on to something here. I investigated this more and found the following:

        First off the problem statement is not correct - it's not that sessions are expiring/closing and ephemerals are hanging around on some servers. What's really happening is that the session is not being closed/expired at all. I can see in the txn logs where the session is created, but I never see the session closed. This goes for the sessions for both znodes 6 and 10. As far as the servers know the session is created but never explicitly closed/expired. (also the znode itself is never explicitly deleted via a client)

        see this log msg for the location I'm talking about:

        2012-01-19 08:18:17,175 30953 [QuorumPeer[myid=84]/90.0.0.221:2888] TRACE org.apache.zookeeper.server.ZooKeeperServer - ZooKeeperServer — killSession: 0xa234f4f3bc220000

        Let's now talk more about ephemeral znode /election/zkrsm/nominee0000000010 owned by session 0xa234f4f3bc220000

         What is happening is that when 221 reads the snapshot/logs as it becomes leader, it does not see session 0xa234f4f3bc220000 at all. It's neither in the snapshot nor in the logs.

         I used LogFormatter/SnapshotFormatter to dump the logs/snaps; the interesting thing is that 221 never sees session 0xa234f4f3bc220000 in its (session create/close related) txn logs. As far as it knows the session was never created.

        I see this on 221:

        1) log.200000ae7 ends with zxid 0x30000012c

        2) log.300000144.txt (the next log it has after log.200000ae7) starts with 0x300000144

        On 222 I can see

        3) session 0xa234f4f3bc220000 was created at zxid 0x300000134

        1/18/12 11:53:36 PM PST session 0xa234f4f3bc220000 cxid 0x0 zxid 0x300000134 createSession 6000

        So 221 doesn't see the session created because it was not active in the quorum during this time.

        It seems that some side-effect of the restart/reconfig process being used in this test is causing the quorum to get into an inconsistent state.

        Patrick Hunt added a comment -

         Jeremy - It will be good if you can reproduce this and provide full logs (log4j) and the datadirs for the servers. Unfortunately I can't see prior to epoch 3 on 221 (the log4j logs are truncated). afaict the same issue is happening just prior to 221 becoming the leader, as I just described in the previous comment - it's getting a diff, and it's likely that that diff was missing the session detail for some reason. I can't tell why, though, or whether it might be a bug or just a side-effect of how the restart/reconfig is happening. I'd really like to nail it down though.

        Ben/Flavio any insights?

        Jeremy Stribling added a comment -

         QA was able to reproduce it. The good news is that NTP is working properly, so the logs should be a bit saner; also, I was able to salvage all the logs for the node that has the incorrectly-unexpired session. The bad news is that I don't have any usable logs for one of the nodes, and for the other one I only have the most recent logs. I hope this helps at least a little.

        This time around, all three nodes are live when the problem happens. There are three nodes: 90.0.0.1/208/node1, 90.0.0.2/30/node2, 90.0.0.3/119/node3. node2 is the one in the bad state:

        [zk: 90.0.0.2:2888(CONNECTED) 0] ls /election/zkrsm
        [nominee0000000034, nominee0000000035, nominee0000000032, nominee0000000036]
        [zk: 90.0.0.2:2888(CONNECTED) 1] get /election/zkrsm/nominee0000000032 
        90.0.0.1:7777 
        cZxid = 0xe00000017
        ctime = Wed Jan 25 20:53:15 UTC 2012
        mZxid = 0xe00000017
        mtime = Wed Jan 25 20:53:15 UTC 2012
        pZxid = 0xe00000017
        cversion = 0
        dataVersion = 0
        aclVersion = 0
        ephemeralOwner = 0x1e3516a4bb770000
        dataLength = 14
        numChildren = 0
        

        /election/zkrsm/nominee0000000032 is owned by a session (0x1e3516a4bb770000) that should be expired. Connecting to the other two nodes show only the latest three znodes.

        I have uploaded a tar file containing the datadirs of all three nodes, the logs for all runs of node2, and the logs of the latest run of node3, to:
        http://pdos.lcs.mit.edu/~strib/ZOOKEEPER-1367-2.tgz (80 MB zipped, 2.8 GB unzipped)

        Here's how the logs break down:

        • run0.tgz: You can ignore this one, it's from a preliminary run that was not part of the cluster
         • run1.tgz: This is where the cluster is built up using restarts. In node2's case, it is first started with two nodes (node1 and node2). Then that QuorumPeer is shut down and restarted with a configuration of all three nodes.
        • run[2-8].tgz: Each represents a cluster restart. No cluster re-configurations are done during this time. Our whole application process is brought down and then back up using the same ZK configuration.
        • run7.tgz: This is the run where the bad session was created and live:
          2012-01-25 20:53:09,948 505755 [ProcessThread(sid:119 cport:-1):] TRACE org.apache.zookeeper.server.PrepRequestProcessor  - :Psessionid:0x1e3516a4bb770000 type:createSession cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
          
        • run8.tgz: The session created in the previous run of the system lives on, and is never expired as far as I can tell:
          2012-01-25 20:53:27,950 2929 [pool-1-thread-2] TRACE org.apache.zookeeper.server.persistence.FileTxnSnapLog  - playLog --- create session in log: 1e3516a4bb770000 with timeout: 6000
          

        I have the matching run8.log for node3, but it does not mention the bad session at all. I have no usable logs for node1.

        I also managed to do a dump against each node while the system was live. Here are the results:

        # telnet 90.0.0.1 2888
        Trying 90.0.0.1...
        Connected to 90.0.0.1.
        Escape character is '^]'.
        dump
        SessionTracker dump:
        Session Sets (3):
        0 expire at Thu Jan 26 00:09:30 UTC 2012:
        1 expire at Thu Jan 26 00:09:33 UTC 2012:
        	0x773516a5076a0000
        2 expire at Thu Jan 26 00:09:36 UTC 2012:
        	0xd03516a4f7960000
        	0x1e3516a4f8520001
        ephemeral nodes dump:
        Sessions with Ephemerals (3):
        0x773516a5076a0000:
        	/election/zk_live_members/nominee0000000035
        	/election/role-cluster-mgmt_cluster/nominee0000000035
        	/election/cassandra/nominee0000000036
        	/election/zkrsm/nominee0000000034
        0xd03516a4f7960000:
        	/election/role-cluster-api_provider/nominee0000000033
        	/election/connection-cluster/nominee0000000033
        	/election/zkrsm/nominee0000000036
        	/election/zk_live_members/nominee0000000036
        	/election/role-cluster-dht_node/nominee0000000034
        	/election/role-cluster-mgmt_cluster/nominee0000000036
        	/election/role-cluster-switch_manager/nominee0000000033
        	/election/DHT/nominee0000000033
        	/election/dht-reconciliation/nominee0000000034
        	/election/cassandra/nominee0000000034
        	/election/role-cluster-persistence_server/nominee0000000034
        	/election/role-cluster-logical_manager/nominee0000000033
        0x1e3516a4f8520001:
        	/election/zkrsm/nominee0000000035
        	/election/zk_live_members/nominee0000000034
        	/election/role-cluster-mgmt_cluster/nominee0000000034
        	/election/cassandra/nominee0000000035
        
        
        # telnet 90.0.0.2 2888
        Trying 90.0.0.2...
        Connected to 90.0.0.2.
        Escape character is '^]'.
        dump
        SessionTracker dump:
        org.apache.zookeeper.server.quorum.LearnerSessionTracker@1479a6b
        ephemeral nodes dump:
        Sessions with Ephemerals (4):
        0x1e3516a4bb770000:
        	/election/cassandra/nominee0000000032
        	/election/role-cluster-mgmt_cluster/nominee0000000032
        	/election/role-cluster-persistence_server/nominee0000000032
        	/election/role-cluster-dht_node/nominee0000000032
        	/election/dht-reconciliation/nominee0000000032
        	/election/zkrsm/nominee0000000032
        	/election/zk_live_members/nominee0000000032
        0x773516a5076a0000:
        	/election/zk_live_members/nominee0000000035
        	/election/role-cluster-mgmt_cluster/nominee0000000035
        	/election/cassandra/nominee0000000036
        	/election/zkrsm/nominee0000000034
        0xd03516a4f7960000:
        	/election/role-cluster-api_provider/nominee0000000033
        	/election/connection-cluster/nominee0000000033
        	/election/zkrsm/nominee0000000036
        	/election/zk_live_members/nominee0000000036
        	/election/role-cluster-dht_node/nominee0000000034
        	/election/role-cluster-mgmt_cluster/nominee0000000036
        	/election/role-cluster-switch_manager/nominee0000000033
        	/election/DHT/nominee0000000033
        	/election/dht-reconciliation/nominee0000000034
        	/election/cassandra/nominee0000000034
        	/election/role-cluster-persistence_server/nominee0000000034
        	/election/role-cluster-logical_manager/nominee0000000033
        0x1e3516a4f8520001:
        	/election/zkrsm/nominee0000000035
        	/election/zk_live_members/nominee0000000034
        	/election/role-cluster-mgmt_cluster/nominee0000000034
        	/election/cassandra/nominee0000000035
        
        
        # telnet 90.0.0.3 2888
        Trying 90.0.0.3...
        Connected to 90.0.0.3.
        Escape character is '^]'.
        dump
        SessionTracker dump:
        org.apache.zookeeper.server.quorum.LearnerSessionTracker@5960a2f1
        ephemeral nodes dump:
        Sessions with Ephemerals (3):
        0x773516a5076a0000:
        	/election/zk_live_members/nominee0000000035
        	/election/role-cluster-mgmt_cluster/nominee0000000035
        	/election/cassandra/nominee0000000036
        	/election/zkrsm/nominee0000000034
        0xd03516a4f7960000:
        	/election/role-cluster-api_provider/nominee0000000033
        	/election/connection-cluster/nominee0000000033
        	/election/zkrsm/nominee0000000036
        	/election/zk_live_members/nominee0000000036
        	/election/role-cluster-dht_node/nominee0000000034
        	/election/role-cluster-mgmt_cluster/nominee0000000036
        	/election/role-cluster-switch_manager/nominee0000000033
        	/election/DHT/nominee0000000033
        	/election/dht-reconciliation/nominee0000000034
        	/election/cassandra/nominee0000000034
        	/election/role-cluster-persistence_server/nominee0000000034
        	/election/role-cluster-logical_manager/nominee0000000033
        0x1e3516a4f8520001:
        	/election/zkrsm/nominee0000000035
        	/election/zk_live_members/nominee0000000034
        	/election/role-cluster-mgmt_cluster/nominee0000000034
        	/election/cassandra/nominee0000000035
        

        The system is still in that state right now; if anyone needs more info from the live system, please ask me for it in the next hour or so.
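
        As an aside, the same dump output can be captured without an interactive telnet session. Below is a minimal sketch that sends ZooKeeper's "dump" four-letter word over a plain socket; it assumes the client port really is 2888 in this deployment (as the sessions above suggest; the stock default would be 2181), and the host/port defaults are just the addresses from this report.

            // Minimal sketch: send ZooKeeper's "dump" four-letter word over a plain socket
            // and print the SessionTracker / ephemeral-node dump it returns.
            import java.io.BufferedReader;
            import java.io.InputStreamReader;
            import java.net.Socket;

            public class DumpCommand {
                public static void main(String[] args) throws Exception {
                    String host = args.length > 0 ? args[0] : "90.0.0.1";          // server from the report
                    int port = args.length > 1 ? Integer.parseInt(args[1]) : 2888; // client port used above
                    try (Socket sock = new Socket(host, port)) {
                        sock.getOutputStream().write("dump".getBytes("US-ASCII"));
                        sock.getOutputStream().flush();
                        BufferedReader in = new BufferedReader(
                                new InputStreamReader(sock.getInputStream(), "US-ASCII"));
                        for (String line; (line = in.readLine()) != null; ) {
                            System.out.println(line);
                        }
                    }
                }
            }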

        Benjamin Reed added a comment -

        can you tell me the server IDs of each of the different servers?

        also in the data directories the accepted and current epochs are all 15b, but your zxids all have epochs way less than that. are those leftover files from another run?

        Jeremy Stribling added a comment -

        can you tell me the server IDs of each of the different servers?

        Sorry if that wasn't clear. The server ID is the second element for each node in this list from my previous comment:

        90.0.0.1/208/node1, 90.0.0.2/30/node2, 90.0.0.3/119/node3

        also in the data directories the accepted and current epochs are all 15b, but your zxids all have epochs way less than that. are those leftover files from another run?

        No, I just copied the entire version-2 directory from the live, running system. Not sure why they'd be inconsistent with the data in the recent snapshots and logs.

        Flavio Junqueira added a comment -

        I'm a bit confused; I don't understand the following in run7.log for node 2:

        2012-01-25 20:53:06,895 502702 [QuorumPeer[myid=119]/90.0.0.2:2888] TRACE org.apache.zookeeper.server.persistence.FileTxnSnapLog  - playLog --- create session in log: 1e35169d27b00000 with timeout: 6000
        

        and

        2012-01-25 20:53:09,948 505755 [ProcessThread(sid:119 cport:-1):] TRACE org.apache.zookeeper.server.PrepRequestProcessor  - :Psessionid:0x1e3516a4bb770000 type:createSession cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
        

        Is it creating the same session twice? I'm not convinced I'm not overlooking something obvious.

        Benjamin Reed added a comment -

        never mind. it wasn't 15b, it was 15, which does match the epoch of the highest zxid.

        Benjamin Reed added a comment -

        one thing that is throwing me is the abundance of -110 (node exists) errors! on recovery they are normal, but they should never happen during normal operation.

        Benjamin Reed added a comment -

        run8.log for 90.0.0.1 would be super useful! also it seems that run7.log is truncated at the beginning for 90.0.0.1; why is that happening? the other thing i don't understand is that the version-2 directory for 90.0.0.1 doesn't have any log or snapshot for epoch f. were you having file system problems on 90.0.0.1?

        Flavio Junqueira added a comment -

        I hadn't parsed the session IDs properly, so my previous observation was off the mark.

        On run8.log from 90.0.0.2, we can see that it adds the session (1e3516a4bb770000) to the sessions list (see FileTxnSnapLog), and it got it from its own transaction log. But, the leader (90.0.0.1) supposedly knows of that session as well, otherwise it was not committed or leader election didn't select the right server. Checking the leader election notification messages, I can't see any problem. The part about the leader being aware of that session so that it can recreate it is the one we can't verify because we don't have run8.log for 90.0.0.1.

        Also, the file run7.log for 90.0.0.1 is chopped, so we can't see what happened in that server when the session was created (2012-01-25 20:53:09,948 according to 90.0.0.2). By inspecting the transaction logs of 90.0.0.1, I couldn't find a session creation for 1e3516a4bb770000, but it does appear in the description of some transactions in log.e00000016. It sounds like it has been transferred via a snapshot?

        Jeremy Stribling added a comment -

        Yes, sorry, 90.0.0.1 was filling up its disk at a fast rate, and we have scripts that clean out old logs and run the ZooKeeper purge script when we get within 90% of disk capacity. This is why it is missing older snapshots and why I said we don't have any usable logs for it.

        The run7.log that's there for 90.0.0.1 should actually be called run8.log, but it is just the most recent log file after all the older ones have been rotated away due to the disk getting close to capacity.

        I will keep working on reproducing it and get into the system before the disks start filling up, so I can get a more complete set of data for you. I have been unable to reproduce it on simple setups myself, so I'm still relying on our QA tests, which come with all this annoying baggage.

        Patrick Hunt added a comment -

        FYI, I created ZOOKEEPER-1377 just so I could look at the snapshot files on this issue. You might find it useful.

        Patrick Hunt added a comment -

        Jeremy - can the cleanup scripts be modified to gzip the old logs rather than rm'ing them? That might allow you to provide us with the full logs for reproducing the issue.

        Camille Fournier added a comment -

        Debugging notes:

        The session that server 2 sees but server 1 is missing is created at zxid 0xe00000001, right after server 2 assumes leadership with server 3.

        Server 1 then joins later, and gets a diff of transactions from zxid e0...1 to 15. It does seem to record the other transactions in this batch. However, it doesn't record the createSession txn, because the data tree's processTxn doesn't do anything for createSession txns. So now the session is live on server 2, and writes go through it to all servers.

        At some point the cluster is restarted and server 1 becomes the leader. It doesn't know about the session that was created, so it never expires it. What I'm not sure about is how it manages to get removed from server 3... perhaps via a trunc? Or perhaps it isn't removed, just not showing up as owned by an ephemeral. I don't have the time to repro to that level right now.

        Anyway, I think the problem is not recording the createSession to the txn log.

        Camille Fournier added a comment -

        On run8.log from 90.0.0.2, we can see that it adds the session (1e3516a4bb770000) to the sessions list (see FileTxnSnapLog), and it got it from its own transaction log. But, the leader (90.0.0.1) supposedly knows of that session as well, otherwise it was not committed or leader election didn't select the right server. Checking the leader election notification messages, I can't see any problem. The part about the leader being aware of that session so that it can recreate it is the one we can't verify because we don't have run8.log for 90.0.0.1.

        Server 1 (90.0.0.1) is not the leader at the time that session is created; server 2 is the leader. Server 1 is not even in the quorum at that point; it's just after server 2 has gained leadership with server 3 as follower.

        Benjamin Reed added a comment -

        pat and i just finished up a debug session (using his super cool snapshot dump utility).

        we found it (actually i see camille got there too).

        the problem is that the createSession happens while server 1 is not connected, but server 1 is not far behind server 2, so when server 1 connects, server 2 sends it a diff. this diff gets processed in Learner.syncWithLeader(). in that method, as it processes the committed transactions of the diff, it applies them to the dataTree, but the dataTree does not update the session map! thus in FinalRequestProcessor we have:

            rc = zks.getZKDatabase().processTxn(request.getHdr(), request.getTxn());
            if (request.type == OpCode.createSession) {
                if (request.getTxn() instanceof CreateSessionTxn) {
                    CreateSessionTxn cst = (CreateSessionTxn) request.getTxn();
                    zks.sessionTracker.addSession(request.sessionId, cst.getTimeOut());
                } else {
                    LOG.warn("*****>>>>> Got " + request.getTxn().getClass() + " "
                            + request.getTxn().toString());
                }
            } else if (request.type == OpCode.closeSession) {
                zks.sessionTracker.removeSession(request.sessionId);
            }
        

        because we don't have that special handling for session ops in syncWithLeader, server 1 doesn't track them, so when it becomes the leader the session is missing.

        the fix is rather simple, as is the test case. i'll have both up tonight.

        thanks everyone for the help in debugging. and jeremy, thanks for providing such great logs and traces and recreating the problem!
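
        To make the shape of that fix concrete, here is a rough sketch of the idea (an illustrative sketch only, not the committed ZOOKEEPER-1367 patch; the helper method and exact signatures are assumptions): hoist the session bookkeeping shown above into a single processTxn helper on the server, so that both FinalRequestProcessor and the diff-replay path in Learner.syncWithLeader() keep the session tracker in step with the data tree.

            // Illustrative sketch only -- not the actual committed patch.
            // Assumed helper on the server side that both FinalRequestProcessor and
            // Learner.syncWithLeader() would call when applying a committed txn.
            public ProcessTxnResult processTxn(TxnHeader hdr, Record txn) {
                int opCode = hdr.getType();
                long sessionId = hdr.getClientId();

                // Apply the transaction to the data tree, as before.
                ProcessTxnResult rc = getZKDatabase().processTxn(hdr, txn);

                // Also keep the session tracker in step with the data tree, so a
                // follower that learns of a createSession via a DIFF starts tracking it.
                if (opCode == OpCode.createSession) {
                    if (txn instanceof CreateSessionTxn) {
                        CreateSessionTxn cst = (CreateSessionTxn) txn;
                        sessionTracker.addSession(sessionId, cst.getTimeOut());
                    } else {
                        LOG.warn("*****>>>>> Got " + txn.getClass() + " " + txn);
                    }
                } else if (opCode == OpCode.closeSession) {
                    sessionTracker.removeSession(sessionId);
                }
                return rc;
            }

        The diff-replay loop in syncWithLeader() would then go through a helper like this instead of applying transactions to the database directly.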

        Patrick Hunt added a comment -

        Ben, I'm looking at the 3.3 branch and I see the same issue in Learner.syncWithLeader... is this a bug in 3.3 as well, or are we/I on the wrong track?

        Mahadev konar added a comment -

        Wow, kudos to Ben/Pat/Camille. I'm glad you guys figured it out, but I have the same question as Pat as to why we don't see this in 3.3.x. Running your test case on the 3.3 branch would be helpful.

        Patrick Hunt added a comment -

        kudos Jeremy for reporting and sticking with us! Much appreciated!

        Mahadev konar added a comment -

        Agreed!

        Jeremy Stribling added a comment -

        Thanks guys! I just now reproduced it with the gzipped log suggestion from Patrick. Anyone interested in more thorough logs?

        Patrick Hunt added a comment -

        Jeremy - Ben's working on a reproducible test case now. I'd suggest holding on to it (if you can) for now, but I think we have enough detail at this point.

        It would be great if you could verify the patch once it's available.

        Jeremy Stribling added a comment -

        Will do, on both counts.

        Benjamin Reed added a comment -

        ok this should fix the bug and also has a test that reliably reproduces the bug.

        Jeremy Stribling added a comment -

        ok this should fix the bug and also has a test that reliably reproduces the bug.

        Awesome, thanks much Ben! In your copious spare time, do you think you could post one that applies cleanly on 3.4.1 or 3.4.2? The patch fails to apply for me to ZooKeeperServer.java and FinalRequestProcessor.java on both releases. I'd like to send it to QA for testing, but I'd rather not take the whole master branch.

        Benjamin Reed added a comment -

        yeah, i'm working on that now. there were a couple of small patches that went in that i have to hand merge.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12512274/ZOOKEEPER-1367.patch
        against trunk revision 1234974.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/928//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/928//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/928//console

        This message is automatically generated.

        Mahadev konar added a comment -

        From https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/928//testReport/

        org.apache.zookeeper.server.quorum.LearnerTest.syncTest
        Failing for the past 1 build (Since #928 )
        Took 74 ms.
        Stacktrace
        java.lang.NullPointerException
        	at org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.createSessionTracker(LearnerZooKeeperServer.java:73)
        	at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:355)
        	at org.apache.zookeeper.server.quorum.LearnerTest.syncTest(LearnerTest.java:114)
        	at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
        
        Mahadev konar added a comment -

        @Ben/Jeremy,
        I'll kick off a 3.4.3 release with this patch and ZOOKEEPER-1373.

        Benjamin Reed added a comment -

        fixed the LearnerTest (it needed to simulate the zk server a bit more robustly)

        Benjamin Reed added a comment -

        here is the patch for the 3.4 branch

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12512281/ZOOKEEPER-1367-3.4.patch
        against trunk revision 1234974.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/930//console

        This message is automatically generated.

        Ted Yu added a comment -

        Is a patch for 3.3 branch needed ?

        I don't see src/java/test/org/apache/zookeeper/server/quorum/LearnerTest.java under 3.3 branch.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12512280/ZOOKEEPER-1367.patch
        against trunk revision 1234974.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/929//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/929//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/929//console

        This message is automatically generated.

        Ted Yu added a comment -

        Patch for 3.3 branch.

        Running tests.

        Since there is no LearnerTest in 3.3, I don't know how we can verify that the problem has been fixed.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12512285/1367-3.3.patch
        against trunk revision 1234974.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/931//console

        This message is automatically generated.

        Ted Yu added a comment -
        BUILD FAILED
        /Users/zhihyu/3.3-zookeeper/build.xml:934: The following error occurred while executing this line:
        /Users/zhihyu/3.3-zookeeper/build.xml:845: Tests failed!
        

        But it is not clear which test failed:

        grep 'Failures:' build/test/logs/TEST-org.apache.zookeeper.* | grep -v 'Failures: 0'
        

        The above command returned nothing.
        There was no hanging test either.

        Benjamin Reed added a comment -

        nice job backporting. i don't think qa will be able to test it properly since it will try to apply the patch to trunk. if you are ambitious about writing a test, we need to:

        1. startup a set of three servers
        2. do something like setData
        3. kill one of the followers
        4. startup a client to create a new session
        5. startup the follower
        6. kill the other follower
        7. do a setData
        8. kill the leader
        9. restart the other follower (the follower from step 3 should be elected here)

        if the bug is present, the session will be missing from the new leader's session tracker
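
        For reference, a rough skeleton of what such a test might look like (this is not the LearnerTest/Zab1_0Test that actually went in; QuorumHarness and its start/kill/restart/connect methods are hypothetical placeholders standing in for whatever quorum test harness is available):

            import org.apache.zookeeper.ZooKeeper;
            import org.junit.Assert;
            import org.junit.Test;

            public class DiffSyncSessionTest {
                @Test
                public void sessionSurvivesDiffSync() throws Exception {
                    QuorumHarness cluster = new QuorumHarness(3);    // 1. start three servers
                    cluster.startAll();
                    ZooKeeper zk = cluster.connect();
                    zk.setData("/", new byte[0], -1);                // 2. do something like setData
                    cluster.killFollower(0);                         // 3. kill one of the followers
                    ZooKeeper session = cluster.connect();           // 4. new session, known to leader + other follower
                    long id = session.getSessionId();
                    cluster.restartFollower(0);                      // 5. restart the follower (it catches up via a DIFF)
                    cluster.killFollower(1);                         // 6. kill the other follower
                    zk.setData("/", new byte[0], -1);                // 7. do a setData
                    cluster.killLeader();                            // 8. kill the leader
                    cluster.restartFollower(1);                      // 9. the follower from step 3 should be elected
                    // With the bug present, the new leader never learned of the session
                    // from the DIFF, so its session tracker has no entry for it.
                    Assert.assertTrue(cluster.leaderTracksSession(id));
                }
            }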

        Benjamin Reed added a comment -

        if no one writes the test for 3.3, i'll write one tomorrow (i have a 10-mile race in the morning).

        Ted Yu added a comment -

        @Benjamin:
        Backporting was the easy part. I am new to the ZooKeeper codebase, so writing the test would take longer.

        Ted Yu added a comment -

        The difference between the 3.3 and trunk APIs is obvious:

            [javac] /Users/zhihyu/3.3-zookeeper/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java:263: cannot find symbol
            [javac] symbol  : method getAcceptedEpoch()
            [javac] location: class org.apache.zookeeper.server.quorum.QuorumPeer
            [javac]                     Assert.assertEquals(1, f.self.getAcceptedEpoch());
            [javac]                                                  ^
            [javac] /Users/zhihyu/3.3-zookeeper/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java:264: cannot find symbol
            [javac] symbol  : method getCurrentEpoch()
            [javac] location: class org.apache.zookeeper.server.quorum.QuorumPeer
            [javac]                     Assert.assertEquals(1, f.self.getCurrentEpoch());
            [javac]                                                  ^
        
        Ted Yu added a comment -

        When I upload the patch for trunk to https://reviews.apache.org, I can clearly see some new whitespace getting introduced. Maybe this is not an issue.

        Benjamin Reed added a comment -

        good catch. although the added lines with tabs replace lines that also have tabs (as do the surrounding lines). i'm a bit torn as to whether we should fix it in the patch since the surrounding lines will then look messed up.

        Patrick Hunt added a comment -

        We can clean up the whitespace separately. Goal here is to fix the issue in as clear a way as possible and get a fix out.

        Camille Fournier added a comment -

        Downloaded the various patches and will try to take a look tomorrow AM assuming internet access (spotty in current location).

        Camille Fournier added a comment -

        Ok I looked at the trunk patch only (sorry but the beach awaits). It looks good to me from a correctness point of view. +1

        Flavio Junqueira added a comment -

        The patch looks good to me, but there is one thing that I'm not entirely clear about, which is why 90.0.0.3 didn't log the session creation. At least, I couldn't find it in the logs and I believe Camille raised a similar concern above.

        Flavio Junqueira added a comment -

        I got it, 90.0.0.3 did log it in log.d00000001. +1 for the trunk patch, tests pass for me, and I checked that the test fails before the patch and passes after applying the patch.

        Patrick Hunt added a comment -

        In addition to this patch we need to have an option to turn off sending of diffs. There are a couple of really strong reasons I can think of to do this:

        1) 3.3.x is broken in a similar way; there is an upgrade problem we can't solve short of having people first upgrade to a fixed 3.3 (3.3.5, say) and then upgrade to 3.4.x. If we could turn off diff sending, this would address the problem.

        2) safety valve. Say we find another new problem with diff sending in 3.4/3.5. Having an option to turn it off would be useful for people as a workaround until a fix is found and released.

        Ted Yu added a comment -

        Can we address turning off diff sending in another JIRA ?

        This JIRA is an important one and warrants new releases of 3.3 and 3.4, in my opinion.

        Patrick Hunt added a comment -

        Another jira is fine. It would be a shortsighted mistake to release a new version that didn't have an option to turn off the diffs though - see my comment re upgrade.

        Ted Yu added a comment -

        I logged ZOOKEEPER-1378

        Patrick Hunt added a comment -

        Thanks Zhihong.

        @Ben - the question Mahadev and I asked earlier in this jira never got answered: why are we seeing this in 3.4 but not in 3.3?

        Camille Fournier added a comment -

        Are we not seeing it in 3.3? It seems to me glancing at the code that we should also be vulnerable to this there.

        Patrick Hunt added a comment -

        @camille - Jeremy reported only seeing it once upgrading to 3.4, he hadn't seen it previously in 3.3.x. (That's not to say there isn't a problem, just why the discrepancy)

        Jeremy Stribling added a comment -

        We had been running the same test on 3.3.3 for the past year and never saw this failure. Of course, that doesn't mean it wasn't there, but I still think something changed in 3.4, possibly elsewhere in the code, to make it more common.

        Henry Robinson added a comment -

        All three patches (trunk, 3.4 and 3.3) look good to me. +1, nice job. I'm comfortable not having the 3.3 test case, although of course it would be good.

        I don't fully understand the reason for this diff:

        @@ -240,8 +242,7 @@
                 // Clean up dead sessions
                 LinkedList<Long> deadSessions = new LinkedList<Long>();
                 for (long session : zkDb.getSessions()) {
        -            sessionsWithTimeouts = zkDb.getSessionWithTimeOuts();
        -            if (sessionsWithTimeouts.get(session) == null) {
        +          if (zkDb.getSessionWithTimeOuts().get(session) == null) {
                         deadSessions.add(session);
                     }
                 }
        

        but if it's just tidying up, that's fine (although it would seem better to lift the getSessionWithTimeOuts call to outside the loop). There's also some extra whitespace and an unused import or two, but those can get cleaned up later.
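
        For what it's worth, hoisting that call as suggested would look something like the sketch below (illustration only; it assumes getSessionWithTimeOuts() returns a Map<Long, Integer>, as the surrounding code implies):

            // Clean up dead sessions, fetching the timeout map once rather than on every iteration.
            Map<Long, Integer> sessionsWithTimeouts = zkDb.getSessionWithTimeOuts();
            LinkedList<Long> deadSessions = new LinkedList<Long>();
            for (long session : zkDb.getSessions()) {
                if (sessionsWithTimeouts.get(session) == null) {
                    deadSessions.add(session);
                }
            }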

        Jeremy Stribling added a comment -

        QA has verified that the patch to the 3.4 branch fixes the problem they were seeing. Thanks everyone for all your hard work on this!

        Mahadev konar added a comment -

        Thanks for confirming, Jeremy. I'll check this in now. The patch looks good to me, though I think we need to clean up our classes so that we have a cleaner separation between what ZKS should be exposing and what ZKDatabase should be exposing.

        Mahadev konar added a comment -

        I just committed this. Thanks Ben and others!

        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1445 (See https://builds.apache.org/job/ZooKeeper-trunk/1445/)
        ZOOKEEPER-1367. Data inconsistencies and unexpired ephemeral nodes after cluster restart. (Benjamin Reed via mahadev)

        mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1238176
        Files :

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LeaderZooKeeperServer.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Learner.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerZooKeeperServer.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/LearnerTest.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
        Benjamin Reed added a comment -

        I have the test case to reproduce this and a patch for 3.3. Should I upload it to this JIRA?

        Mahadev konar added a comment -

        Great. Go ahead and upload. I'll commit it to the 3.3 branch.

        Benjamin Reed added a comment -

        @henry I realized that didn't answer your question about the diff. The sessionsWithTimeouts data structure moved into ZKDatabase as part of the ZKDatabase reuse work Mahadev did a long time ago, but sessionsWithTimeouts stayed a field in the class even though it was only used locally in that method, so I cleaned it up a bit just to make sure it really wasn't used anywhere else.

        Benjamin Reed added a comment -

        Here is the patch to fix 3.3, with a test case that reproduces the bug in 3.3.

        Ted Yu added a comment -

        Mahadev has applied the fix for 3.3.

        I think only SnapshotSessionTest.java is needed for the 3.3 branch at the moment.

        Benjamin Reed added a comment -

        Yes, that is correct. The SnapshotSessionTest reproduces the problem using a method that is much slower than the one used in the 3.4 and trunk patches.

        Mahadev konar added a comment -

        OK, I went ahead and reverted the patch I had committed to 3.3 and committed the latest patch from Ben, since it was a little better. Thanks again, Ben!

        Benjamin Mahler added a comment -

        I'm seeing a regression when upgrading from 3.3.4 to 3.3.6 and tracked it down to this patch.

        We make the following sequence of calls:

        ZooKeeperServer.startup() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.startup()

        This is done through NIOServerCnxn.Factory:

        factory = new NIOServerCnxn.Factory(...);
        factory.startup(zkserver);
        factory.shutdown();

        factory = new NIOServerCnxn.Factory(...);
        factory.startup(zkserver); // Crashes with the stack trace below.
        factory.shutdown();

        Results in the following (on 3.3.6, not on 3.3.4):

        java.lang.IllegalThreadStateException
        	at java.lang.Thread.start(Thread.java:656)
        	at org.apache.zookeeper.server.ZooKeeperServer.startSessionTracker(ZooKeeperServer.java:402)
        	at org.apache.zookeeper.server.ZooKeeperServer.startup(ZooKeeperServer.java:376)
        	at org.apache.zookeeper.server.NIOServerCnxn$Factory.startup(NIOServerCnxn.java:161)
        Exception in thread "AWT-AppKit" Caught a JVM exception, not propagating
        

        Are we mis-using the API or is this a bug?

        Looking at the code, it's clear that calling ZooKeeperServer.startup twice in 3.3.6 will attempt to start the sessionTracker thread twice, which seems incorrect.
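
        For reference, Thread.start() may only be called once per Thread instance, which is where the IllegalThreadStateException comes from: a second ZooKeeperServer.startup() that re-starts the same session-tracker thread will always throw. Below is a minimal, self-contained sketch of a restart-safe pattern; the class names are illustrative only, not ZooKeeper's actual classes or the eventual fix:

            // Illustrative only: FakeSessionTracker stands in for the session-tracker
            // thread; SessionTrackerRestartSketch stands in for the server. A restart
            // must create a new Thread, because start() can be called at most once
            // on any given Thread instance.
            public class SessionTrackerRestartSketch {
                static final class FakeSessionTracker extends Thread {
                    @Override
                    public void run() {
                        try {
                            while (!isInterrupted()) {
                                Thread.sleep(100); // pretend to expire sessions
                            }
                        } catch (InterruptedException ignored) {
                            // exit on shutdown
                        }
                    }
                }

                private FakeSessionTracker tracker;

                void startup() {
                    // Guard: never call start() on a thread that has already been started;
                    // build a fresh one for each (re)start instead.
                    if (tracker == null || !tracker.isAlive()) {
                        tracker = new FakeSessionTracker();
                        tracker.start();
                    }
                }

                void shutdown() throws InterruptedException {
                    if (tracker != null) {
                        tracker.interrupt();
                        tracker.join();
                    }
                }

                public static void main(String[] args) throws Exception {
                    SessionTrackerRestartSketch server = new SessionTrackerRestartSketch();
                    server.startup();
                    server.shutdown();
                    server.startup();   // safe: a new thread is created rather than re-started
                    server.shutdown();
                }
            }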


          People

          • Assignee:
            Benjamin Reed
          • Reporter:
            Jeremy Stribling
          • Votes:
            0
          • Watchers:
            8
