ZOOKEEPER-1277

servers stop serving when lower 32bits of zxid roll over

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.3.3
    • Fix Version/s: 3.3.5, 3.4.4, 3.5.0
    • Component/s: server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Workaround: there is a simple workaround for this issue. Force a leader re-election before the lower 32 bits reach 0xffffffff.

      Most users won't even see this given the number of writes on a typical installation - say you are doing 500 writes/second, you'd see this after ~3 months if the quorum is stable. Any change (such as upgrading the server software) would cause the counter to be reset, thereby staving off this issue for another period.
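      (Back-of-the-envelope check of those numbers, assuming a steady write rate: the lower 32 bits allow 2^32 ≈ 4.29 billion transactions per epoch, so at 500 writes/second rollover arrives after about 4.29e9 / 500 / 86,400 ≈ 99 days - roughly three months - while at 5,000 writes/second it would arrive in about 10 days.)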

      Description

      When the lower 32 bits of a zxid "roll over" (the zxid is a 64-bit number whose upper 32 bits are the epoch number), the epoch number is incremented and the lower 32 bits start at 0 again.
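      The zxid layout can be illustrated with a small, self-contained sketch (the helper names below are illustrative, not ZooKeeper API):

          // Minimal illustration of how a 64-bit zxid packs a 32-bit epoch (high bits)
          // and a 32-bit per-epoch counter (low bits). Helper names are illustrative.
          public final class ZxidSketch {
              static long epochOf(long zxid)   { return zxid >>> 32; }          // upper 32 bits
              static long counterOf(long zxid) { return zxid & 0xffffffffL; }   // lower 32 bits
              static long zxidOf(long epoch, long counter) {
                  return (epoch << 32) | (counter & 0xffffffffL);
              }

              public static void main(String[] args) {
                  long zxid = zxidOf(0x1aL, 0x4L);                   // epoch 26, counter 4
                  System.out.printf("0x%x -> epoch=%d counter=%d%n",
                          zxid, epochOf(zxid), counterOf(zxid));     // 0x1a00000004 -> epoch=26 counter=4
              }
          }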

      This should work fine; however, in the current 3.3 branch the followers treat this as a NEWLEADER message (which it is not) and effectively stop serving clients. Attached clients eventually time out, since heartbeats (and all other operations) are no longer processed. The follower does not recover from this.

      I've tested this on the 3.3 branch and confirmed the problem, but I haven't tried it on 3.4/3.5. It may not happen on the newer branches due to ZOOKEEPER-335; however, there is certainly an issue with updating the "acceptedEpoch" files contained in the datadir. (I'll enter a separate jira for that.)

      1. ZOOKEEPER-1277_br33.patch
        22 kB
        Patrick Hunt
      2. ZOOKEEPER-1277_br33.patch
        22 kB
        Patrick Hunt
      3. ZOOKEEPER-1277_br33.patch
        20 kB
        Patrick Hunt
      4. ZOOKEEPER-1277_br33.patch
        12 kB
        Patrick Hunt
      5. ZOOKEEPER-1277_br34.patch
        26 kB
        Patrick Hunt
      6. ZOOKEEPER-1277_br34.patch
        26 kB
        Patrick Hunt
      7. ZOOKEEPER-1277_trunk.patch
        25 kB
        Patrick Hunt
      8. ZOOKEEPER-1277_trunk.patch
        25 kB
        Patrick Hunt

        Issue Links

          Activity

          Patrick Hunt added a comment -

          I'm thinking a simple fix for this in 3.3 is to skip 0x00000000 xid and go right to 0x00000001. That's a simple change, but testing out the various cases will be tougher. I'm working on that.
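          (For illustration, a hypothetical sketch of the "skip zero" idea described above - note this was not the approach ultimately committed; the final fix forces a re-election instead.)

              // Hypothetical "skip 0x00000000" sketch: when incrementing the zxid would leave the
              // lower 32 bits at zero, increment once more so the new epoch's counter starts at 1.
              public final class SkipZeroSketch {
                  static long nextZxid(long zxid) {
                      long next = zxid + 1;                 // 0x...ffffffff + 1 carries into the epoch bits
                      if ((next & 0xffffffffL) == 0L) {
                          next += 1;                        // skip counter 0x00000000, go straight to 0x00000001
                      }
                      return next;
                  }

                  public static void main(String[] args) {
                      long atLimit = 0x00000003ffffffffL;   // epoch 3, counter 0xffffffff
                      System.out.printf("0x%x -> 0x%x%n", atLimit, nextZxid(atLimit)); // -> 0x400000001
                  }
              }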

          Patrick Hunt added a comment -

          This patch is against branch 3.3, so expect it to fail the QA bot.

          I've added a number of tests to verify both that the rollover works and that subsequent restarts of the various quorum members return consistent results. AFAICT it's all working.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12503459/ZOOKEEPER-1277_br33.patch
          against trunk revision 1201045.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/787//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/787//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/787//console

          This message is automatically generated.

          Flavio Junqueira added a comment -

          Hi Pat, this is not correct. The correct thing to do here is to have the leader drop leadership once it is about to wrap around the counter, forcing a leader election. This approach also moves the ensemble to a new epoch, as you're proposing, but it moves to the new epoch correctly by running the recovery phases.

          Patrick Hunt added a comment -

          I thought about that, but it seemed like a bad idea for two reasons:
          1) it would cause all of the clients to disconnect and reconnect unnecessarily, perhaps introducing instability in the process.
          2) can we guarantee that the leader will give up leadership? I.e., how do we effect this - exit the JVM on the leader?

          In talking with Ben about it in the past (perhaps he's since changed his mind) he seemed to think that rolling over to a new epoch number (with no leader re-election) was OK.

          Flavio Junqueira added a comment -

          The scenario I have in mind to say this is incorrect is more or less the following:

          1. Leader L is currently in epoch 3 and it moves to epoch 4 in the way this patch proposes by simply adding 2 to hzxid. The leader proposes a transaction with zxid <4,1>, which is acknowledged by some follower F, but not a quorum;
          2. Concurrently, a new leader L' arises and selects 4 as its epoch (it hasn't talked to L or F);
          3. L' proposes a transaction with zxid <4,1>, which is different from the transaction L proposed with the same zxid and this transaction is acknowledged by a quorum;
          4. L eventually gives up on leadership after noticing that it is not supported by a quorum;
          5. L' crashes;
          6. A new leader arises and its highest zxid is <4,1>. It doesn't have to synchronize with any of the followers because they all have highest zxid <4,1>. We have servers that have different transaction values for the same zxid, which constitutes an inconsistent state.
          Patrick Hunt added a comment -

          I see. Yes, that would be bad. I'll try reworking the patch to drop leadership. Any suggestions on where to look to make that happen?

          Flavio Junqueira added a comment -

          For a quorum setup, it sounds like a good place would be in ProposalRequestProcessor.proposeRequest(). For standalone, it sounds like we should be doing something along the lines of what you proposed in your patch.

          Patrick Hunt added a comment -

          I'll rework the patch and get back. Thanks for the feedback Flavio.

          Benjamin Reed added a comment -

          the problem is that we need to also worry about getting the accepted and current epochs correct when the rollover happens, so we have to do a bit of handshaking when the rollover happens. dropping leadership is the easiest and safest thing to do. the problem i have with special handling for rollover to make the epoch change faster is that the code path would almost never get hit and the path is non-trivial, so it would never be hardened. i would prefer to change the zxid to a long, long before trying to add special logic to handle that case.

          Patrick Hunt added a comment -

          A second attempt to fix this based on Flavio/Ben's feedback.

          This is just the initial patch, I'm still working on the testing but I wanted to have you take a look before I spend a bunch of time on it again and not have it work out. If you could take a look/comment that would be great.

          org.apache.zookeeper.server.quorum.Leader.propose(Request)

          is the interesting bit, the rest is just some plumbing changes to carry up the exception.
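          (A rough, self-contained sketch of the check being described; the class, method, and exception choices here are illustrative rather than the committed code - see the attached patches for the real change to Leader.propose(Request).)

              // Illustrative guard: the leader refuses to propose once the lower 32 bits of the
              // zxid are exhausted, so a re-election must establish a new epoch before any
              // further transactions are proposed.
              final class RolloverGuard {
                  static final long LOW_32_MASK = 0xffffffffL;

                  /** Throws if proposing this zxid would use the last counter value of the current epoch. */
                  static void checkZxidBeforePropose(long zxid) {
                      if ((zxid & LOW_32_MASK) == LOW_32_MASK) {
                          // In the real server this is where the leader shuts itself down and the
                          // exception is carried up through the request-processor plumbing.
                          throw new IllegalStateException(
                              "zxid lower 32 bits have rolled over, forcing re-election and a new epoch");
                      }
                  }
              }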

          Flavio Junqueira added a comment -

          Ok, I only looked at propose() as you suggested, Pat. That method sounds right: it forces a leader election when we reach the limit. However, I'm not sure how we guarantee that Zab will work correctly under this exception. It is an invariant of the protocol that a follower won't go back to a previous epoch; if we roll over, then followers will have to go back to a previous epoch, no? How do we make sure that it doesn't break the protocol implementation?

          Flavio Junqueira added a comment -

          Offline discussion with Pat: the check in propose() is only for the zxid, so the issue I raised about epochs does not apply.

          Patrick Hunt added a comment -

          That's correct; based on the feedback I got from the previous attempt it was clear that we cannot continue without a re-election. In this case I'm detecting that a rollover is just about to occur and dropping leadership at that point. The re-election then happens, a new epoch is chosen, and the lower 32 bits are thereby reset.

          Benjamin Reed added a comment -

          this looks good to me as well. the new exception makes it look a bit more complicated than it should, but i see why you do it.

          Mahadev konar added a comment -

          Sorry, coming in a little late. Looks good to me as well. I would have wanted to avoid the system property, even though it's just for testing. This is where refactoring for testability would have made sense.

          Patrick Hunt added a comment -

          haha, yea I hear you. This is not for unit testing though; I use setZxid for that (see the original test).

          The system property is to allow QA to test this on a real cluster. I've used it for the first level of verification - I started a 3-node cluster with the system property set and used a standard client to force the re-election (by creating znodes, for example). I can then see that the real servers operate properly and handle this case, without waiting for a month of writes to go through the system. Does that make more sense? (I'll update the comment.)
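          (A hedged sketch of the kind of manual verification described above; the connect string, znode names, and loop count are illustrative, and it assumes a running 3-node ensemble whose rollover threshold has been lowered via the test-only system property mentioned here.)

              import org.apache.zookeeper.*;

              public class RolloverSmokeTest {
                  // Drive writes at the quorum, then confirm the ensemble keeps serving after the
                  // forced re-election. Requires a running ensemble; host/port is an assumption.
                  public static void main(String[] args) throws Exception {
                      ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
                      for (int i = 0; i < 1000; i++) {
                          zk.create("/rollover-test-" + i, new byte[0],
                                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                          // If re-election happens mid-write the client may see CONNECTIONLOSS;
                          // a real test would retry and then verify the data is still readable.
                      }
                      zk.close();
                  }
              }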

          Mahadev konar added a comment -

          Ahh... That makes more sense! Updated comments would be good. Thanks!

          Patrick Hunt added a comment -

          This is basically the same patch, but with the tests working/passing and with the comment that Mahadev highlighted updated.

          I've verified this in br33 using both the tests and manual testing with the system property to force a quicker rollover.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12518421/ZOOKEEPER-1277_trunk.patch
          against trunk revision 1297740.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/993//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/993//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/993//console

          This message is automatically generated.

          Patrick Hunt added a comment -

          addressed the findbugs issue.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12518426/ZOOKEEPER-1277_trunk.patch
          against trunk revision 1297740.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/994//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/994//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/994//console

          This message is automatically generated.

          Mahadev konar added a comment -

          +1 on the patches. Looked through all 3. Good to go! Thanks Pat!

          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1493 (See https://builds.apache.org/job/ZooKeeper-trunk/1493/)
          ZOOKEEPER-1277. servers stop serving when lower 32bits of zxid roll over (phunt) (Revision 1301079)

          Result = SUCCESS
          phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1301079
          Files :

          • /zookeeper/trunk/CHANGES.txt
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/RequestProcessor.java
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/SyncRequestProcessor.java
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ProposalRequestProcessor.java
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ReadOnlyRequestProcessor.java
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/ZxidRolloverTest.java
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ClientBase.java
          Dave Latham added a comment -

          We recently experienced an HBase outage that I believe was caused by this issue. Running on ZK 3.4.4, the log for the leader shows this:

          2013-04-12 17:46:25,894 INFO org.apache.zookeeper.server.quorum.Leader: Have quorum of supporters; starting up and setting last processed zxid: 0x1a00000004
          2013-04-12 17:46:25,895 WARN org.apache.zookeeper.server.FinalRequestProcessor: Zxid outstanding 111669149696 is less than current 111669149697
          2013-04-12 17:46:25,895 WARN org.apache.zookeeper.server.quorum.LearnerHandler: ******* GOODBYE /10.0.1.100:34796 ********
          2013-04-12 17:46:25,896 ERROR org.apache.zookeeper.server.NIOServerCnxnFactory: Thread LearnerHandler Socket[addr=/10.0.1.100,port=34796,localport=2888] tickOfLastAck:897811 synced?:true queuedPacketLength:0 died
          java.lang.IllegalThreadStateException
          	at java.lang.Thread.start(Thread.java:638)
          	at org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.startSessionTracker(LeaderZooKeeperServer.java:87)
          	at org.apache.zookeeper.server.ZooKeeperServer.startup(ZooKeeperServer.java:394)
          	at org.apache.zookeeper.server.quorum.Leader.processAck(Leader.java:531)
          	at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:497)
          

          Immediately after this, one of the followers went through a new election and became a follower again. Also, the heap on the leader immediately climbed until the process became stuck spending most of its time in GC. At that point HBase region servers started dropping like flies, and then the ZK node was killed.

          I'm adding this comment now for two purposes. First, so that if other people see the same symptoms in their logs they may find this issue faster. Second, I'd love to hear from anyone more familiar with ZooKeeper whether this issue does indeed explain the observations above.

          Dave Latham added a comment -

          Excuse me, we were running 3.4.3, not 3.4.4

          Patrick Hunt added a comment -

          Hi Dave Latham, it seems unlikely to me. Are you only running HBase against ZK? In that case the number of changes to ZK is going to be far smaller than 4 billion (the amount necessary to roll over the lower 32 bits); HBase just doesn't generate that much traffic - it's mainly used for failover and table management. I've only seen the rollover case with tens of thousands of clients doing large numbers of operations per second.

          You might have hit an issue with 3.4 that was fixed in a subsequent release, but the symptoms you mention don't ring a bell either....

          Dave Latham added a comment -

          Thanks for the response, Patrick Hunt. It is only HBase, but there are 1000 region servers and they are using replication, which puts a much greater load on ZK. Taking a recent sample, I see the zxid going up by thousands per second.

          Patrick Hunt added a comment -

          Dave Latham, this could be it then - thousands of writes per second means roughly a month before rollover.

          Lu Xuehui added a comment -

          When the zxid rolls over, could we do epoch++, and when a new leader arises, epoch += 2? Could that approach avoid throwing the exception?


            People

            • Assignee:
              Patrick Hunt
              Reporter:
              Patrick Hunt
            • Votes:
              0
            • Watchers:
              5

              Dates

              • Created:
                Updated:
                Resolved:

                Development