ZOOKEEPER-1069

Calling shutdown() on a QuorumPeer too quickly can lead to a corrupt log

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.3.3
    • Fix Version/s: 3.3.4, 3.4.0
    • Component/s: quorum, server
    • Environment: Linux, ZK 3.3.3, 3-node cluster.

      Description

      I've only seen this happen once. In order to restart ZooKeeper with a new set of servers, we have a wrapper class that calls shutdown() on the existing QuorumPeer and then starts a new one with the new set of servers. Specifically, our shutdown code looks like this:

        synchronized(_quorum_peer) {
          // Ask the peer to stop; note this returns without joining its thread.
          _quorum_peer.shutdown();
          FastLeaderElection fle = (FastLeaderElection) _quorum_peer.getElectionAlg();
          fle.shutdown();  // I think this is unnecessary
          try {
            // Flush any pending transactions to the log on disk.
            _quorum_peer.getTxnFactory().commit();
          } catch (java.nio.channels.ClosedChannelException e) {
            // ignore; the log channel may already be closed
          }
        }
      

      One time, our wrapper class started one QuorumPeer, and then had to shut it down and start a new one very soon after the QuorumPeer transitioned into the FOLLOWING state. When the new QuorumPeer tried to read in the latest log from disk, it encountered a bogus magic number of all zeroes (the expected value, 1514884167, is the ASCII encoding of the "ZKLG" magic string):

      2011-05-18 22:42:29,823 10467 [pool-1-thread-2] FATAL org.apache.zookeeper.server.quorum.QuorumPeer  - Unable to load database on disk
      java.io.IOException: Transaction log: /var/cloudnet/data/zookeeper/version-2/log.700000001 has invalid magic number 0 != 1514884167
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:510)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:527)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:493)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:576)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:479)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:454)
              at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:325)
              at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:126)
              at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:222)
              at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:398)
      ...
      2011-05-18 22:42:29,823 10467 [pool-1-thread-2] ERROR com.nicira.onix.zookeeper.Zookeeper  - Unexpected exception
      java.lang.RuntimeException: Unable to run quorum server 
              at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:401)
              at com.nicira.onix.zookeeper.Zookeeper.StartZookeeper(Zookeeper.java:198)
              at com.nicira.onix.zookeeper.Zookeeper.RestartZookeeper(Zookeeper.java:277)
              at com.nicira.onix.zookeeper.ZKRPCService.setServers(ZKRPC.java:83)
              at com.nicira.onix.zookeeper.Zkrpc$ZKRPCService.callMethod(Zkrpc.java:8198)
              at com.nicira.onix.rpc.RPC$10.run(RPC.java:534)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
      Caused by: java.io.IOException: Transaction log: /var/cloudnet/data/zookeeper/version-2/log.700000001 has invalid magic number 0 != 1514884167
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:510)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:527)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:493)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:576)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:479)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:454)
              at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:325)
              at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:126)
              at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:222)
              at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:398)
              ... 8 more
      

      I looked into the code a bit, and I believe the problem comes from the fact that QuorumPeer.shutdown() does not join() on the QuorumPeer thread before returning. Here's the scenario I think can happen:

      1. QuorumPeer.run() notices it is in the FOLLOWING state, makes a new Follower, and calls Follower.followLeader(), which starts connecting to the leader.
      2. In the main program thread, QuorumPeer.shutdown() is called.
      3. Through a complicated series of calls, this eventually leads to FollowerZooKeeperServer.shutdown() being called.
      4. This method calls SyncRequestProcessor.shutdown(), which joins on the processor thread and returns. However, it's possible that the SyncRequestProcessor thread hasn't yet been started, because followLeader() hasn't yet called Learner.syncWithLeader(), which hasn't yet called ZooKeeperServer.startup(), which is what actually starts the thread. In that case the join returns immediately (joining a thread that was never started is a no-op; see the sketch after this list), though a requestOfDeath is still added to the queued requests list (possibly behind other requests).
      5. Back in the main thread, FileTxnSnapLog.commit() is called, which doesn't do much because the processor hasn't processed anything yet.
      6. Finally, ZooKeeperServer.startup is called in the QuorumPeer.run() thread, starting up the SyncRequestProcessor thread.
      7. That thread appends some request to the log. The log doesn't exist yet, so it creates a new one, padding it with zeroes.
      8. Now either the SyncRequestProcessor hits the requestOfDeath or the whole QuorumPeer object is discarded. The thread exits without ever committing the log to disk (or the new QuorumPeer tries to read the log before the old thread gets to commit anything), and the log ends up with all zeroes instead of a proper magic number.
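      A self-contained sketch of the core hazard in step 4 (the class and strings here are illustrative, not ZooKeeper code): join() on a thread that was never started returns immediately, so a shutdown path built on join() can "complete" before the worker has even begun.

        public class JoinBeforeStart {
          public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(() -> {
              // Stand-in for SyncRequestProcessor appending to the log.
              System.out.println("worker: writing log...");
            });

            // The shutdown path joins before anyone has called start().
            worker.join();  // returns immediately: the thread isn't alive
            System.out.println("shutdown: join returned, worker assumed done");

            // Later, startup() finally starts the thread, and it runs
            // *after* shutdown already "finished".
            worker.start();
            worker.join();
          }
        }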

      I haven't yet looked into whether there's an easy way to join() on the QuorumPeer thread from shutdown(), so that it won't go on to start the processor threads after it's been shut down. I wanted to check with the group first and see if anyone else agrees this could be a problem.

      I marked this as minor since I think almost no one else uses ZooKeeper this way, but it's pretty important to me personally.

      I will upload a log file showing this behavior shortly.

      Attachments

      1. ZOOKEEPER-1069.log (62 kB, Jeremy Stribling)
      2. log.700000001.gz (64 kB, Jeremy Stribling)
      3. ZOOKEEPER-1069 (3 kB, Vishal Kher)
      4. ZOOKEEPER-1069.patch (3 kB, Vishal Kher)


          Activity

          Jeremy Stribling added a comment -

          Here's a log showing this behavior, along with the log file that got produced (all zeroes, I think).

          Jeremy Stribling added a comment -

          This looks like it might be a duplicate of ZOOKEEPER-1060; if so, sorry for not finding it earlier, and feel free to close this one. I'll try out the fix recommended in that JIRA in my setup.

          Patrick Hunt added a comment -

          Thanks for the feedback, Jeremy. Can you comment on ZOOKEEPER-1060 as to whether you had success with that patch?

          Jeremy Stribling added a comment -

          I added a comment there, but as I mentioned, we've only hit this once in months of testing, and unfortunately I don't have the time to try to carefully reproduce it. I will update if I ever see the problem again even with this patch, but based on my analysis of the logs I'm pretty sure these issues are related.

          Vishal Kher added a comment -

          I think this is a serious bug. ZOOKEEPER-1060 just interrupts the
          QuorumPeer if it is waiting in poll() in lookForLeader().

          I think this bug is caused by the combined use of Java's FileChannel
          and BufferedOutputStream, which can flush data to the log
          out of order.

          From FileTxnLog.append():

          [...]
          fhdr.serialize(oa, "fileheader"); <====== Buffered write
          currentSize = fos.getChannel().position();
          streamsToFlush.add(fos);
          }
          padFile(fos);

          From Util.padLogFile()

          f.getChannel().write(fill, currentSize-fill.remaining()); <========= Non buffered write

          A failure after the call to padLogFile(), but before the
          BufferedOutputStream is flushed, will lead to this bug: the zero
          padding reaches the disk through the FileChannel while the
          buffered header never does.
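          A self-contained sketch of that interleaving (the file name is hypothetical): the header goes through the BufferedOutputStream and sits in its buffer, while the padding goes straight through the FileChannel and hits the file immediately; if the process dies before flush(), the file starts with zeroes.

            import java.io.*;
            import java.nio.ByteBuffer;
            import java.nio.channels.FileChannel;

            public class OutOfOrderFlush {
              public static void main(String[] args) throws IOException {
                File log = new File("log.demo");
                FileOutputStream fos = new FileOutputStream(log);
                BufferedOutputStream bos = new BufferedOutputStream(fos);

                bos.write("ZKLG".getBytes());  // buffered write: not on disk yet

                FileChannel ch = fos.getChannel();
                // Unbuffered padding at offset 4: on disk immediately; the
                // skipped bytes 0-3 are left unwritten (zero on typical filesystems).
                ch.write(ByteBuffer.allocate(60), 4);

                // No bos.flush() -- simulate the process dying here.

                byte[] head = new byte[4];
                try (FileInputStream in = new FileInputStream(log)) {
                  in.read(head);
                }
                // Prints 0 0 0 0: the padding landed, the magic bytes never did.
                System.out.printf("first bytes: %d %d %d %d%n",
                    head[0], head[1], head[2], head[3]);
              }
            }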

          I am raising the priority of this bug.
          Please correct me if I am wrong.

          Vishal Kher added a comment -

          Jeremy,

          Can I suggest a better usage of QuorumPeer for embedded ZK?

          Start a thread (main) which calls QuorumPeerMain.initializeAndRun()

          In your application's shutdown method do:
          quorumpeer.shutdown();
          main.join();

          initializeAndRun() invokes runFromConfig(), which itself does a
          quorumPeer.join().

          I think we need standardized APIs for embedded ZK; I will open a
          JIRA for that.
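          A rough sketch of that pattern (EmbeddedZk, the thread name, and the config path are made up; initializeAndRun() is protected in QuorumPeerMain, so this assumes a subclass, and it assumes access to QuorumPeerMain's protected quorumPeer field):

            import org.apache.zookeeper.server.quorum.QuorumPeerMain;

            public class EmbeddedZk extends QuorumPeerMain {
              private Thread main;

              public void start(final String configPath) {
                main = new Thread(new Runnable() {
                  public void run() {
                    try {
                      // Parses the config and blocks in runFromConfig(),
                      // which ends with the peer's own join().
                      initializeAndRun(new String[] { configPath });
                    } catch (Exception e) {
                      e.printStackTrace();
                    }
                  }
                }, "zk-quorum-main");
                main.start();
              }

              public void stop() throws InterruptedException {
                if (quorumPeer != null) {
                  quorumPeer.shutdown();  // ask the peer to wind down
                }
                main.join();              // wait for runFromConfig() to unwind
              }
            }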

          Jeremy Stribling added a comment -

          Thanks for the suggestion, but calling initializeAndRun() would mean that we'd have to write our config out to a file, right? Right now our config gets transferred to the wrapper class via RPC from another program. We could of course write it all to a file after it's transferred, but that seems like a waste of code.

          The other problem is that we need to be able to transition between a 1-node cluster and a 3-node cluster (and back), so we'd need a similar method for ZooKeeperServer. Maybe there is one and I just haven't found it yet.

          Could I just do a QuorumPeer.join() from my own shutdown method?

          Vishal Kher added a comment -

          Could I just do a QuorumPeer.join() from my own shutdown method?

          Yeah, that should work. BTW, why are you calling commit after shutdown?
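          For reference, a minimal revision of the wrapper shutdown from the description using that approach (a sketch; _quorum_peer as in the original snippet):

            synchronized (_quorum_peer) {
              _quorum_peer.shutdown();
            }
            // QuorumPeer extends Thread, so this waits for its run()
            // loop to exit before a replacement peer is started.
            _quorum_peer.join();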

          Jeremy Stribling added a comment -

          Yeah, that should work. BTW, why are you calling commit after shutdown?

          If I remember correctly, I ran into this problem a long time ago and figured it was because I was shutting down the QuorumPeer without giving it a chance to commit to disk. (A cursory glance at shutdown() didn't show anything that was committing the log.) Should a QuorumPeer.join() (plus a fix for this bug) be enough to guarantee that everything gets pushed consistently to disk?

          Vishal Kher added a comment -

          Attaching patch.

          The bug can be reproduced by killing the server after the call to padLog(). Verified that the patch fixes the problem (the server can read the magic number after a restart).
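          The gist of the ordering fix, sketched against FileTxnLog.append() (an excerpt that may not match the committed patch line for line; logStream is FileTxnLog's BufferedOutputStream over fos):

            fhdr.serialize(oa, "fileheader");
            logStream.flush();  // force the magic number out before padding
            currentSize = fos.getChannel().position();
            streamsToFlush.add(fos);
            [...]
            padFile(fos);       // zero padding now lands after a valid header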

          Vishal Kher added a comment -

          Submitting patch for trunk.

          Camille Fournier added a comment -

          The patch looks OK, but could the test perhaps simulate creating a transaction log, appending the magic number, and then verifying the log looks good without calling close? FileTxnLog.append is a public method, so it should be pretty easy to do.
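          Roughly the shape such a test could take (a sketch, not the committed test; the append/read calls are FileTxnLog's public API, everything else here is made up):

            import java.io.File;
            import org.apache.zookeeper.ZooDefs.OpCode;
            import org.apache.zookeeper.server.persistence.FileTxnLog;
            import org.apache.zookeeper.txn.SetDataTxn;
            import org.apache.zookeeper.txn.TxnHeader;

            public class HeaderBeforePaddingTest {
              public static void main(String[] args) throws Exception {
                File dir = new File("txnlog-test");
                dir.mkdirs();

                FileTxnLog writer = new FileTxnLog(dir);
                TxnHeader hdr = new TxnHeader(1L, 1, 1L,
                    System.currentTimeMillis(), OpCode.setData);
                writer.append(hdr, new SetDataTxn("/test", new byte[0], 1));
                // Deliberately no writer.commit() / writer.close() here.

                // read() throws "invalid magic number" on a corrupt header,
                // so getting past this line means the header hit the disk.
                new FileTxnLog(dir).read(1L);
                System.out.println("magic number intact");
              }
            }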

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12480487/ZOOKEEPER-1069
          against trunk revision 1125581.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 7 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/291//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/291//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/291//console

          This message is automatically generated.

          Vishal Kher added a comment -

          Need a better test.

          Vishal Kher added a comment -

          Hi Camille,

          Attaching patch with modified test.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12480581/ZOOKEEPER-1069.patch
          against trunk revision 1125581.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/295//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/295//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/295//console

          This message is automatically generated.

          Camille Fournier added a comment -

          Looks good Vishal, +1 from me.
          The LoadFromLogTest is not in 3.3.3. Do we want to apply this just to trunk, or to trunk and 3.3.3 minus the test?

          Vishal Kher added a comment -

          Thanks. I would think patches for both ZOOKEEPER-1046 and this bug need to go into 3.3. I think you pointed out earlier some issues while porting to 3.3. I also think ZOOKEEPER-975 needs to go into 3.3. I am not clear on the rationale behind which patch goes to which version.

          Camille Fournier added a comment -

          I'm not either. Could someone else chime in on the process for this?

          Benjamin Reed added a comment -

          The basic rule is that only bug fixes go into old releases. Usually we try to commit bug fixes to both the latest branch and to trunk.

          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1205 (See https://builds.apache.org/hudson/job/ZooKeeper-trunk/1205/)
          ZOOKEEPER-1069: Calling shutdown() on a QuorumPeer too quickly can lead to a corrupt log

          camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1132489
          Files :

          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnLog.java
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java
          • /zookeeper/trunk/CHANGES.txt
          Henry Robinson added a comment -

          I was a bit hasty closing this; I see that Camille is still trying to backport it to 3.3.3.

          Camille Fournier added a comment -

          I will try to finish this up later this week.

          Flavio Junqueira added a comment -

          It should be 3.3.4 and not 3.3.3 (thanks for pointing it out, Mahadev).

          Mahadev konar added a comment -

          Camille/Flavio,
          Any progress on this? Can we close this now?

          Flavio Junqueira added a comment -

          My understanding is that we have committed a patch for this jira to trunk, but there is still a patch missing for the 3.3 branch. Is there anyone working on a patch for 3.3? Camille, Vishal?

          Camille Fournier added a comment -

          Whoops sorry, somehow I lost this one. Will take a look today.

          Camille Fournier added a comment -

          OK, the only merge change needed now is deleting the LOG statement from the test (it relies on slf4j), so I'm just going to check it in.


  People

  • Assignee: Vishal Kher
  • Reporter: Jeremy Stribling
  • Votes: 0
  • Watchers: 0