Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Not A Problem
- Affects Version/s: 3.4.10
- Fix Version/s: None
- Component/s: None
- Labels: None
Description
Please ignore. I tested against the wrong version of ZooKeeper, and this was resolved by ZOOKEEPER-1653.
We have noticed on internal executions of the integration tests rare failures of org.apache.zookeeper.test.CnxManagerTest.testWorkerThreads.
java.lang.RuntimeException: Unable to run quorum server
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:565)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:520)
    at org.apache.zookeeper.test.CnxManagerTest.testWorkerThreads(CnxManagerTest.java:328)
    at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
Caused by: java.io.IOException: The current epoch, 0, is older than the last zxid, 4294967296
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:546)
along with this strange stack trace in the logs:
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:380)
    at org.apache.zookeeper.common.AtomicFileOutputStream.close(AtomicFileOutputStream.java:71)
    at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:1232)
    at org.apache.zookeeper.server.quorum.QuorumPeer.setCurrentEpoch(QuorumPeer.java:1253)
    at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:412)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:83)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:851)
It appears that this failure is related to the use of ((FileOutputStream) out).getChannel().force(true) in AtomicFileOutputStream. FileChannel#force appears to be interruptible, which is not desirable behavior when writing the epoch file. The interrupt may be triggered by the repeated starting and shutting down of quorum peers in testWorkerThreads. Branch 3.5 uses FileDescriptor#sync, which is not interruptible and does not appear to have the same problem.
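To illustrate the difference described above, here is a minimal standalone sketch (not ZooKeeper code). The class name InterruptibleForceDemo, its helper methods, and the temp-file name are hypothetical and exist only for this example. It sets the calling thread's interrupt flag and then flushes a small file either through FileChannel#force or through FileDescriptor#sync, reporting whether a ClosedByInterruptException was raised.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.ClosedByInterruptException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from ZooKeeper.
class InterruptibleForceDemo {

    /** True if FileChannel#force throws ClosedByInterruptException on an interrupted thread. */
    public static boolean forceFailsWhenInterrupted() {
        return flushWhileInterrupted(true);
    }

    /** True if FileDescriptor#sync throws ClosedByInterruptException on an interrupted thread. */
    public static boolean syncFailsWhenInterrupted() {
        return flushWhileInterrupted(false);
    }

    private static boolean flushWhileInterrupted(boolean useChannelForce) {
        File file = null;
        try {
            file = File.createTempFile("epoch", ".tmp");
            try (FileOutputStream out = new FileOutputStream(file)) {
                out.write("1".getBytes(StandardCharsets.US_ASCII));
                // Simulate a shutdown interrupt arriving just before the flush.
                Thread.currentThread().interrupt();
                if (useChannelForce) {
                    // Goes through the interruptible-channel machinery.
                    out.getChannel().force(true);
                } else {
                    // Plain descriptor sync; does not inspect the interrupt flag.
                    out.getFD().sync();
                }
                return false;
            } catch (ClosedByInterruptException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Clear the interrupt flag so later code is unaffected, then clean up.
            Thread.interrupted();
            if (file != null) {
                file.delete();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("FileChannel#force interrupted: " + forceFailsWhenInterrupted());
        System.out.println("FileDescriptor#sync interrupted: " + syncFailsWhenInterrupted());
    }
}
```

On the JDKs I am familiar with, the force path should report an interruption and the sync path should not, which matches the stack trace above. This only shows the interrupt behavior of the two calls; it says nothing about how the epoch file itself should be written.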
-I was able to find another JIRA ticket describing a similar issue here: https://issues.apache.org/jira/browse/DERBY-4963-
There is also interesting discussion in ZOOKEEPER-1835 (https://issues.apache.org/jira/browse/ZOOKEEPER-1835), where the change was made for 3.5, although these discussions appear to be Windows-centric (we noticed the issue on Linux).
The failure appears to have popped up on "ZOOKEEPER-2297 PreCommit Build #3241", but Jenkins cleared out the logs (I only still have the test report from the mailing list).
In addition, testWorkerThreads appears to be failing every few months on Solaris on Apache Jenkins (for 3.4 ZooKeeper_branch34_solaris - Build # 1430 and 3.5 ZooKeeper_branch35_solaris - Build # 387), but at the time I wrote this Jenkins had cleaned out the logs from the latest failed run so I have no way of determining if the cause is the same.