Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-2803

Flaky test: org.apache.zookeeper.test.CnxManagerTest.testWorkerThreads

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Not A Problem
    • 3.4.10
    • None
    • None
    • None

    Description

      Please ignore. I tested against the wrong version of ZooKeeper and this was resolved by ZOOKEEPER-1653

      We have noticed on internal executions of the integration tests rare failures of org.apache.zookeeper.test.CnxManagerTest.testWorkerThreads.

      java.lang.RuntimeException: Unable to run quorum server 
      	at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:565)
      	at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:520)
      	at org.apache.zookeeper.test.CnxManagerTest.testWorkerThreads(CnxManagerTest.java:328)
      	at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
      Caused by: java.io.IOException: The current epoch, 0, is older than the last zxid, 4294967296
      	at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:546)
      

      along with this strange stack trace in the logs:

      java.nio.channels.ClosedByInterruptException
      	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
      	at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:380)
      	at org.apache.zookeeper.common.AtomicFileOutputStream.close(AtomicFileOutputStream.java:71)
      	at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:1232)
      	at org.apache.zookeeper.server.quorum.QuorumPeer.setCurrentEpoch(QuorumPeer.java:1253)
      	at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:412)
      	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:83)
      	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:851)
      

      It appears that this failure is related to the usage of ((FileOutputStream) out).getChannel().force(true) in AtomicFileOutputStream. FileChannel#force appears to be interruptible, which is not desirable behavior when writing the epoch file. The interrupt may be triggered by the repeated starting and shutting down of quorum peers in testWorkerThreads. Branch 3.5 uses FileDescriptor#sync which is not interruptible and does not appear to have the same problem.

      -I was able to find another JIRA ticket describing a similar issue here:' https://issues.apache.org/jira/browse/DERBY-4963-

      There is also interesting discussion in ZOOKEEPER-1835 (where the change was made for 3.5) although these discussions appear to be Windows centric (we noticed the issue on Linux)https://issues.apache.org/jira/browse/ZOOKEEPER-1835

      The failure appears to have popped up on "ZOOKEEPER-2297 PreCommit Build #3241" but jenkins cleared out the logs (I only still have the test report from the mailing list).

      In addition, testWorkerThreads appears to be failing every few months on Solaris on Apache Jenkins (for 3.4 ZooKeeper_branch34_solaris - Build # 1430 and 3.5 ZooKeeper_branch35_solaris - Build # 387), but at the time I wrote this Jenkins had cleaned out the logs from the latest failed run so I have no way of determining if the cause is the same.

      Attachments

        Activity

          People

            abrahamfine Abraham Fine
            abrahamfine Abraham Fine
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: