ZOOKEEPER-1606

intermittent failures in ZkDatabaseCorruptionTest on jenkins

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.5, 3.5.0
    • Fix Version/s: 3.4.6, 3.5.0
    • Component/s: tests
    • Hadoop Flags: Reviewed

    Description

      ZkDatabaseCorruptionTest is failing intermittently on jenkins with:

      "Error Message: the last server is not the leader"

      Seeing this on jdk7/openjdk7/solaris - 3 times in the last month.

      https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper-trunk-openjdk7/2/testReport/junit/org.apache.zookeeper.test/ZkDatabaseCorruptionTest/testCorruption/

      Attachments

        1. zookeeper-1606.patch
          3 kB
          lixiaofeng

        Activity

          infgeoax lixiaofeng added a comment -

          There seems to be a contradiction between the comment and the code:

          //find out who is the leader and kill it
          if (qb.s5.getPeerState() != ServerState.LEADING) {
              throw new Exception("the last server is not the leader");
          }

          I'm new here, so there may be a very good reason I'm not aware of why the last server will always be the leader. But the comment clearly says otherwise: it states that the code below intends to find out who the leader is. According to the code, isn't the leader assumed to be s5 every time?


          phunt Patrick D. Hunt added a comment -

          @lixiaofeng I think you're right. This test most likely passes in the usual case because that quorum member (s5) normally gets elected as leader (typically it's the server with the largest server id, all other things being equal). But there is probably a timing issue that sometimes results in another server becoming the leader (say, when s5 is delayed in startup). I've been meaning to fix this but haven't gotten around to it. Feel free to give it a try. Thanks.
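
          The tie-break described above can be illustrated with a simplified model. This is only a sketch under stated assumptions, not ZooKeeper's election code: `Candidate` and `ElectionOrderSketch` are hypothetical names, candidates are assumed to be ordered by epoch, then last zxid, then server id, and quorum agreement is not modelled at all. It shows why s5 wins when every server is visible and equally up to date, and why another server can win if s5 has not yet been heard from.

```java
import java.util.List;

// Hypothetical candidate description; this is not a ZooKeeper class.
final class Candidate {
    final long epoch;
    final long zxid;
    final long sid;

    Candidate(long epoch, long zxid, long sid) {
        this.epoch = epoch;
        this.zxid = zxid;
        this.sid = sid;
    }
}

final class ElectionOrderSketch {
    // Assumed ordering: higher epoch wins, then higher zxid, then higher server id.
    static boolean beats(Candidate a, Candidate b) {
        if (a.epoch != b.epoch) {
            return a.epoch > b.epoch;
        }
        if (a.zxid != b.zxid) {
            return a.zxid > b.zxid;
        }
        return a.sid > b.sid;
    }

    // Picks the preferred candidate among those heard from so far,
    // or null if the list is empty.
    static Candidate pick(List<Candidate> seen) {
        Candidate best = null;
        for (Candidate c : seen) {
            if (best == null || beats(c, best)) {
                best = c;
            }
        }
        return best;
    }
}
```

          With five equally up-to-date candidates, `pick` returns the one with sid 5; if the candidate with sid 5 is missing from the list (for example because it started late), the one with sid 4 is returned instead.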
          infgeoax lixiaofeng added a comment -

          Find leader

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12567954/zookeeper-1606.patch
          against trunk revision 1441922.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1385//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1385//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1385//console

          This message is automatically generated.

          infgeoax lixiaofeng added a comment -

          So the test itself didn't consider the case where s5 is not the leader. My solution is to find the actual leader by examining each peer's peerState, and to use leaderSid to determine the followers. I used a list in my first attempt, but qb.setupServers() reinstantiates all peers, invalidating all previous references. I have now realized that the leader reference is also invalid.

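
          The approach described above might look roughly like the following. This is a minimal, self-contained sketch, not the committed patch: `Peer`, `PeerState`, and `LeaderLookupSketch` are hypothetical stand-ins for the test's quorum peers and ServerState. The sketch remembers the leader's server id rather than an object reference, on the assumption that the id stays stable while qb.setupServers() replaces the peer objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the quorum peers used by the test;
// these are not ZooKeeper classes.
enum PeerState { LOOKING, FOLLOWING, LEADING }

final class Peer {
    final long sid;
    final PeerState state;

    Peer(long sid, PeerState state) {
        this.sid = sid;
        this.state = state;
    }
}

final class LeaderLookupSketch {
    // Returns the server id of the first peer reporting LEADING, or -1 if none does.
    static long findLeaderSid(List<Peer> peers) {
        for (Peer p : peers) {
            if (p.state == PeerState.LEADING) {
                return p.sid;
            }
        }
        return -1L;
    }

    // Every peer whose id differs from leaderSid is treated as a follower.
    static List<Peer> followersOf(List<Peer> peers, long leaderSid) {
        List<Peer> followers = new ArrayList<>();
        for (Peer p : peers) {
            if (p.sid != leaderSid) {
                followers.add(p);
            }
        }
        return followers;
    }

    // Looks a peer up again by id, e.g. after the peer objects were recreated.
    // Returns null if no peer has that id.
    static Peer bySid(List<Peer> peers, long sid) {
        for (Peer p : peers) {
            if (p.sid == sid) {
                return p;
            }
        }
        return null;
    }
}
```

          A test following this pattern would call `findLeaderSid` once the quorum is up, and call `bySid` on the fresh peer list after each server setup instead of holding on to an old reference.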
          infgeoax lixiaofeng added a comment -

          Acquire reference to the leader again after server setup.
          Check the RuntimeException thrown by leader.start() to make sure we've got the IOException due to invalid datafile.

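
          The exception check mentioned above could be sketched as below. The helper name `CauseCheckSketch.causedByIOException` is hypothetical, and walking the whole cause chain is an assumption made here for robustness; the actual patch may simply inspect the immediate cause of the RuntimeException.

```java
import java.io.IOException;

final class CauseCheckSketch {
    // Returns true if the throwable itself, or anything in its cause chain,
    // is an IOException.
    static boolean causedByIOException(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof IOException) {
                return true;
            }
        }
        return false;
    }
}
```

          In a test, the RuntimeException caught around leader.start() would be passed to this helper, and the test would fail if it returned false, so that an unrelated RuntimeException is not mistaken for the expected corruption error.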
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12569755/zookeeper-1606.patch
          against trunk revision 1446831.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1390//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1390//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1390//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12569789/zookeeper-1606.patch
          against trunk revision 1446831.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1391//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1391//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1391//console

          This message is automatically generated.


          phunt Patrick D. Hunt added a comment -

          Committed to 3.4.6 and trunk. Thanks lixiaofeng!
          hudson Hudson added a comment -

          Integrated in ZooKeeper-trunk #1839 (See https://builds.apache.org/job/ZooKeeper-trunk/1839/)
          ZOOKEEPER-1606. intermittent failures in ZkDatabaseCorruptionTest on jenkins (lixiaofeng via phunt) (Revision 1447618)

          Result = SUCCESS
          phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1447618
          Files :

          • /zookeeper/trunk/CHANGES.txt
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ZkDatabaseCorruptionTest.java

          fpj Flavio Paiva Junqueira added a comment -

          Closing issues after releasing 3.4.6.

          People

            infgeoax lixiaofeng
            phunt Patrick D. Hunt
            Votes: 0
            Watchers: 3
