ZooKeeper / ZOOKEEPER-475

FLENewEpochTest failed on nightly builds.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.1, 3.3.0
    • Component/s: quorum
    • Labels:
      None

      Description

      The FLENewEpochTest failed on one of the nightly builds -
      http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/377.

      1. ZOOKEEPER-475.patch
        3 kB
        Flavio Junqueira
      2. ZOOKEEPER-475.patch
        11 kB
        Flavio Junqueira

          Activity

          mahadev Mahadev konar added a comment -

          flavio, can you take a look at it?
          thanks

          fpj Flavio Junqueira added a comment -

          Great catch! (I know it was Hudson, but it is good that you saw it.)

          The short version of the story is that the synchronization is not correct in QuorumCnxManager.

          The longer version is like this. From the traces, I can see the following sequence of messages:

          • Replica 1 sends a message to itself and to Replica 2 stating that its current vote is for replica 1;
          • Replica 2 sends a message to itself and to Replica 1 stating that its current vote is for replica 2;
          • Replica 1 updates its vote, and sends a message to itself stating that its current vote is for replica 2;
          • Since replica 1 has two votes for 2 in an ensemble of 3 replicas, replica 1 decides to follow 2.
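
The tally behind the steps above can be sketched as follows. This is only an illustration of majority counting in an ensemble of 3; `VoteTally` and `hasMajority` are hypothetical names and are not taken from the ZooKeeper source.

```java
import java.util.HashMap;
import java.util.Map;

class VoteTally {
    // Hypothetical helper: true when `candidate` is backed by a strict
    // majority of the ensemble, given the latest vote seen from each sender.
    static boolean hasMajority(Map<Integer, Integer> votesBySender, int candidate, int ensembleSize) {
        int count = 0;
        for (int vote : votesBySender.values()) {
            if (vote == candidate) {
                count++;
            }
        }
        return count > ensembleSize / 2;
    }

    public static void main(String[] args) {
        int ensembleSize = 3;

        // Replica 1's view after the first two messages: 1 votes for 1, 2 votes for 2.
        Map<Integer, Integer> seenByReplica1 = new HashMap<>();
        seenByReplica1.put(1, 1);
        seenByReplica1.put(2, 2);
        System.out.println("before update: " + hasMajority(seenByReplica1, 2, ensembleSize));

        // Replica 1 updates its own vote to 2 and records that vote locally.
        seenByReplica1.put(1, 2);
        System.out.println("after update: " + hasMajority(seenByReplica1, 2, ensembleSize));
    }
}
```

With two of three votes for replica 2, the check passes for replica 1, which is why it decides to follow 2.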

          The problem is that replica 2 never receives the message from 1 stating that it changed its vote to 2, which prevents 2 from becoming leader. Looking more carefully at why that happened, you can see that when 1 tries to send a message to 2, QuorumCnxManager in 1 is shutting down a connection to 2 at the same time that it is trying to open a new one. The incorrect synchronization prevents the creation of the new connection, and 1 and 2 end up not connected.
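
To make this kind of failure concrete, here is a minimal sketch of one way a teardown that overlaps with a reconnect can leave two peers disconnected. It is a generic illustration under assumed names (`ConnectionTable`, `Conn`, `closeUnconditionally`, `closeIfCurrent` are all hypothetical); it is not the QuorumCnxManager code and not the patch attached to this issue.

```java
import java.util.concurrent.ConcurrentHashMap;

class ConnectionTable {
    // Hypothetical stand-in for a per-peer connection object.
    static final class Conn { }

    private final ConcurrentHashMap<Long, Conn> conns = new ConcurrentHashMap<>();

    // Registers a new connection to `peer`, replacing any previous entry.
    Conn open(long peer) {
        Conn c = new Conn();
        conns.put(peer, c);
        return c;
    }

    // Racy teardown: removes whatever is registered for `peer`,
    // even if it is a newer connection than the one being closed.
    void closeUnconditionally(long peer) {
        conns.remove(peer);
    }

    // Guarded teardown: removes the entry only if it is still `old`.
    boolean closeIfCurrent(long peer, Conn old) {
        return conns.remove(peer, old);
    }

    boolean isConnected(long peer) {
        return conns.containsKey(peer);
    }

    public static void main(String[] args) {
        ConnectionTable table = new ConnectionTable();
        Conn old = table.open(2L);   // existing connection to replica 2
        table.open(2L);              // reconnect happens while `old` is being closed

        // The late teardown of `old` leaves the new connection alone.
        table.closeIfCurrent(2L, old);
        System.out.println("guarded: connected = " + table.isConnected(2L));

        // The unconditional teardown wipes out the new connection too.
        table.closeUnconditionally(2L);
        System.out.println("unconditional: connected = " + table.isConnected(2L));
    }
}
```

The interleaving is replayed sequentially here so the outcome is deterministic; in a real connection manager the two steps would run on different threads.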

          phunt Patrick Hunt added a comment -

          The nightly build failed again last night, this time due to a failure in HierarchicalQuorumTest.

          Flavio, can you take a look? If it's the same issue then we're good; otherwise please open another JIRA. We really
          need to fix these ASAP (to get CI and the patch process up and running again):

          http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/380/testReport/org.apache.zookeeper.test/HierarchicalQuorumTest/testHierarchicalQuorum/

          fpj Flavio Junqueira added a comment -

          Patch so far.

          fpj Flavio Junqueira added a comment -

          Another rough patch. It does not make any changes to QuorumCnxManager, but it adds one case to FastLeaderElection (FLE).

          mahadev Mahadev konar added a comment -

          Given that ZOOKEEPER-479, ZOOKEEPER-480, and ZOOKEEPER-481 have been fixed, this should be fixed as well.


            People

            • Assignee:
              fpj Flavio Junqueira
              Reporter:
              mahadev Mahadev konar