ZooKeeper / ZOOKEEPER-475

FLENewEpochTest failed on nightly builds.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.1, 3.3.0
    • Component/s: quorum
    • Labels:
      None

      Description

      FLENewEpochTest failed on one of the nightly builds:
      http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/377.

      1. ZOOKEEPER-475.patch
        3 kB
        Flavio Junqueira
      2. ZOOKEEPER-475.patch
        11 kB
        Flavio Junqueira

          Activity

          Mahadev konar added a comment -

          Flavio, can you take a look at it?
          Thanks.

          Flavio Junqueira added a comment -

          Great catch! (I know it was Hudson, but it was good that you saw it.)

          The short version of the story is that the synchronization is not correct in QuorumCnxManager.

          The longer version is like this. From the traces, I can see the following sequence of messages:

          • Replica 1 sends a message to itself and to replica 2 stating that its current vote is for replica 1;
          • Replica 2 sends a message to itself and to replica 1 stating that its current vote is for replica 2;
          • Replica 1 updates its vote, and sends a message to itself stating that its current vote is for replica 2;
          • Since replica 1 has two votes for replica 2 in an ensemble of 3 replicas, replica 1 decides to follow replica 2.
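The sequence above can be sketched as a simple majority count. This is only an illustration: `VoteTally`, `hasQuorum`, `replica1View`, and `replica2View` are hypothetical names, not ZooKeeper's election code, and the sketch assumes replica 2 still holds replica 1's first vote (for replica 1).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of ZooKeeper.
final class VoteTally {

    /** True if strictly more than half of the ensemble is known to vote for the candidate. */
    static boolean hasQuorum(Map<Integer, Integer> votesBySender, int candidate, int ensembleSize) {
        int count = 0;
        for (int vote : votesBySender.values()) {
            if (vote == candidate) {
                count++;
            }
        }
        return count > ensembleSize / 2;
    }

    /** Votes as seen by replica 1: its own updated vote and replica 2's vote. */
    static Map<Integer, Integer> replica1View() {
        Map<Integer, Integer> votes = new HashMap<>();
        votes.put(1, 2); // replica 1 changed its vote to 2
        votes.put(2, 2); // replica 2 votes for itself
        return votes;
    }

    /** Votes as seen by replica 2: it never learns that replica 1 changed its vote. */
    static Map<Integer, Integer> replica2View() {
        Map<Integer, Integer> votes = new HashMap<>();
        votes.put(1, 1); // stale: replica 1's first vote (assumption)
        votes.put(2, 2); // replica 2 votes for itself
        return votes;
    }
}
```

Under these assumptions replica 1 counts two of three votes for replica 2 and follows it, while replica 2 counts only its own vote and cannot conclude that it has been elected.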

          The problem is that replica 2 never receives the message from replica 1 announcing that it changed its vote to 2, which prevents replica 2 from becoming leader. Looking more carefully at why that happened: when replica 1 tries to send a message to replica 2, the QuorumCnxManager in replica 1 is shutting down a connection to replica 2 at the same time that it is trying to open a new one. The incorrect synchronization prevents the new connection from being created, so replicas 1 and 2 end up not connected.
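One way to keep a concurrent shutdown from blocking or undoing a new connection is to make both the replacement and the removal atomic per peer. The sketch below is an assumption-laden illustration only: `PeerConnections` and `Conn` are hypothetical names, this is not the QuorumCnxManager code, and it is not taken from either attached patch (the issue does not say how the fix was made).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical connection table for illustration; not ZooKeeper's QuorumCnxManager.
final class PeerConnections {

    /** Stand-in for a socket to a peer; only tracks whether it is open. */
    static final class Conn {
        final long peerId;
        private volatile boolean open = true;

        Conn(long peerId) { this.peerId = peerId; }
        boolean isOpen() { return open; }
        void close() { open = false; }
    }

    private final ConcurrentMap<Long, Conn> conns = new ConcurrentHashMap<>();

    /** Returns a live connection to the peer, atomically replacing a closed or missing one. */
    Conn connect(long peerId) {
        return conns.compute(peerId, (id, existing) ->
                (existing != null && existing.isOpen()) ? existing : new Conn(id));
    }

    /** Closes the given connection and removes it only if it is still the registered one. */
    void shutdown(Conn conn) {
        conn.close();
        conns.remove(conn.peerId, conn); // no-op if a newer connection has replaced it
    }

    /** Currently registered connection for the peer, or null if there is none. */
    Conn current(long peerId) {
        return conns.get(peerId);
    }
}
```

With this shape, a shutdown of the old connection that completes late cannot remove a connection opened in the meantime, because `remove(key, value)` only succeeds when the map still holds that exact object.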

          Patrick Hunt added a comment -

          The nightly build failed again last night, this time due to a failure in HierarchicalQuorumTest.

          Flavio, can you take a look? If it's the same issue then we're good; otherwise, please open another JIRA. We really
          need to fix these ASAP (to get CI and the patch process up and running again):

          http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/380/testReport/org.apache.zookeeper.test/HierarchicalQuorumTest/testHierarchicalQuorum/

          Flavio Junqueira added a comment -

          Patch so far.

          Flavio Junqueira added a comment -

          Another rough patch. It does not make any changes to QuorumCnxManager, but it adds one case to FastLeaderElection (FLE).

          Mahadev konar added a comment -

          Given that ZOOKEEPER-479, ZOOKEEPER-480, and ZOOKEEPER-481 have been fixed, this should be fixed as well.


            People

            • Assignee: Flavio Junqueira
            • Reporter: Mahadev konar