Great catch! (I know it was Hudson that flagged it, but it is good that you noticed.)
The short version of the story is that the synchronization is not correct in QuorumCnxManager.
The longer version is like this. From the traces, I can see the following sequence of messages:
- Replica 1 sends a message to itself and to Replica 2 stating that its current vote is for replica 1;
- Replica 2 sends a message to itself and to Replica 1 stating that its current vote is for replica 2;
- Replica 1 updates its vote, and sends a message to itself stating that its current vote is for replica 2;
- Since replica 1 has two votes for 2 in an ensemble of 3 replicas, replica 1 decides to follow 2.
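To make the vote counting in this sequence concrete, here is a minimal sketch of what each replica sees. It is not the election code itself: the class and helper names are hypothetical, the tally is simplified to "latest vote per sender", and it assumes replica 2 did receive replica 1's first message.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from ZooKeeper.
final class VoteTallySketch {
    static final int ENSEMBLE_SIZE = 3;

    // True if `candidate` is the latest vote of a strict majority of the ensemble.
    static boolean hasQuorum(Map<Integer, Integer> latestVoteBySender, int candidate) {
        int count = 0;
        for (int vote : latestVoteBySender.values()) {
            if (vote == candidate) {
                count++;
            }
        }
        return count > ENSEMBLE_SIZE / 2;
    }

    // Votes as seen by replica 1 (key = sender, value = replica voted for).
    static Map<Integer, Integer> viewOfReplica1() {
        Map<Integer, Integer> votes = new HashMap<>();
        votes.put(1, 1); // its own initial vote
        votes.put(2, 2); // vote received from replica 2
        votes.put(1, 2); // its updated vote, delivered to itself
        return votes;
    }

    // Votes as seen by replica 2: the updated vote from replica 1 never arrives.
    static Map<Integer, Integer> viewOfReplica2() {
        Map<Integer, Integer> votes = new HashMap<>();
        votes.put(2, 2); // its own vote
        votes.put(1, 1); // replica 1's initial vote (assumed delivered)
        return votes;
    }

    public static void main(String[] args) {
        System.out.println("replica 1 sees a quorum for 2: " + hasQuorum(viewOfReplica1(), 2));
        System.out.println("replica 2 sees a quorum for 2: " + hasQuorum(viewOfReplica2(), 2));
    }
}
```

Under these assumptions replica 1 counts two votes for 2 and follows it, while replica 2 still counts only its own vote.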
The problem is that replica 2 never receives the message from 1 stating that it changed its vote to 2, which prevents 2 from becoming leader. Looking more carefully at why that happened, you can see that when 1 tries to send a message to 2, QuorumCnxManager on 1 is shutting down a connection to 2 at the same time that it is trying to open a new one. The incorrect synchronization prevents the creation of the new connection, and 1 and 2 end up not connected.
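The traces do not spell out the exact interleaving, so the following is only an illustration of how a shutdown racing with a connect can leave two peers without a link. It replays one plausible check-then-act ordering sequentially so that the outcome is deterministic. All names are hypothetical and none of this is taken from QuorumCnxManager.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-peer connection table; not ZooKeeper code.
final class ConnectionRaceSketch {
    private final Map<Integer, String> links = new HashMap<>();

    // The "check" half of a connect: a new link is needed only if none is registered.
    boolean needsConnect(int peer) {
        return !links.containsKey(peer);
    }

    void open(int peer) {
        links.put(peer, "link-to-" + peer);
    }

    void close(int peer) {
        links.remove(peer);
    }

    boolean isConnected(int peer) {
        return links.containsKey(peer);
    }

    // Bad ordering: the sender checks while the old link is still registered,
    // then the shutdown path removes it. Nobody opens a replacement.
    static boolean replayBadInterleaving() {
        ConnectionRaceSketch table = new ConnectionRaceSketch();
        table.open(2);                              // old connection to replica 2
        boolean mustOpen = table.needsConnect(2);   // sender: link exists, skip opening
        table.close(2);                             // shutdown: old link removed
        if (mustOpen) {
            table.open(2);                          // not reached in this ordering
        }
        return table.isConnected(2);
    }

    // Ordering in which the shutdown completes before the sender's check.
    static boolean replaySerializedOrder() {
        ConnectionRaceSketch table = new ConnectionRaceSketch();
        table.open(2);                              // old connection to replica 2
        table.close(2);                             // shutdown finishes first
        if (table.needsConnect(2)) {                // sender: no link, so open one
            table.open(2);
        }
        return table.isConnected(2);
    }

    public static void main(String[] args) {
        System.out.println("connected after bad interleaving: " + replayBadInterleaving());
        System.out.println("connected after serialized order: " + replaySerializedOrder());
    }
}
```

In the first ordering the table ends up with no link to replica 2, which matches the symptom described here; in the second the sender opens a fresh link.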