Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-2778

Potential server deadlock between follower sync with leader and follower receiving external connection requests.

    Details

      Description

      It's possible to have a deadlock during recovery phase.
      Found this issue by analyzing thread dumps of "flaky" ReconfigRecoveryTest [1]. . Here is a sample thread dump that illustrates the state of the execution:

          [junit]  java.lang.Thread.State: BLOCKED
          [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
          [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
          [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
          [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
          [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
          [junit] 
      
      
          [junit]  java.lang.Thread.State: BLOCKED
          [junit]         at  org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
          [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
          [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
          [junit]         at  org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
          [junit]         at  org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
          [junit]         at  org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
      

      The dead lock happens between the quorum peer thread which running the follower that doing sync with leader work, and the listener of the qcm of the same quorum peer that doing the receiving connection work. Basically to finish sync with leader, the follower needs to synchronize on both QV_LOCK and the qmc object it owns; while in the receiver thread to finish setup an incoming connection the thread needs to synchronize on both the qcm object the quorum peer owns, and the same QV_LOCK. It's easy to see the problem here is the order of acquiring two locks are different, thus depends on timing / actual execution order, two threads might end up acquiring one lock while holding another.

      [1] org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                mkedwards Michael K. Edwards
                Reporter:
                hanm Michael Han
              • Votes:
                0 Vote for this issue
                Watchers:
                8 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 8h 50m
                  8h 50m