Apache Curator / CURATOR-645

LeaderLatch generates infinite loop with two LeaderLatch instances competing for the leadership


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.2.0
    • Fix Version/s: 5.4.0
    • Component/s: Recipes
    • Labels: None

    Description

      We experienced strange behavior of the LeaderLatch in a test case in Apache Flink (see FLINK-28078), where two LeaderLatch instances compete for the leadership, resulting in an infinite loop.

      The test in question is ZooKeeperMultipleComponentLeaderElectionDriverTest::testLeaderElectionWithMultipleDrivers. It uses three instances of a wrapper class that each hold a LeaderLatch as a member. In the test, the first LeaderLatch acquires the leadership, which results in that LeaderLatch being closed and, as a consequence, losing the leadership. The odd thing is that the two left-over LeaderLatch instances then end up in an infinite loop, as shown in the ZooKeeper server logs:

      16:17:07,864 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
      16:17:07,864 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
      16:17:07,866 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
      16:17:07,866 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
      16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
      16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
      16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001
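
      To make the setup concrete, here is a minimal reproduction sketch along the lines of the test described above; the class name, connection string, and listener wiring are assumptions for illustration, not the actual Flink test code:

      import org.apache.curator.framework.CuratorFramework;
      import org.apache.curator.framework.CuratorFrameworkFactory;
      import org.apache.curator.framework.recipes.leader.LeaderLatch;
      import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
      import org.apache.curator.retry.ExponentialBackoffRetry;

      public class LeaderLatchRaceRepro {
          public static void main(String[] args) throws Exception {
              CuratorFramework client = CuratorFrameworkFactory.newClient(
                      "localhost:2181", new ExponentialBackoffRetry(1000, 3));
              client.start();

              // Three competing latches on the same path, mirroring the three
              // wrapper instances in the test.
              for (int i = 0; i < 3; i++) {
                  LeaderLatch latch = new LeaderLatch(client, "/flink/default/latch", "contender-" + i);
                  latch.addListener(new LeaderLatchListener() {
                      @Override
                      public void isLeader() {
                          try {
                              // The test closes the latch that acquired the leadership,
                              // which deletes its latch ZNode again.
                              latch.close();
                          } catch (Exception e) {
                              throw new RuntimeException(e);
                          }
                      }

                      @Override
                      public void notLeader() {
                          // no-op
                      }
                  });
                  latch.start();
              }

              // After the first latch closes itself, the two remaining latches were
              // observed to loop on getChildren2/delete/create2 as in the logs above.
              Thread.sleep(60_000);
              client.close();
          }
      }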
      

      It looks like there is a race condition between the close call of the LeaderLatch that held the initial leadership (which deletes the corresponding ZNode) and the watcher that triggers reset() for the left-over LeaderLatch instances instead of retrieving the left-over children:

      1. The reset() triggers getChildren through LeaderLatch#getChildren after a new child is created (I would expect the create2 entry to appear in the logs before the getChildren entry, which is not the case; so I might be wrong in my observation).
      2. The callback of getChildren triggers checkLeadership.
      3. In the meantime, the predecessor gets deleted (I'd assume because of the deterministic ordering of events in ZooKeeper). This causes the callback in checkLeadership to fail with a NONODE event, which triggers a reset of the current LeaderLatch instance; that reset in turn deletes the current LeaderLatch's child ZNode, and the deletion is executed on the server later on (see the sketch below).
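
      The following is a heavily simplified, hypothetical model of that cycle. It borrows the method names reset() and checkLeadership() mentioned above, but it is not the actual Curator source, and all ZooKeeper-facing helpers are made-up abstractions:

      import java.util.List;

      /**
       * Hypothetical, heavily simplified model of the suspected cycle; not the
       * actual LeaderLatch implementation (which uses background callbacks and
       * watchers). Only the control flow matching the server-log pattern
       * (delete, create2, getChildren2, getData) is modeled.
       */
      abstract class LatchCycleModel {

          // Assumed ZooKeeper-facing helpers, not real Curator API.
          abstract void deleteOwnNode() throws Exception;                    // "delete" in the logs
          abstract String createLatchNode() throws Exception;                // "create2" in the logs; returns the node name
          abstract List<String> getSortedChildren() throws Exception;        // "getChildren2" in the logs
          abstract boolean predecessorExists(String node) throws Exception;  // "getData"; false when it fails with NONODE
          abstract void setLeadership(boolean hasLeadership);

          void reset() throws Exception {
              setLeadership(false);
              deleteOwnNode();
              String ourNode = createLatchNode();
              checkLeadership(ourNode, getSortedChildren());
          }

          void checkLeadership(String ourNode, List<String> children) throws Exception {
              int ourIndex = children.indexOf(ourNode);
              if (ourIndex < 0) {
                  // Our own node is missing; start over.
                  reset();
                  return;
              }
              if (ourIndex == 0) {
                  setLeadership(true);
                  return;
              }
              String predecessor = children.get(ourIndex - 1);
              if (!predecessorExists(predecessor)) {
                  // The predecessor was deleted between getChildren and this check
                  // (NONODE). Instead of re-reading the children, the latch resets
                  // itself, deleting and recreating its own node. When the two
                  // remaining latches keep doing this to each other, the
                  // delete/create2/getChildren2 pattern repeats indefinitely.
                  reset();
              }
          }
      }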


            People

              Assignee: Zili Chen (tison)
              Reporter: Matthias Pohl (mapohl)
              Votes: 0
              Watchers: 3
