Apache Curator / CURATOR-645

LeaderLatch generates infinite loop with two LeaderLatch instances competing for the leadership


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.2.0
    • Fix Version/s: 5.4.0
    • Component/s: Recipes
    • Labels: None

    Description

      We experienced strange behavior of the LeaderLatch in a test case in Apache Flink (see FLINK-28078), where two LeaderLatch instances compete for the leadership, resulting in an infinite loop.

      The test in question is ZooKeeperMultipleComponentLeaderElectionDriverTest::testLeaderElectionWithMultipleDrivers. It uses three instances of a wrapper class that each hold a LeaderLatch as a member. In the test, the first LeaderLatch acquires the leadership, which results in that LeaderLatch being closed and, as a consequence, losing the leadership. The odd thing is that the two left-over LeaderLatch instances then end up in an infinite loop, as shown in the ZooKeeper server logs:

      16:17:07,864 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
      16:17:07,864 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:getChildren2 cxid:0x21 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch
      16:17:07,866 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
      16:17:07,866 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:delete cxid:0x22 zxid:0xc txntype:2 reqpath:n/a
      16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
      16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x100cf6d9cf60000 type:create2 cxid:0x23 zxid:0xd txntype:15 reqpath:n/a
      16:17:07,869 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x100cf6d9cf60000 type:getData cxid:0x24 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/latch/_c_6eb174e9-bb77-4a73-9604-531242c11c0e-latch-0000000001
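
      To make the setup concrete, here is a minimal reproduction sketch along the lines of the test described above; the class name, connection string, and listener wiring are assumptions for illustration, not the actual Flink test code:

      import org.apache.curator.framework.CuratorFramework;
      import org.apache.curator.framework.CuratorFrameworkFactory;
      import org.apache.curator.framework.recipes.leader.LeaderLatch;
      import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
      import org.apache.curator.retry.ExponentialBackoffRetry;

      public class LeaderLatchRaceRepro {
          public static void main(String[] args) throws Exception {
              CuratorFramework client = CuratorFrameworkFactory.newClient(
                      "localhost:2181", new ExponentialBackoffRetry(1000, 3));
              client.start();

              // Three competing latches on the same path, mirroring the three
              // wrapper instances in the test.
              for (int i = 0; i < 3; i++) {
                  LeaderLatch latch = new LeaderLatch(client, "/flink/default/latch", "contender-" + i);
                  latch.addListener(new LeaderLatchListener() {
                      @Override
                      public void isLeader() {
                          try {
                              // The test closes the latch that acquired the leadership,
                              // which deletes its latch ZNode again.
                              latch.close();
                          } catch (Exception e) {
                              throw new RuntimeException(e);
                          }
                      }

                      @Override
                      public void notLeader() {
                          // no-op
                      }
                  });
                  latch.start();
              }

              // After the first latch closes itself, the two remaining latches were
              // observed to loop on getChildren2/delete/create2 as in the logs above.
              Thread.sleep(60_000);
              client.close();
          }
      }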
      

      It looks like there is a race condition between the close call of the LeaderLatch that held the initial leadership (which deletes the corresponding ZNode) and the watcher that triggers reset() for the left-over LeaderLatch instances instead of retrieving the left-over children:

      1. The reset() triggers getChildren through LeaderLatch#getChildren after a new child is created (I would expect the create2 entry to appear in the logs before the getChildren entry, which is not the case; so I might be wrong in my observation).
      2. The callback of getChildren triggers checkLeadership.
      3. In the meantime, the predecessor gets deleted (I'd assume because of the deterministic ordering of events in ZooKeeper). This causes the callback in checkLeadership to fail with a NONODE event, which triggers a reset of the current LeaderLatch instance; that reset in turn deletes the current LeaderLatch's child ZNode, and the deletion is executed on the server later on (see the sketch below).
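
      The following is a heavily simplified, hypothetical model of that cycle. It borrows the method names reset() and checkLeadership() mentioned above, but it is not the actual Curator source, and all ZooKeeper-facing helpers are made-up abstractions:

      import java.util.List;

      /**
       * Hypothetical, heavily simplified model of the suspected cycle; not the
       * actual LeaderLatch implementation (which uses background callbacks and
       * watchers). Only the control flow matching the server-log pattern
       * (delete, create2, getChildren2, getData) is modeled.
       */
      abstract class LatchCycleModel {

          // Assumed ZooKeeper-facing helpers, not real Curator API.
          abstract void deleteOwnNode() throws Exception;                    // "delete" in the logs
          abstract String createLatchNode() throws Exception;                // "create2" in the logs; returns the node name
          abstract List<String> getSortedChildren() throws Exception;        // "getChildren2" in the logs
          abstract boolean predecessorExists(String node) throws Exception;  // "getData"; false when it fails with NONODE
          abstract void setLeadership(boolean hasLeadership);

          void reset() throws Exception {
              setLeadership(false);
              deleteOwnNode();
              String ourNode = createLatchNode();
              checkLeadership(ourNode, getSortedChildren());
          }

          void checkLeadership(String ourNode, List<String> children) throws Exception {
              int ourIndex = children.indexOf(ourNode);
              if (ourIndex < 0) {
                  // Our own node is missing; start over.
                  reset();
                  return;
              }
              if (ourIndex == 0) {
                  setLeadership(true);
                  return;
              }
              String predecessor = children.get(ourIndex - 1);
              if (!predecessorExists(predecessor)) {
                  // The predecessor was deleted between getChildren and this check
                  // (NONODE). Instead of re-reading the children, the latch resets
                  // itself, deleting and recreating its own node. When the two
                  // remaining latches keep doing this to each other, the
                  // delete/create2/getChildren2 pattern repeats indefinitely.
                  reset();
              }
          }
      }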


            People

              Assignee: Zili Chen (tison)
              Reporter: Matthias Pohl (mapohl)
              Votes: 0
              Watchers: 3
