Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 2.5.0, 5.2.0
- None
- None
- Environment: Production
Description
While using the Curator LeaderLatch recipe in our application, we observed a potential issue in which two clients become leader at the same time (double leadership).
Summary of the description below:
- Our use case
- Issue details
- Timeline of events
- Attached test code to reproduce the reported issue
- A possible solution, on which we would like your suggestions
Our use case:
- Two clients compete for leadership using the Curator LeaderLatch recipe. A client becomes leader when LeaderLatchListener.isLeader() is called and gives up leadership when LeaderLatchListener.notLeader() is called.
Issue details:
- When one of the clients receives two Curator RECONNECTED connection events in quick succession, the LeaderLatch EventThreads handling them interleave, so that the latch node deletion happens after the client has become leader. The client therefore remains leader even though its latch node has been deleted.
- The other client, which is also trying to acquire leadership, creates its latch node, sees it at the first index, and becomes leader as well.
- At this point, both clients are leader.
Timeline of events:
- Timeline of events for Client A, whose latch node is deleted while it still considers itself leader
- At t1, 1st RECONNECTED event fired
- At t2, [ EventThread of 1st RECONNECTED event ] Resets leadership (true -> false)
- At t3, [ EventThread of 1st RECONNECTED event ] Fires “listener.notLeader()”
- At t4, [ EventThread of 1st RECONNECTED event ] Deletes latch node
- At t5, [ EventThread of 1st RECONNECTED event ] Creates new latch node
- At t6, 2nd RECONNECTED event fired
- At t7, [ EventThread of 2nd RECONNECTED event ] Resets leadership (false -> false), effectively a no-op
- At t8, [ EventThread of 2nd RECONNECTED event ] Fires nothing, again a no-op
- At t9, [ EventThread of 1st RECONNECTED event ] Gets children -> sorts them -> checks leadership -> sets leadership to true -> fires the “Has become a leader” leader listener event
- At t10, [ EventThread of 2nd RECONNECTED event ] Deletes the latch node (the very node with which Client A became leader in the previous step)
- Timeline of events for Client B, which also becomes leader
- At t11, Client B creates its latch node -> gets children -> sorts them -> checks leadership -> sets leadership to true -> fires the “Has become a leader” leader listener event
This leaves both Client A and Client B as leader at the same time.
As the timeline shows, over the period (t8 -> t10) Client A’s LeaderLatch EventThreads interleave, so the latch node is deleted while the local state still says the client is leader.
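The timeline can be replayed step by step against a small in-memory model. The sketch below is only an illustration and does not use Curator or ZooKeeper: `FakeLatchDir`, `Client`, and the node naming are hypothetical stand-ins, and the two event threads are replayed sequentially in the order t2 to t11 rather than run as real threads.

```java
import java.util.TreeSet;

class DoubleLeaderReplay {
    /** Hypothetical stand-in for the latch parent node: sorted children, first one leads. */
    static final class FakeLatchDir {
        private final TreeSet<String> children = new TreeSet<>();
        private int seq = 0;

        String create(String clientId) {
            // Zero-padded sequence number so lexicographic order equals creation order.
            String name = String.format("latch-%04d-%s", seq++, clientId);
            children.add(name);
            return name;
        }

        void delete(String name) {
            if (name != null) {
                children.remove(name);
            }
        }

        boolean isFirst(String name) {
            return !children.isEmpty() && children.first().equals(name);
        }
    }

    /** Hypothetical per-client state: local leadership flag and current latch node. */
    static final class Client {
        boolean hasLeadership;
        String ourPath;
    }

    /** Replays t2..t11 and returns { A is leader, B is leader }. */
    static boolean[] replay() {
        FakeLatchDir dir = new FakeLatchDir();
        Client a = new Client();
        Client b = new Client();

        // Starting point (assumed): A holds the only latch node and is leader.
        a.ourPath = dir.create("A");
        a.hasLeadership = true;

        // t2, t3: 1st event thread resets leadership (true -> false); notLeader() fires.
        a.hasLeadership = false;
        // t4: 1st event thread deletes the old latch node.
        dir.delete(a.ourPath);
        // t5: 1st event thread creates a new latch node.
        a.ourPath = dir.create("A");
        // t7, t8: 2nd event thread resets leadership (false -> false), a no-op.
        a.hasLeadership = false;
        // t9: 1st event thread lists children, finds its node first, becomes leader.
        a.hasLeadership = dir.isFirst(a.ourPath);
        // t10: 2nd event thread deletes the node A just became leader with.
        dir.delete(a.ourPath);
        // t11: B creates its node, finds it first, becomes leader.
        b.ourPath = dir.create("B");
        b.hasLeadership = dir.isFirst(b.ourPath);

        return new boolean[] { a.hasLeadership, b.hasLeadership };
    }

    public static void main(String[] args) {
        boolean[] result = replay();
        System.out.println("Client A leader: " + result[0]);
        System.out.println("Client B leader: " + result[1]);
    }
}
```

In this model both flags end up `true`, which mirrors the reported state: Client A's local flag was set at t9 and nothing clears it after the deletion at t10.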
Reproducing the issue:
- We wrote a JUnit test case that fires artificial Curator RECONNECTED connection events and uses CountDownLatches to force the LeaderLatch EventThreads to interleave
- Test simulator for 2.5.0:
- Test Simulator for latest Curator version:
Possible Solution (where we would like to hear your thoughts/suggestions):
- During reset(), the current Curator code calls
- setLeadership(false) first, followed by
- setNode(null) (i.e. deleting its latch node)
- Swapping these two calls should resolve the issue, because leadership would then be set to false only after the latch node has been deleted:
- setNode(null) (i.e. deleting its latch node) first, followed by
- setLeadership(false)
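To make the effect of the ordering concrete, here is a minimal sketch that models only the two reset steps of the second event thread and slots the first thread's leadership check in at every possible position. It is an assumption-laden model, not Curator code: `ResetOrderSketch`, `leaderWithoutNode`, and the single-contender rule "leader if our node exists" are hypothetical simplifications, and the real LeaderLatch would call reset() again when its node is missing.

```java
import java.util.HashSet;
import java.util.Set;

class ResetOrderSketch {
    boolean hasLeadership;
    String ourPath;
    // Hypothetical stand-in for the children of the latch path.
    final Set<String> nodes = new HashSet<>();

    /**
     * Runs the second thread's two reset steps with the first thread's leadership
     * check inserted at position checkAt (0 = before both steps, 1 = between them,
     * 2 = after both). Returns true if the client ends up leader without a node.
     */
    static boolean leaderWithoutNode(boolean deleteFirst, int checkAt) {
        ResetOrderSketch c = new ResetOrderSketch();
        // The first thread has already created its new latch node.
        c.ourPath = "latch-1";
        c.nodes.add(c.ourPath);

        Runnable deleteNode = () -> c.nodes.remove(c.ourPath); // models setNode(null)
        Runnable clearFlag = () -> c.hasLeadership = false;    // models setLeadership(false)
        Runnable[] reset = deleteFirst
                ? new Runnable[] { deleteNode, clearFlag }      // proposed order
                : new Runnable[] { clearFlag, deleteNode };     // current order

        for (int step = 0; step <= 2; step++) {
            if (step == checkAt) {
                // First thread's check: as sole contender, leader iff our node exists.
                c.hasLeadership = c.nodes.contains(c.ourPath);
            }
            if (step < 2) {
                reset[step].run();
            }
        }
        return c.hasLeadership && !c.nodes.contains(c.ourPath);
    }

    public static void main(String[] args) {
        for (int checkAt = 0; checkAt <= 2; checkAt++) {
            System.out.println("checkAt=" + checkAt
                    + " current order: " + leaderWithoutNode(false, checkAt)
                    + ", swapped order: " + leaderWithoutNode(true, checkAt));
        }
    }
}
```

In this model the current order ends in "leader without node" only when the check lands between the two steps (the t7, t9, t10 interleaving above), while the swapped order never does. That supports the proposal for this particular interleaving only; it says nothing about other races or about how the issue was eventually fixed.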
Issue Links
- is fixed by: CURATOR-696 Double leader for LeaderLatch (Resolved)
- is related to: CURATOR-604 Double locking issue while using InterProcessMutex lock code (Open)