[NIFI-8204] When Cluster Coordinator dies suddenly, it is possible for Component Revisions to be inconsistent across nodes in cluster - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.13.0
Component/s: Core Framework
Labels: None

Description

I encountered a scenario in a 2-node cluster where Node 0 was the Cluster Coordinator. It suddenly died and was restarted by the RunNiFi process. The restart occurred more quickly than the ZooKeeper session timeout. Once the node had rejoined the cluster, I started to see errors when attempting to modify a component, stating that "Node xyz is unable to fulfill this request due to [0, null, <uuid>] is not the most up-to-date revision. This component appears to have been modified."

Refreshing the browser did not help. This indicates that nodes in the cluster have different component revisions.

After looking through logs, here is the series of events that led to this situation:

Node 0 restarts but is still Cluster Coordinator. Its topology shows all nodes disconnected and all revisions empty.
Node 1 heartbeats to Node 0. Node 0 responds saying: Your cluster topology is wrong. node-1 should be DISCONNECTED due to Has Not Yet Connected.
Node 1 updates its topology as directed.
Node 1 becomes Cluster Coordinator because Node 0 hasn't yet connected and Node 0's ZooKeeper session times out.
Node 1 receives a heartbeat from itself.
Node 1 determines that it hasn't yet connected (based on the topology received from Node 0), so it issues a reconnection request.
Node 1 changes the state of Node 1 from DISCONNECTED to CONNECTING and notifies Node 0 of the topology update.
Node 1 relinquishes its role as Cluster Coordinator.
Node 1 requests (to itself) to join the cluster.
Node 1 receives a ConnectionResponse (from itself) that includes a collection of 79 revisions.
Node 0 finishes startup. It has an empty set of revisions.
Node 0 becomes Cluster Coordinator.
Node 1 sends a heartbeat to Node 0.
Node 0 marks Node 1 as Connected to Cluster.

We should address this by keeping track of the number of updates to the Revision Manager and sending this in Heartbeat messages. When the Cluster Coordinator receives a heartbeat, it should compare the update count to its own internal update count. If the heartbeat's update count is higher, it should request that the sending node reconnect to the cluster. This will ensure that if this situation were to arise again, the node would reconnect and get the most up-to-date set of revisions.
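The proposed check can be illustrated with a minimal Java sketch. It is not the code from the linked pull request: the names `RevisionTracker`, `Heartbeat`, `Action`, and `evaluate` are hypothetical and exist only for this example, and the update counts are illustrative. It shows only the comparison described above, where a heartbeat carrying a higher update count than the coordinator's own triggers a reconnect request.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; all type and method names here are hypothetical,
// not NiFi's actual classes.
final class RevisionUpdateCountSketch {

    /** Stand-in for a node's Revision Manager that counts revision updates. */
    static final class RevisionTracker {
        private final AtomicLong updateCount = new AtomicLong();

        void recordRevisionUpdate() {
            updateCount.incrementAndGet();
        }

        long getUpdateCount() {
            return updateCount.get();
        }
    }

    /** Stand-in for a heartbeat message that carries the sender's update count. */
    static final class Heartbeat {
        final String nodeId;
        final long revisionUpdateCount;

        Heartbeat(String nodeId, long revisionUpdateCount) {
            this.nodeId = nodeId;
            this.revisionUpdateCount = revisionUpdateCount;
        }
    }

    enum Action { ACCEPT, REQUEST_RECONNECT }

    /**
     * Coordinator-side check: if the sender has seen more revision updates
     * than the coordinator, the coordinator asks the sender to reconnect so
     * that both end up with the same set of revisions.
     */
    static Action evaluate(Heartbeat heartbeat, long coordinatorUpdateCount) {
        return heartbeat.revisionUpdateCount > coordinatorUpdateCount
                ? Action.REQUEST_RECONNECT
                : Action.ACCEPT;
    }

    public static void main(String[] args) {
        // Node 0 has just restarted as coordinator and has seen no updates.
        RevisionTracker coordinator = new RevisionTracker();

        // Node 1 kept running and has seen several updates (count is illustrative).
        RevisionTracker node1 = new RevisionTracker();
        for (int i = 0; i < 5; i++) {
            node1.recordRevisionUpdate();
        }

        Heartbeat heartbeat = new Heartbeat("node-1", node1.getUpdateCount());
        System.out.println(evaluate(heartbeat, coordinator.getUpdateCount()));
    }
}
```

In the scenario described above, the restarted Node 0 would hold a lower count than Node 1, so Node 1's next heartbeat would cause Node 0 to request a reconnect instead of marking Node 1 as connected with mismatched revisions.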

Issue Links

links to

GitHub Pull Request #4806

People

Assignee: Mark Payne

Reporter: Mark Payne

Votes: 0

Watchers: 2

Dates

Created: 05/Feb/21 15:50

Updated: 05/Feb/21 20:21

Resolved: 05/Feb/21 20:21