  Apache NiFi / NIFI-8204

When the Cluster Coordinator dies suddenly, it is possible for Component Revisions to be inconsistent across nodes in the cluster


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.13.0
    • Component/s: Core Framework
    • Labels: None

    Description

      I encountered a scenario in a 2-node cluster where Node 0 was the Cluster Coordinator. It died suddenly and was restarted by the RunNiFi process. The restart happened before its ZooKeeper session timed out. Once the node had rejoined the cluster, attempts to modify a component started to fail with errors such as "Node xyz is unable to fulfill this request due to [0, null, <uuid>] is not the most up-to-date revision. This component appears to have been modified."

      Refreshing the browser did not help. This indicates that nodes in the cluster have different component revisions.
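
To illustrate why the request keeps failing, here is a minimal sketch of per-node revision checking. It is a simplified model, not NiFi's implementation: the names `Revision`, `NodeRevisions`, `StaleRevisionError`, and `replicate` are hypothetical, and it assumes that a request is accepted only when the submitted version equals the node's current version, with unknown components treated as version 0.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Revision:
    """Mirrors the [version, clientId, componentId] triple in the error message."""
    version: int
    client_id: Optional[str]
    component_id: str


class StaleRevisionError(Exception):
    """Raised when a submitted revision does not match the node's current one."""


class NodeRevisions:
    """Hypothetical per-node table of component revisions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.current: Dict[str, int] = {}  # component_id -> version

    def verify(self, submitted: Revision) -> None:
        # Assumption: a component with no recorded revision is at version 0.
        expected = self.current.get(submitted.component_id, 0)
        if submitted.version != expected:
            raise StaleRevisionError(
                f"{self.name}: {submitted} is not the most up-to-date revision"
            )


def replicate(nodes: List[NodeRevisions], submitted: Revision) -> List[str]:
    """Send the same request to every node; return the names of nodes that reject it."""
    rejected = []
    for node in nodes:
        try:
            node.verify(submitted)
        except StaleRevisionError:
            rejected.append(node.name)
    return rejected


node0 = NodeRevisions("node-0")        # restarted: revisions are empty
node1 = NodeRevisions("node-1")        # kept its revisions
node1.current["processor-a"] = 3

rejected_v0 = replicate([node0, node1], Revision(0, None, "processor-a"))
rejected_v3 = replicate([node0, node1], Revision(3, None, "processor-a"))
```

Under these assumptions, whichever version the client submits, one of the two nodes rejects it, which is consistent with a browser refresh not resolving the error.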

      After looking through logs, here is the series of events that led to this situation:

       
      1. Node 0 restarts but is still Cluster Coordinator. Its topology shows all nodes as disconnected, and its set of revisions is empty.
      2. Node 1 heartbeats to Node 0. Node 0 responds: your cluster topology is wrong; node-1 should be DISCONNECTED due to "Has Not Yet Connected".
      3. Node 1 updates its topology as directed.
      4. Node 1 becomes Cluster Coordinator because Node 0 has not yet connected and Node 0's ZooKeeper session times out.
      5. Node 1 receives a heartbeat from itself.
      6. Node 1 determines that it has not yet connected (based on the topology received from Node 0), so it issues a reconnection request.
      7. Node 1 changes the state of Node 1 from DISCONNECTED to CONNECTING and notifies Node 0 of the topology update.
      8. Node 1 relinquishes its role as Cluster Coordinator.
      9. Node 1 requests (to itself) to join the cluster.
      10. Node 1 receives a ConnectionResponse (from itself) that includes a collection of 79 revisions.
      11. Node 0 finishes startup. It has an empty set of revisions.
      12. Node 0 becomes Cluster Coordinator.
      13. Node 1 sends a heartbeat to Node 0.
      14. Node 0 marks Node 1 as Connected to the cluster.
       
      We should address this by keeping track of the number of updates to the Revision Manager and sending this in Heartbeat messages. When the Cluster Coordinator receives a heartbeat, it should compare the update count to its own internal update count. If the heartbeat's update count is higher, it should request that the sending node reconnect to the cluster. This will ensure that if this situation were to arise again, the node would reconnect and get the most up-to-date set of revisions.
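
The proposed check could look roughly like the sketch below. It is a hedged illustration of the idea only, not the actual patch: `RevisionManager`, `Heartbeat`, `Coordinator`, `revision_update_count`, and the returned status strings are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RevisionManager:
    """Hypothetical revision store that counts every update it applies."""
    revisions: Dict[str, int] = field(default_factory=dict)
    update_count: int = 0

    def update(self, component_id: str, version: int) -> None:
        self.revisions[component_id] = version
        self.update_count += 1


@dataclass
class Heartbeat:
    """Hypothetical heartbeat payload carrying the sender's update count."""
    node_id: str
    revision_update_count: int


class Coordinator:
    def __init__(self, revision_manager: RevisionManager) -> None:
        self.revision_manager = revision_manager
        self.reconnect_requests: List[str] = []

    def handle_heartbeat(self, heartbeat: Heartbeat) -> str:
        # If the sender has seen more revision updates than the coordinator,
        # the two revision sets cannot be assumed to match: ask it to reconnect.
        if heartbeat.revision_update_count > self.revision_manager.update_count:
            self.reconnect_requests.append(heartbeat.node_id)
            return "RECONNECT_REQUESTED"
        return "OK"


# Node 0 restarted with no revisions; Node 1 still holds 79 of them.
node0_manager = RevisionManager()
node1_manager = RevisionManager()
for i in range(79):
    node1_manager.update(f"component-{i}", 1)

coordinator = Coordinator(node0_manager)
result = coordinator.handle_heartbeat(
    Heartbeat("node-1", node1_manager.update_count)
)
in_sync = coordinator.handle_heartbeat(Heartbeat("node-0", 0))
```

In this model the heartbeat from Node 1 triggers a reconnection request because its count (79) exceeds the coordinator's (0), while a node whose count does not exceed the coordinator's is left connected.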


            People

              Assignee: Mark Payne (markap14)
              Reporter: Mark Payne (markap14)