  1. Apache NiFi
  2. NIFI-9433

Load Balanced Connections hangs / log "Cannot create negative queue size"

Details

    Description

      Simplified scenario to demonstrate problem:
      A 2-node cluster runs a simple flow: GenerateFlowFile -> load-balanced connection -> UpdateAttribute. Separately, and not connected to those two processors, the flow contains Funnel #1 -> non-load-balanced Connection -> Funnel #2.
      GenerateFlowFile is scheduled to run on the Primary Node only. Once it is started, the load-balanced connection becomes very busy load balancing (round robin). Then the connection between the two funnels is removed.
      Immediately, an error is thrown, and the flow gets stuck in a state of constantly throwing errors indicating that a connection (the one just deleted) does not exist and cannot be balanced.
      It is unclear why this connection is being considered by the load balancer at all.

      The sequence of errors includes the following:
      Primary Node reports:
      2021-12-02 12:20:03,812 ERROR [NiFi Web Server-1811] o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged from FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[-206, -20600 Bytes] ]
      java.lang.RuntimeException: Cannot create negative queue size
      2021-12-02 12:20:03,813 ERROR [NiFi Web Server-1811] o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue active from FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[-206, -20600 Bytes] ] to FlowFile Queue Size[ ActiveQueue=[206, 20600 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[-206, -20600 Bytes] ]
      java.lang.RuntimeException: Cannot create negative queue size
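
      The two entries above are consistent with 206 FlowFiles being moved from the Unacknowledged count to the active count while the Unacknowledged count was already zero. The Java sketch below reproduces only that arithmetic, under stated assumptions: QueueSizeSketch, QueueCounts, and returnToActive are hypothetical names, byte sizes are omitted, and this is not NiFi's SwappablePriorityQueue implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of queue-size bookkeeping; not NiFi's actual code.
final class QueueSizeSketch {

    // Immutable snapshot of the two counters that appear in the log lines.
    static final class QueueCounts {
        final int active;
        final int unacknowledged;

        QueueCounts(int active, int unacknowledged) {
            this.active = active;
            this.unacknowledged = unacknowledged;
        }

        boolean hasNegative() {
            return active < 0 || unacknowledged < 0;
        }

        @Override
        public String toString() {
            return "ActiveQueue=[" + active + "], Unacknowledged=[" + unacknowledged + "]";
        }
    }

    private QueueCounts counts = new QueueCounts(0, 0);
    final List<String> errors = new ArrayList<>();

    QueueCounts counts() {
        return counts;
    }

    // Assumed behavior: FlowFiles sent to a peer but never acknowledged are
    // returned to the active queue in two separate counter updates.
    void returnToActive(int flowFiles) {
        apply("Unacknowledged", new QueueCounts(counts.active, counts.unacknowledged - flowFiles));
        apply("active", new QueueCounts(counts.active + flowFiles, counts.unacknowledged));
    }

    // Records an error whenever any counter is negative after the update.
    private void apply(String part, QueueCounts updated) {
        if (updated.hasNegative()) {
            errors.add("Updated Size of Queue " + part + " from " + counts + " to " + updated);
        }
        counts = updated;
    }

    public static void main(String[] args) {
        QueueSizeSketch queue = new QueueSizeSketch();
        queue.returnToActive(206); // nothing was actually unacknowledged
        queue.errors.forEach(System.out::println);
    }
}
```

      In this model a single call records two errors, in the same order as the log: first the Unacknowledged update (0 to -206), then the active update (0 to 206) while Unacknowledged is still negative.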

      The above may be a symptom of subsequent errors in the log:
      Primary Node reports:
      2021-12-02 12:20:03,814 ERROR [Load-Balanced Client Thread-6] o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer <host:port>
      java.io.IOException: Failed to negotiate Protocol Version with Peer <host:port>. Recommended version 1 but instead of an ACCEPT or REJECT response got back a response of 33.

      Non-Primary Node reports:
      2021-12-02 12:20:03,828 ERROR [Load-Balance Server Thread-4] o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer<fqdn/IP:port>
      java.io.IOException: Expected to receive Transaction Completion Indicator from Peer <fqdn> but instead received a value of 1
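
      Both messages describe a peer reading a byte that does not belong to the step it believes it is in. A minimal Java sketch of such a check follows, assuming a one-byte ACCEPT/REJECT answer. The constant values and the name VersionNegotiationSketch are hypothetical; the real wire values are not given in this report, and a stale byte left in the stream is only one possible explanation.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a version-negotiation check; not NiFi's actual client.
final class VersionNegotiationSketch {
    // Illustrative response codes only; the real protocol values may differ.
    static final int ACCEPT = 0x10;
    static final int REJECT = 0x11;

    // Reads the peer's one-byte answer to a proposed protocol version.
    // Returns true on ACCEPT, false on REJECT, and fails on anything else.
    static boolean readNegotiationResponse(InputStream in, int proposedVersion) throws IOException {
        int response = in.read();
        if (response < 0) {
            throw new EOFException("Peer closed the connection during negotiation");
        }
        if (response == ACCEPT) {
            return true;
        }
        if (response == REJECT) {
            return false;
        }
        throw new IOException("Recommended version " + proposedVersion
                + " but instead of an ACCEPT or REJECT response got back a response of " + response);
    }

    public static void main(String[] args) {
        // Assumed scenario: a leftover byte (33) precedes the real answer,
        // so the reader is out of step with the writer.
        InputStream outOfStep = new ByteArrayInputStream(new byte[] {33, (byte) ACCEPT});
        try {
            readNegotiationResponse(outOfStep, 1);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

      Once two peers disagree about their position in the byte stream, every later fixed-position read (such as a transaction completion indicator) can fail in the same way, which would fit the paired errors on both nodes.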

      Most concerning is the following error, which indicates that a Connection not configured to load balance was attempting to receive a FlowFile.
      Non-Primary Node reports:
      2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808] o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles from Peer <fqdn> for Connection with ID <uuid> but no connection exists with that ID.

      Note that the <uuid> value in this message corresponds to the Connection whose removal caused the errors to begin. Should the above message ever occur? Does the load balancer ever consider Connections which are configured as "Do not load balance"?
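
      The expectation behind these questions can be written down as a receiving-side check. The Java sketch below is only an illustration of that expectation: ConnectionRegistrySketch, Strategy, and validateIncomingTransfer are hypothetical names and do not describe how StandardLoadBalanceProtocol actually resolves Connections.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry of connections on the receiving node; not NiFi's API.
final class ConnectionRegistrySketch {
    enum Strategy { DO_NOT_LOAD_BALANCE, ROUND_ROBIN }

    private final Map<String, Strategy> connections = new ConcurrentHashMap<>();

    void add(String connectionId, Strategy strategy) {
        connections.put(connectionId, strategy);
    }

    void remove(String connectionId) {
        connections.remove(connectionId);
    }

    // Returns a rejection reason, or empty when the transfer may proceed.
    Optional<String> validateIncomingTransfer(String connectionId) {
        Strategy strategy = connections.get(connectionId);
        if (strategy == null) {
            return Optional.of("no connection exists with ID " + connectionId);
        }
        if (strategy == Strategy.DO_NOT_LOAD_BALANCE) {
            return Optional.of("connection " + connectionId + " is not configured to load balance");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ConnectionRegistrySketch registry = new ConnectionRegistrySketch();
        registry.add("funnel-connection", Strategy.DO_NOT_LOAD_BALANCE);
        registry.remove("funnel-connection"); // the connection between the funnels is deleted
        System.out.println(registry.validateIncomingTransfer("funnel-connection").orElse("accepted"));
    }
}
```

      Under this model a sender should never name the funnel connection at all, before or after its removal, so the message in the log would point to the wrong connection ID reaching the receiver rather than to a legitimate transfer.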

      Users have also reported that FlowFiles have been load balanced from one Connection to another, unrelated Connection on the other Node. (This is still being verified.)

      Finally, in the UI the load-balanced connection indicates that it is actively load balancing some number (206 in this case) of the FlowFiles currently in the connection, yet attempts to "list queue" on this connection show no FlowFiles. Presumably they are being held by the load balancer and are inaccessible in the queue.

            People

              markap14 Mark Payne
              markbean Mark Bean
              Votes: 0
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m