[NIFI-4475] Processors that use session.get(batchsize) will yield if multiple inbound connections exist where at least one connection is empty. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.3.0
Fix Version/s: 1.5.0
Component/s: Core Framework
Labels:
- nifi

Description

There is a difference between how the NiFi framework handles processors that request batches of incoming FlowFiles (session.get(batchsize)) and processors that request one FlowFile at a time (session.get()).

For example, PutSyslog works on batches, while PutUDP processes one FlowFile at a time.

With the batch method, a thread polls the first connection and requests a batch of FlowFiles. If it gets at least one FlowFile, it sends what it received and the thread ends. The next thread round-robins to the next connection (a looped failure relationship, for example) and requests a batch again. If that connection is empty, the framework assumes there is no work to do and yields the processor for the configured "yield duration". So regardless of the run schedule, the processor will not run again until the yield duration has elapsed.
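The batch behavior described above can be sketched as a small simulation. This is a simplified model in plain Java, not NiFi framework code: the class `BatchPollModel`, the `Outcome` enum, and the `runOnce` method are hypothetical names, and queues of strings stand in for connections holding FlowFiles.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified model of the batch polling described in this issue.
// All names are hypothetical; this is not NiFi framework code.
class BatchPollModel {
    enum Outcome { PROCESSED, YIELDED }

    private final List<Deque<String>> connections;
    private int next = 0; // index of the connection the next invocation polls

    BatchPollModel(List<Deque<String>> connections) {
        this.connections = connections;
    }

    // One invocation: poll exactly one connection, then advance round-robin.
    Outcome runOnce(int batchSize, List<String> sink) {
        Deque<String> conn = connections.get(next);
        next = (next + 1) % connections.size();
        if (conn.isEmpty()) {
            // The other connections are never checked before yielding.
            return Outcome.YIELDED;
        }
        for (int i = 0; i < batchSize && !conn.isEmpty(); i++) {
            sink.add(conn.poll());
        }
        return Outcome.PROCESSED;
    }

    public static void main(String[] args) {
        Deque<String> upstream = new ArrayDeque<>(List.of("a", "b", "c", "d"));
        Deque<String> failureLoop = new ArrayDeque<>(); // normally empty
        BatchPollModel model = new BatchPollModel(List.of(upstream, failureLoop));
        List<String> sink = new ArrayList<>();

        System.out.println(model.runOnce(2, sink)); // PROCESSED
        System.out.println(model.runOnce(2, sink)); // YIELDED
        System.out.println(upstream.size() + " FlowFiles still queued"); // 2 FlowFiles still queued
    }
}
```

In this model the second invocation yields even though two FlowFiles are still waiting on the first connection, which is the situation the issue reports.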

With processors that work on only one FlowFile at a time, the thread round-robins through all the inbound connections until it finds a FlowFile. If it does not find a FlowFile in any connection, the framework yields the processor for the configured yield duration.
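For comparison, the single-FlowFile behavior can be modeled the same way. Again this is only a sketch in plain Java under the description given here, not NiFi framework code: `SinglePollModel` and `runOnce` are hypothetical names, and an empty `Optional` stands for "nothing found on any connection, so the processor yields".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Optional;

// Simplified model of single-FlowFile polling as described in this issue.
// All names are hypothetical; this is not NiFi framework code.
class SinglePollModel {
    private final List<Deque<String>> connections;
    private int next = 0; // round-robin position

    SinglePollModel(List<Deque<String>> connections) {
        this.connections = connections;
    }

    // One invocation: try every connection once, starting at the round-robin
    // position, and return the first FlowFile found.
    Optional<String> runOnce() {
        int n = connections.size();
        for (int i = 0; i < n; i++) {
            Deque<String> conn = connections.get(next);
            next = (next + 1) % n;
            if (!conn.isEmpty()) {
                return Optional.of(conn.poll());
            }
        }
        return Optional.empty(); // every connection was empty: yield
    }

    public static void main(String[] args) {
        Deque<String> upstream = new ArrayDeque<>(List.of("a", "b"));
        Deque<String> failureLoop = new ArrayDeque<>(); // normally empty
        SinglePollModel model = new SinglePollModel(List.of(upstream, failureLoop));

        System.out.println(model.runOnce()); // Optional[a]
        System.out.println(model.runOnce()); // Optional[b]
        System.out.println(model.runOnce()); // Optional.empty
    }
}
```

Here the empty failure connection is skipped rather than treated as "no work", so a yield happens only once every connection has been drained.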

The intent of the yield duration is to keep processors with the default run schedule of 0 sec from using excessive CPU while doing nothing; however, in the batch case the processor yields even if FlowFiles exist on another connection. This can have a huge impact on the throughput of processors that use session.get(batchsize).

There are two possible work-arounds to this issue:

1. Reduce the configured yield duration. This should improve performance when multiple inbound connections exist and any of them may normally be empty. The result is better throughput, at the expense of more CPU usage when all connections are truly empty.

2. Have only one inbound connection to processors that work on batches. This can be accomplished by routing the inbound connections through a funnel.
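A rough way to see the effect of both work-arounds is a virtual-clock calculation. The sketch below is a back-of-the-envelope model, not a measurement: `YieldWorkaroundModel` and `batchesCompleted` are hypothetical names, the 10 ms cost per batch and the 10 s window are made-up numbers, and it assumes only one inbound connection ever holds data while every poll of an empty connection costs one full yield duration.

```java
// Back-of-the-envelope model of the two work-arounds.
// All names and numbers are hypothetical; this is not NiFi framework code.
class YieldWorkaroundModel {
    // Counts batches completed within windowMillis when connection 0 always
    // has data and the remaining connections are always empty.
    static long batchesCompleted(int connections, long yieldMillis,
                                 long batchMillis, long windowMillis) {
        long clock = 0;
        long batches = 0;
        int next = 0; // round-robin position
        while (true) {
            // Polling connection 0 costs one batch; any other costs one yield.
            long cost = (next == 0) ? batchMillis : yieldMillis;
            if (clock + cost > windowMillis) {
                return batches;
            }
            clock += cost;
            if (next == 0) {
                batches++;
            }
            next = (next + 1) % connections;
        }
    }

    public static void main(String[] args) {
        long window = 10_000; // 10 s of simulated time
        long batch = 10;      // assumed 10 ms to send one batch

        // Two inbound connections, 1 sec yield duration.
        System.out.println(batchesCompleted(2, 1000, batch, window)); // 10
        // Work-around 1: same flow, yield duration reduced to 100 ms.
        System.out.println(batchesCompleted(2, 100, batch, window));  // 91
        // Work-around 2: a funnel leaves a single inbound connection.
        System.out.println(batchesCompleted(1, 1000, batch, window)); // 1000
    }
}
```

Under these assumptions, shortening the yield duration narrows the gap but each empty poll still costs a yield, whereas a single funneled connection never triggers a yield while data is queued.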

Issue Links

links to

GitHub Pull Request #2337

People

Assignee: Joe Percivall
Reporter: Matthew Clarke
Votes: 0
Watchers: 4

Dates

Created: 09/Oct/17 19:22
Updated: 14/Dec/17 20:15
Resolved: 14/Dec/17 20:09