Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
1.3.0
Description
The NiFi framework handles incoming data differently depending on whether a processor requests a batch of FlowFiles (session.get(batchsize)) or one FlowFile at a time (session.get()).
For example, PutSyslog works on batches, while PutUDP processes one FlowFile at a time.
With the batch method, a thread polls connection 1 and requests a batch of FlowFiles. If it gets at least one FlowFile, it sends what it received and the thread ends. The next thread round-robins to the next connection (a looped failure relationship, for example) and requests a batch again. If that connection is empty, the framework assumes there is no work to do and yields the processor for the configured "yield duration". So, regardless of the run schedule, the processor will not run again until the yield duration has elapsed.
With processors that work on only one FlowFile at a time, the thread round-robins through all the inbound connections until it finds a FlowFile. If it does not find a FlowFile in any connection, the framework yields the processor for the configured yield duration.
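The difference between the two polling styles can be sketched with a small toy model. This is only an illustration of the behavior described above, not NiFi's scheduling code: the class PollingModel, its methods, and the two-connection setup (one busy connection, one normally empty connection such as a looped failure relationship) are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of the behavior described in this issue; NOT NiFi's actual scheduling code.
final class PollingModel {
    private final List<Deque<String>> connections = new ArrayList<>();
    private int next = 0; // connection the next poll starts from (round-robin)

    PollingModel(int connectionCount) {
        for (int i = 0; i < connectionCount; i++) {
            connections.add(new ArrayDeque<>());
        }
    }

    void enqueue(int connection, String flowFile) {
        connections.get(connection).add(flowFile);
    }

    int queued() {
        int total = 0;
        for (Deque<String> connection : connections) {
            total += connection.size();
        }
        return total;
    }

    // Models the batch style: only the next connection in the rotation is polled.
    List<String> getBatch(int batchSize) {
        Deque<String> connection = connections.get(next);
        next = (next + 1) % connections.size();
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !connection.isEmpty()) {
            batch.add(connection.poll());
        }
        return batch; // an empty batch is treated as "no work", so the processor yields
    }

    // Models the single style: every connection is tried once before giving up.
    String getOne() {
        for (int i = 0; i < connections.size(); i++) {
            Deque<String> connection = connections.get(next);
            next = (next + 1) % connections.size();
            if (!connection.isEmpty()) {
                return connection.poll();
            }
        }
        return null; // null means all connections were empty, so the processor yields
    }

    // Two connections: connection 0 holds all the data, connection 1 stays empty.
    static PollingModel busyAndEmpty(int flowFiles) {
        PollingModel model = new PollingModel(2);
        for (int i = 0; i < flowFiles; i++) {
            model.enqueue(0, "flowfile-" + i);
        }
        return model;
    }

    // Counts yields that happen while data is still queued, batch style.
    static int countBatchYields(int flowFiles, int batchSize) {
        PollingModel model = busyAndEmpty(flowFiles);
        int yields = 0;
        while (model.queued() > 0) {
            if (model.getBatch(batchSize).isEmpty()) {
                yields++;
            }
        }
        return yields;
    }

    // Counts yields that happen while data is still queued, single style.
    static int countSingleYields(int flowFiles) {
        PollingModel model = busyAndEmpty(flowFiles);
        int yields = 0;
        while (model.queued() > 0) {
            if (model.getOne() == null) {
                yields++;
            }
        }
        return yields;
    }

    public static void main(String[] args) {
        System.out.println("batch yields:  " + countBatchYields(20, 5));
        System.out.println("single yields: " + countSingleYields(20));
    }
}
```

In this model, draining 20 FlowFiles with a batch size of 5 yields 3 times while data is still waiting on connection 0, whereas the single-FlowFile style never yields until every connection is empty.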
The intent of the yield duration is to keep processors with the default run schedule of 0 sec from using excessive CPU while doing nothing; however, in the batch case the processor yields even if FlowFiles exist on another connection. This can have a huge impact on the throughput of processors that use session.get(batchsize).
There are two possible workarounds for this issue:
1. Reduce the configured yield duration. This should improve performance when multiple inbound connections exist and any of them may normally be empty. The result is better throughput, at the expense of more CPU usage when all connections are truly empty.
2. Give processors that work on batches only one inbound connection. This can be accomplished by routing the connections through a funnel.
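To get a feel for why the first workaround helps, here is a rough back-of-the-envelope estimate. It assumes the worst case described above, where every batch taken from the busy connection is followed by exactly one yield on an empty connection. The class YieldThroughputEstimate, its method, and all the numbers are hypothetical and only illustrate the arithmetic; they are not measurements.

```java
// Rough estimate only: assumes one yield after every batch (busy connection + empty connection).
final class YieldThroughputEstimate {

    // Upper bound on FlowFiles per second when each cycle is "process one batch, then yield once".
    // batchSize, processingMillis and yieldMillis are illustrative inputs, not NiFi defaults.
    static double maxFlowFilesPerSecond(int batchSize, long processingMillis, long yieldMillis) {
        double cycleSeconds = (processingMillis + yieldMillis) / 1000.0;
        return batchSize / cycleSeconds;
    }

    public static void main(String[] args) {
        // Same batch size and processing time, two different yield durations.
        System.out.println("yield 1000 ms: " + maxFlowFilesPerSecond(100, 50, 1000));
        System.out.println("yield  100 ms: " + maxFlowFilesPerSecond(100, 50, 100));
    }
}
```

Under this assumption the yield duration dominates the cycle time whenever it is much longer than the time needed to process a batch, which is why shortening it raises the ceiling. The second workaround avoids the yield entirely, because a single inbound connection is never skipped in favor of an empty one.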