Apache NiFi · NIFI-7646

Improve performance of MergeContent / others that read content of many small FlowFiles


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14.0, 1.13.1
    • Component/s: Core Framework
    • Labels: None

    Description

      When MergeContent merges together 1,000 FlowFiles, it must read the content of each of those FlowFiles. It does so by calling `ProcessSession.read(flowFile)`.
      Currently, the Process Session calls `ContentRepository.read(ContentClaim)` with the Content Claim of the given FlowFile. The Content Repository then creates a new FileInputStream (1+ disk accesses) and seeks to the appropriate location on disk (1 disk access). The stream is then wrapped in a LimitingInputStream to prevent the reader from going beyond the boundaries of the associated Content Claim. So if the FlowFile is small, say 200 bytes, we perform 2+ disk accesses to read those 200 bytes, even though a typical block size is 4-8 KB and a full block could be read in the same amount of time as those 200 bytes.

      As a result, merging 1,000 FlowFiles can result in many disk accesses and a huge degradation in performance.
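To make the cost concrete, here is a small self-contained simulation of the pattern described above, using plain java.io rather than the NiFi API (the class name, file layout, and sizes are illustrative, not NiFi's): 1,000 small "claims" stored back-to-back in one file, each read via a brand-new stream plus a seek.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-in (not NiFi code): a fresh FileInputStream and a
// skip to the claim's offset for every one of 1,000 small claims.
public class PerClaimRead {
    static final int CLAIMS = 1000;
    static final int CLAIM_SIZE = 200;

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("resource-claim", ".bin");
        byte[] all = new byte[CLAIMS * CLAIM_SIZE];
        new Random(42).nextBytes(all);
        Files.write(file, all);

        int ok = 0;
        for (int i = 0; i < CLAIMS; i++) {
            // One new stream (1+ disk accesses) and one seek per claim.
            try (InputStream in = new FileInputStream(file.toFile())) {
                long offset = (long) i * CLAIM_SIZE;
                long skipped = 0;
                while (skipped < offset) {
                    skipped += in.skip(offset - skipped);
                }
                // A bounded read of just this claim's 200 bytes.
                byte[] claim = in.readNBytes(CLAIM_SIZE);
                if (Arrays.equals(claim,
                        Arrays.copyOfRange(all, i * CLAIM_SIZE, (i + 1) * CLAIM_SIZE))) {
                    ok++;
                }
            }
        }
        Files.delete(file);
        System.out.println(ok + " claims read correctly");
    }
}
```

Every iteration pays the open-and-seek cost even though consecutive claims usually sit in the same underlying file.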

      At the same time, the ProcessSession already has a notion of a currentReadClaimStream and a currentReadClaim. We could gain huge performance improvements by making a couple of small changes to the Content Repository and Process Session:

      • In ContentRepository, introduce a new method: `InputStream read(ResourceClaim resourceClaim) throws IOException`. This allows the Process Session to read the entire contents of the underlying Resource Claim when necessary. It is safe because it does not hand raw access to any "user code"; the Process Session will enforce the bounds properly.
      • ProcessSession should use this new method to obtain the stream for an entire Resource Claim. It should then skip to the appropriate location, as the Content Repository will not have done so. The InputStream should then be wrapped in a BufferedInputStream. This helps when a LimitingInputStream restricts reads to only 200 bytes: the disk access will still pull back 4-8 KB, and that data will live in the BufferedInputStream.
      • ProcessSession should change currentReadClaim from a Content Claim to a Resource Claim so that this works. Additionally, the getInputStream() method should relax the constraint "writeRecursionSet.isEmpty()" for reusing the stream and instead use "!writeRecursionSet.contains(flowFile)". This is important for MergeContent, since it writes to one FlowFile while reading from another.
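The proposed read path can be sketched the same way, again with plain java.io rather than the NiFi API (class name and layout are illustrative): the backing file of the whole Resource Claim is opened once, wrapped in a BufferedInputStream, and the caller positions itself and bounds each claim's read.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-in (not NiFi code) for the proposed pattern: one
// stream per Resource Claim instead of one stream per Content Claim.
public class SharedClaimRead {
    static final int CLAIMS = 1000;
    static final int CLAIM_SIZE = 200;

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("resource-claim", ".bin");
        byte[] all = new byte[CLAIMS * CLAIM_SIZE];
        new Random(42).nextBytes(all);
        Files.write(file, all);

        int ok = 0;
        // One stream for the entire Resource Claim; each 8 KB buffered
        // disk read now satisfies ~40 of the 200-byte claim reads.
        try (InputStream in = new BufferedInputStream(
                new FileInputStream(file.toFile()), 8192)) {
            for (int i = 0; i < CLAIMS; i++) {
                // Claims are contiguous here, so no skip is needed between
                // them; in general the caller would skip to each claim's
                // offset, since the repository no longer seeks for it.
                byte[] claim = in.readNBytes(CLAIM_SIZE);
                if (Arrays.equals(claim,
                        Arrays.copyOfRange(all, i * CLAIM_SIZE, (i + 1) * CLAIM_SIZE))) {
                    ok++;
                }
            }
        }
        Files.delete(file);
        System.out.println(ok + " claims read from a single buffered stream");
    }
}
```

The design choice mirrors the bullet points above: the repository hands back a raw stream for the whole Resource Claim, and the session-level wrapper (skip plus a limiting view) keeps each reader within its own claim's bounds.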

      These changes will transparently (to the processors) provide a very significant performance gain in cases where a Processor must read the content of many small FlowFiles, if the FlowFiles all have the same Resource Claim (which is the case more often than not).

    People

      Assignee: markap14 Mark Payne
      Reporter: markap14 Mark Payne
      Votes: 0
      Watchers: 2

    Dates

      Created:
      Updated:
      Resolved:

    Time Tracking

      Estimated: Not Specified
      Remaining: 0h
      Logged: 50m