Apache NiFi · NIFI-7646

Improve performance of MergeContent / others that read content of many small FlowFiles


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14.0, 1.13.1
    • Component/s: Core Framework
    • Labels: None

    Description

      When MergeContent merges together 1,000 FlowFiles, it must read the content of each of those FlowFiles. It does so by calling `ProcessSession.read(flowFile)`.
      Currently, the Process Session calls `ContentRepository.read(ContentClaim)` with the Content Claim of the given FlowFile. The Content Repository then creates a new FileInputStream (1+ disk accesses) and seeks to the appropriate location on disk (1 disk access). The stream is then wrapped in a LimitingInputStream to prevent the reader from going beyond the boundaries of the associated Content Claim. So if the FlowFile is small, say 200 bytes, we perform 2+ disk accesses to read those 200 bytes, even though a typical block size is 4-8 KB and a full block could be read in the same amount of time as those 200 bytes.

      As a result, merging 1,000 FlowFiles can result in many disk accesses and a huge degradation in performance.
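To make the cost concrete, here is a small self-contained simulation of the pattern described above, using plain java.io rather than the NiFi API (the class name, file layout, and sizes are illustrative, not NiFi's): 1,000 small "claims" stored back-to-back in one file, each read via a brand-new stream plus a seek.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-in (not NiFi code): a fresh FileInputStream and a
// skip to the claim's offset for every one of 1,000 small claims.
public class PerClaimRead {
    static final int CLAIMS = 1000;
    static final int CLAIM_SIZE = 200;

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("resource-claim", ".bin");
        byte[] all = new byte[CLAIMS * CLAIM_SIZE];
        new Random(42).nextBytes(all);
        Files.write(file, all);

        int ok = 0;
        for (int i = 0; i < CLAIMS; i++) {
            // One new stream (1+ disk accesses) and one seek per claim.
            try (InputStream in = new FileInputStream(file.toFile())) {
                long offset = (long) i * CLAIM_SIZE;
                long skipped = 0;
                while (skipped < offset) {
                    skipped += in.skip(offset - skipped);
                }
                // A bounded read of just this claim's 200 bytes.
                byte[] claim = in.readNBytes(CLAIM_SIZE);
                if (Arrays.equals(claim,
                        Arrays.copyOfRange(all, i * CLAIM_SIZE, (i + 1) * CLAIM_SIZE))) {
                    ok++;
                }
            }
        }
        Files.delete(file);
        System.out.println(ok + " claims read correctly");
    }
}
```

Every iteration pays the open-and-seek cost even though consecutive claims usually sit in the same underlying file.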

      At the same time, the ProcessSession already has a notion of a currentReadClaimStream and a currentReadClaim. We could gain huge performance improvements by making a couple of small changes to the Content Repository and Process Session:

      • In ContentRepository, introduce a new method: `InputStream read(ResourceClaim resourceClaim) throws IOException`. This allows the Process Session to read the entire contents of the underlying Resource Claim when necessary. It is safe because it does not hand raw access to any "user code"; the Process Session will enforce the bounds properly.
      • ProcessSession should use this new method to obtain the stream for an entire Resource Claim. It should then skip to the appropriate location, as the Content Repository will not have done so. The InputStream should then be wrapped in a BufferedInputStream. This helps when a LimitingInputStream restricts reads to only 200 bytes: the disk access will still pull back 4-8 KB, and that data will live in the BufferedInputStream.
      • ProcessSession should change currentReadClaim from a Content Claim to a Resource Claim so that this works. Additionally, the getInputStream() method should relax the constraint "writeRecursionSet.isEmpty()" for reusing the stream and instead use "!writeRecursionSet.contains(flowFile)". This is important for MergeContent, since it writes to one FlowFile while reading from another.
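The proposed read path can be sketched the same way, again with plain java.io rather than the NiFi API (class name and layout are illustrative): the backing file of the whole Resource Claim is opened once, wrapped in a BufferedInputStream, and the caller positions itself and bounds each claim's read.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-in (not NiFi code) for the proposed pattern: one
// stream per Resource Claim instead of one stream per Content Claim.
public class SharedClaimRead {
    static final int CLAIMS = 1000;
    static final int CLAIM_SIZE = 200;

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("resource-claim", ".bin");
        byte[] all = new byte[CLAIMS * CLAIM_SIZE];
        new Random(42).nextBytes(all);
        Files.write(file, all);

        int ok = 0;
        // One stream for the entire Resource Claim; each 8 KB buffered
        // disk read now satisfies ~40 of the 200-byte claim reads.
        try (InputStream in = new BufferedInputStream(
                new FileInputStream(file.toFile()), 8192)) {
            for (int i = 0; i < CLAIMS; i++) {
                // Claims are contiguous here, so no skip is needed between
                // them; in general the caller would skip to each claim's
                // offset, since the repository no longer seeks for it.
                byte[] claim = in.readNBytes(CLAIM_SIZE);
                if (Arrays.equals(claim,
                        Arrays.copyOfRange(all, i * CLAIM_SIZE, (i + 1) * CLAIM_SIZE))) {
                    ok++;
                }
            }
        }
        Files.delete(file);
        System.out.println(ok + " claims read from a single buffered stream");
    }
}
```

The design choice mirrors the bullet points above: the repository hands back a raw stream for the whole Resource Claim, and the session-level wrapper (skip plus a limiting view) keeps each reader within its own claim's bounds.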

      These changes will transparently (to the processors) provide a very significant performance gain in cases where a Processor must read the content of many small FlowFiles, if the FlowFiles all have the same Resource Claim (which is the case more often than not).

    People

      Assignee: markap14 Mark Payne
      Reporter: markap14 Mark Payne
      Votes: 0
      Watchers: 2

    Dates

      Created:
      Updated:
      Resolved:

    Time Tracking

      Estimated: Not Specified
      Remaining: 0h
      Logged: 50m