Apache NiFi / NIFI-11584

MergeContent can be more efficient in terms of disk access

Details

    Description

      Long ago (NIFI-516), we updated MergeContent so that when it read from a FlowFile, it asked the ProcessSession not to manage the InputStream and instead to close the InputStream as soon as reading finished. This was done because if we had, say, 50,000 FlowFiles to merge together, we would have 50,000 ProcessSessions. Since a session by default holds its InputStream open until the session is committed or rolled back, we would hold open 50,000 FileInputStreams. This would quickly lead to IOExceptions due to "too many open files". So in NIFI-516, we addressed the issue by not holding the stream open.

      Then, in NIFI-2850, we made things much more efficient by allowing FlowFiles to be moved from one ProcessSession to another. So now, instead of using 50,000 ProcessSessions, we have a single ProcessSession for the whole bin.
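
      The effect of the two changes on the number of open streams can be illustrated with a toy model. This is a minimal sketch, not NiFi code: OpenStreamModel and ToySession are hypothetical stand-ins, and the model assumes that a session holds at most one stream open at a time and releases it on commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: hypothetical stand-ins, not NiFi's ProcessSession. */
final class OpenStreamModel {

    /** Stand-in session: a managed stream stays open until commit (assumption). */
    static final class ToySession {
        private boolean streamOpen;

        /** If the session manages the stream, it stays open after the read. */
        void read(boolean sessionManagesStream) {
            streamOpen = sessionManagesStream;
        }

        boolean hasOpenStream() {
            return streamOpen;
        }

        void commit() {
            streamOpen = false;
        }
    }

    /** Shape before NIFI-2850: one session per FlowFile in the bin. */
    static int openStreamsWithSessionPerFlowFile(int flowFileCount, boolean sessionManagesStream) {
        List<ToySession> sessions = new ArrayList<>();
        for (int i = 0; i < flowFileCount; i++) {
            ToySession session = new ToySession();
            session.read(sessionManagesStream);
            sessions.add(session);
        }
        // Count streams still open just before the bin is committed.
        int open = 0;
        for (ToySession session : sessions) {
            if (session.hasOpenStream()) {
                open++;
            }
        }
        sessions.forEach(ToySession::commit);
        return open;
    }

    /** Shape after NIFI-2850: every FlowFile migrated into one session for the bin. */
    static int openStreamsWithSingleSession(int flowFileCount, boolean sessionManagesStream) {
        ToySession binSession = new ToySession();
        for (int i = 0; i < flowFileCount; i++) {
            binSession.read(sessionManagesStream);
        }
        int open = binSession.hasOpenStream() ? 1 : 0;
        binSession.commit();
        return open;
    }

    public static void main(String[] args) {
        int n = 50_000;
        System.out.println("per-FlowFile sessions, managed:   " + openStreamsWithSessionPerFlowFile(n, true));
        System.out.println("per-FlowFile sessions, unmanaged: " + openStreamsWithSessionPerFlowFile(n, false));
        System.out.println("single session, managed:          " + openStreamsWithSingleSession(n, true));
    }
}
```

      In this model, the NIFI-516 workaround corresponds to passing false for sessionManagesStream in the per-FlowFile case, and the single-session case shows why the workaround is no longer needed once the bin shares one session.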

      However, we never revisited the decision to ask the ProcessSession not to hold the stream open. With a single session per bin, we can now let the ProcessSession manage the InputStream, as it does elsewhere.

      Additionally, looking at the codebase, MergeContent is the only component that uses this feature of the ProcessSession. It is a bad practice, because the ProcessSession.migrate capability makes it unnecessary. As a result, we should deprecate the void read(FlowFile source, boolean allowSessionStreamManagement, InputStreamCallback reader) throws FlowFileAccessException method in 1.x and remove it in 2.0.
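
      One way such a deprecation could look is sketched below. This is an illustration under stated assumptions, not the actual NiFi change: ToySession and StreamCallback are hypothetical, cut-down types, the content is a byte array rather than a FlowFile, and having the deprecated overload ignore its flag and delegate is an assumption made for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Sketch only: hypothetical types, not the NiFi API. */
final class DeprecationSketch {

    /** Hypothetical callback, shaped loosely like a read callback. */
    interface StreamCallback {
        void process(InputStream in) throws IOException;
    }

    /** Hypothetical, cut-down session interface. */
    interface ToySession {
        /** Preferred overload: the session decides how to manage the stream. */
        void read(byte[] content, StreamCallback reader) throws IOException;

        /**
         * @deprecated in this sketch the flag is ignored (an assumption);
         *             call {@link #read(byte[], StreamCallback)} instead.
         */
        @Deprecated
        default void read(byte[] content, boolean allowSessionStreamManagement, StreamCallback reader)
                throws IOException {
            read(content, reader);
        }
    }

    /** Simple in-memory session used to exercise both overloads. */
    static ToySession newSession() {
        return (content, reader) -> {
            try (InputStream in = new ByteArrayInputStream(content)) {
                reader.process(in);
            }
        };
    }

    /** What a caller looks like after moving to the two-argument overload. */
    static String readViaNewOverload(String text) {
        StringBuilder out = new StringBuilder();
        try {
            newSession().read(text.getBytes(StandardCharsets.UTF_8),
                    in -> out.append(new String(in.readAllBytes(), StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** What a caller looks like before the move; compiles with a deprecation warning. */
    @SuppressWarnings("deprecation")
    static String readViaDeprecatedOverload(String text) {
        StringBuilder out = new StringBuilder();
        try {
            newSession().read(text.getBytes(StandardCharsets.UTF_8), false,
                    in -> out.append(new String(in.readAllBytes(), StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(readViaNewOverload("merged content"));
        System.out.println(readViaDeprecatedOverload("merged content"));
    }
}
```

      Keeping the old overload as a delegating default for one release line lets existing callers compile with a warning, which matches the proposed deprecate-in-1.x, remove-in-2.0 sequence.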

            People

              markap14 Mark Payne