  1. Apache NiFi
  2. NIFI-11557

Eliminate use of Files.walkFileTree for any performance-critical parts of application

Details

    Description

      The FileSystemRepository (content repo implementation) and ListFile both make use of the Files.walkFileTree method. Recently, I worked with a user who had horribly long startup times. Thread dumps showed that the time was spent almost entirely in the FileSystemRepository's initializeRepository method, which walks the file tree in order to determine which archive files can be cleaned up next. This is done during startup and again periodically in background threads.

      I made a small modification locally to instead use the standard synchronous IO methods (the File.listFiles method). I used GenerateFlowFile to generate 1-byte FlowFiles and set nifi.content.claim.max.appendable.size=1 B in nifi.properties in order to generate a huge number of files, about 1.2 million files in the content repository, and restarted a few times. Additionally, I added some log lines to show how long this part of the startup process took.
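
As a rough illustration of the kind of change described above (not the actual patch), a recursive traversal built on File.listFiles might look like the sketch below. The class name ListFilesScan and its method names are hypothetical and do not come from the NiFi codebase.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the actual FileSystemRepository code.
final class ListFilesScan {

    // Recursively collects every non-directory entry under root using File.listFiles.
    static List<File> listRecursively(final File root) {
        final List<File> found = new ArrayList<>();
        collect(root, found);
        return found;
    }

    private static void collect(final File dir, final List<File> found) {
        final File[] children = dir.listFiles();
        if (children == null) {
            // listFiles returns null if dir is not a directory or an I/O error occurs.
            return;
        }
        for (final File child : children) {
            if (child.isDirectory()) {
                // Note: this sketch does not guard against symbolic-link cycles.
                collect(child, found);
            } else {
                found.add(child);
            }
        }
    }
}
```

Calling ListFilesScan.listRecursively(new File("/path/to/content_repository")) would return the files found beneath that directory; the path here is only a placeholder.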

      With the existing code, startup took 210 seconds (3.5 mins). With the new implementation, it took 6.7 seconds. This appears to be because NIO.2 performs an individual disk access for every file to obtain its file attributes, whereas the File objects returned by the File.listFiles method already have the necessary attributes. As a result, the NIO.2 approach makes millions of unnecessary disk accesses. As the number of files in the repository grows, the discrepancy also grows.
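
To reproduce a comparison like the one above on another directory tree, a small timing harness along these lines could be used. It is only a sketch: the class name TraversalTiming is hypothetical, it merely counts files rather than doing any archive cleanup, and the timings it prints depend heavily on the file system and on OS caching (whichever traversal runs second may benefit from a warm cache).

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical benchmark helper for illustration; not part of NiFi.
final class TraversalTiming {

    // Counts non-directory entries using Files.walkFileTree (NIO.2).
    static long countWithWalkFileTree(final Path root) throws IOException {
        final long[] count = {0};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(final Path file, final BasicFileAttributes attrs) {
                count[0]++;
                return FileVisitResult.CONTINUE;
            }
        });
        return count[0];
    }

    // Counts non-directory entries using recursive File.listFiles.
    // Counts may differ from the NIO.2 version when symbolic links are present.
    static long countWithListFiles(final File dir) {
        final File[] children = dir.listFiles();
        if (children == null) {
            return 0;
        }
        long count = 0;
        for (final File child : children) {
            count += child.isDirectory() ? countWithListFiles(child) : 1;
        }
        return count;
    }

    public static void main(final String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TraversalTiming <directory>");
            return;
        }
        final Path root = Paths.get(args[0]);

        long start = System.nanoTime();
        long files = countWithWalkFileTree(root);
        System.out.printf("walkFileTree: %d files in %d ms%n", files, (System.nanoTime() - start) / 1_000_000);

        start = System.nanoTime();
        files = countWithListFiles(root.toFile());
        System.out.printf("listFiles:    %d files in %d ms%n", files, (System.nanoTime() - start) / 1_000_000);
    }
}
```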

      We need to eliminate any use of Files.walkFileTree for any performance-critical parts of the codebase.

            People

              markap14 Mark Payne

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h