  1. Flume
  2. FLUME-2918

TaildirSource is underperforming with huge parent directories

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: Sinks+Sources
    • Labels:
    • Release Note:
      Introduced an option in the Flume configuration for the TAILDIR source to cache pattern-matched files for huge directories.
    • Flags:
      Patch

      Description

      TaildirSource causes high CPU utilization when a large number of files sit in the target directory. In the observed case the file pattern matches only a single file, but the parent directory contains about 50,000 other files.

      1. FLUME-2918-2.patch
        31 kB
        Attila Simon
      2. PerfHugeDir.java
        6 kB
        Attila Simon
      3. perftest.png
        311 kB
        Attila Simon
      4. profiling_after.png
        183 kB
        Attila Simon
      5. profiling_before.png
        515 kB
        Attila Simon
      6. test.csv
        18 kB
        Attila Simon

          Activity

          hudson Hudson added a comment -

          FAILURE: Integrated in Flume-trunk-hbase-1 #164 (See https://builds.apache.org/job/Flume-trunk-hbase-1/164/)
          FLUME-2918. Speed up TaildirSource on directories with many files (mpercy: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=7d1e683fbd7d261fff9fcf17ad78fd8469c64905)

          • flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TailFile.java
          • flume-ng-sources/flume-taildir-source/src/test/java/org/apache/flume/source/taildir/TestTaildirMatcher.java
          • flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirMatcher.java
          • flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirSourceConfigurationConstants.java
          • flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/ReliableTaildirEventReader.java
          • flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirSource.java
          • flume-ng-sources/flume-taildir-source/src/test/java/org/apache/flume/source/taildir/TestTaildirSource.java
          mpercy Mike Percy added a comment -

          Committed to trunk.

          Thanks for the contribution, Attila!

          jira-bot ASF subversion and git services added a comment -

          Commit 7d1e683fbd7d261fff9fcf17ad78fd8469c64905 in flume's branch refs/heads/trunk from Mike Percy
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=7d1e683 ]

          FLUME-2918. Speed up TaildirSource on directories with many files

          This patch greatly improves the performance of TaildirSource on
          directories that contain a large number of files.

          (Attila Simon via Mike Percy)

          mpercy Mike Percy added a comment -

          +1. I'm committing the latest version of the patch from review board.

          sati Attila Simon added a comment -

          Patch is available for review: https://reviews.apache.org/r/48161/

          sati Attila Simon added a comment -

          The patch is considered final and is ready for review.

          sati Attila Simon added a comment -

          • use java.nio.file.DirectoryStream to filter files
          • make pattern match calculation optionally cached
          • add JUnit tests
          • add Javadoc
          • add license
          • Maven build passes

          sati Attila Simon added a comment -

          Comparing the ways the same functionality could be implemented showed that listing the files with java.nio.file.DirectoryStream gives the best overall performance. Only the very first invocation, which carries JIT overhead, performs slightly worse than the proper FileFilter. Please see the attachments:

          • PerfHugeDir.java: generated the execution times
          • test.csv: captured output of executing PerfHugeDir.main()
          • perftest.png: chart of the CSV data (execution time in milliseconds, comparing the different implementations)

          I started with a directory of 59k files in which only a single file matched the pattern; there were also a couple of subdirectories. After ~230 rounds I began removing the non-matching files in bulk and reduced the total to ~20 files in the parent directory; this reduction is responsible for the fade-out in the chart. (In a second test I started with an empty directory and added 300 files/sec up to 59k; DirectoryStream won that one as well. No attachment for this.)
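For illustration, listing a directory through java.nio.file.DirectoryStream with a filter might look roughly like the sketch below. This is not the code from the patch: the class name DirectoryStreamListing and the method listMatchingFiles are hypothetical, and the filter only assumes the behaviour described in this issue (match on the file name, skip directories).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flume.
class DirectoryStreamListing {

  /** Returns the non-directory entries directly under parentDir whose names match the pattern. */
  static List<Path> listMatchingFiles(Path parentDir, final Pattern fileNamePattern)
      throws IOException {
    DirectoryStream.Filter<Path> filter = new DirectoryStream.Filter<Path>() {
      @Override
      public boolean accept(Path entry) {
        // Cheap name check first; the file-type check only runs for matching names.
        return fileNamePattern.matcher(entry.getFileName().toString()).matches()
            && !Files.isDirectory(entry);
      }
    };
    List<Path> result = new ArrayList<>();
    // try-with-resources closes the underlying directory handle.
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(parentDir, filter)) {
      for (Path entry : stream) {
        result.add(entry);
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("taildir-demo");
    Files.createFile(dir.resolve("app.log"));
    Files.createFile(dir.resolve("other.txt"));
    Files.createDirectory(dir.resolve("sub.log")); // matches the name but is a directory
    System.out.println(listMatchingFiles(dir, Pattern.compile(".*\\.log")));
  }
}
```

The stream yields entries lazily while iterating, so the filter is applied entry by entry rather than after a full array of File objects has been built.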
          sati Attila Simon added a comment -

          profiling_after.png shows that with the fix, under the same workload with new files occasionally added to the directory (2 per minute), the time spent in getMatchFiles() is reduced to 3.4% of thread time.

          sati Attila Simon added a comment -

          21% of thread time is spent in ReliableTaildirEventReader.getMatchFiles() -> File.listFiles() -> FileFilter.accept() -> File.isDirectory().

          sati Attila Simon added a comment -

          After checking the control flow, it turned out that ReliableTaildirEventReader.getMatchFiles, the function responsible for checking whether files have been added to or removed from the parent directory of the file pattern, is called every time PollableSourceRunner$PollingRunner instructs the TaildirSource to harvest new data, even when nothing has changed in that directory. The check lists all of the files and filters them with a pattern match and an isDirectory check inside a single if statement, with the directory check evaluated first. Profiling showed that isDirectory is a much more expensive call than the pattern match on the file name, so swapping the order of the two expressions would speed up the evaluation (Java short-circuits boolean expressions) and therefore the directory listing. In addition, caching the last modification time of the parent directory together with the list of matched files for each file pattern prevents unnecessary rechecks.
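The two ideas above (cheap check first, and re-listing only when the parent directory's modification time changes) can be sketched as follows. This is an illustrative sketch only, not the patch itself: the class CachedMatcher, its fields, and the listings counter are hypothetical names introduced here. It also ignores one known caveat, which is that on filesystems with coarse timestamp granularity a change made within the same tick as the previous listing could go unnoticed, so a production version would need extra handling.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical class for illustration; names are not taken from Flume.
class CachedMatcher {
  private final File parentDir;
  private final Pattern fileNamePattern;
  private long lastSeenModified = -1L;            // parent dir mtime at the last real listing
  private List<File> lastMatched = Collections.emptyList();
  int listings = 0;                               // counts real directory scans (demo only)

  CachedMatcher(File parentDir, Pattern fileNamePattern) {
    this.parentDir = parentDir;
    this.fileNamePattern = fileNamePattern;
  }

  List<File> getMatchingFiles() {
    long modified = parentDir.lastModified();
    if (modified == lastSeenModified) {
      return lastMatched;                         // directory unchanged: reuse cached result
    }
    File[] files = parentDir.listFiles(new FileFilter() {
      @Override
      public boolean accept(File f) {
        // Pattern match first; && short-circuits, so the costlier
        // isDirectory() call only runs for names that match.
        return fileNamePattern.matcher(f.getName()).matches() && !f.isDirectory();
      }
    });
    List<File> matched = new ArrayList<>();
    if (files != null) {                          // listFiles() returns null on I/O error
      matched.addAll(Arrays.asList(files));
    }
    lastMatched = matched;
    lastSeenModified = modified;
    listings++;
    return lastMatched;
  }
}
```

Calling getMatchingFiles() repeatedly on an unchanged directory performs a single scan; adding or removing an entry updates the directory's modification time on common filesystems, which triggers the next real listing.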


            People

            • Assignee:
              sati Attila Simon
              Reporter:
              sati Attila Simon
