Uploaded image for project: 'Flume'
  1. Flume
  2. FLUME-2502

Spool source's directory listing is inefficient

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      As mentioned in FLUME-2309, the directory listing can it self become the bottleneck when accessing directories with a large number of files (>1M). The fix in that JIRA added in the ability to specify `RANDOM` as a Consume-Order to avoid sorting large lists.

      The slowness of the directory listing is still un-addressed.

      1. FLUME-2502-0.patch
        7 kB
        Prateek Rungta
      2. FLUME-2502-1.patch
        3 kB
        Prateek Rungta
      3. FLUME-2502-2.patch
        3 kB
        Prateek Rungta
      4. FLUME-2502-3.patch
        3 kB
        Prateek Rungta
      5. FLUME-2502-final.patch
        8 kB
        Hari Shreedharan

        Issue Links

          Activity

          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Flume-trunk-hbase-98 #47 (See https://builds.apache.org/job/Flume-trunk-hbase-98/47/)
          FLUME-2505: Test added in FLUME-2502 is flaky (jarcec: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=76ddc82fb829995f2c88976d4153de132edf0a20)

          • flume-ng-core/src/test/java/org/apache/flume/client/avro/TestReliableSpoolingFileEventReader.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Flume-trunk-hbase-98 #47 (See https://builds.apache.org/job/Flume-trunk-hbase-98/47/ ) FLUME-2505 : Test added in FLUME-2502 is flaky (jarcec: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=76ddc82fb829995f2c88976d4153de132edf0a20 ) flume-ng-core/src/test/java/org/apache/flume/client/avro/TestReliableSpoolingFileEventReader.java
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in flume-trunk #690 (See https://builds.apache.org/job/flume-trunk/690/)
          FLUME-2505: Test added in FLUME-2502 is flaky (jarcec: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=76ddc82fb829995f2c88976d4153de132edf0a20)

          • flume-ng-core/src/test/java/org/apache/flume/client/avro/TestReliableSpoolingFileEventReader.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in flume-trunk #690 (See https://builds.apache.org/job/flume-trunk/690/ ) FLUME-2505 : Test added in FLUME-2502 is flaky (jarcec: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=76ddc82fb829995f2c88976d4153de132edf0a20 ) flume-ng-core/src/test/java/org/apache/flume/client/avro/TestReliableSpoolingFileEventReader.java
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4976f587a76c4d9b7acd291b73cd7a60427be130 in flume's branch refs/heads/flume-1.6 from Jarek Jarcec Cecho
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=4976f58 ]

          FLUME-2505: Test added in FLUME-2502 is flaky

          (Hari Shreedharan via Jarek Jarcec Cecho)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4976f587a76c4d9b7acd291b73cd7a60427be130 in flume's branch refs/heads/flume-1.6 from Jarek Jarcec Cecho [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=4976f58 ] FLUME-2505 : Test added in FLUME-2502 is flaky (Hari Shreedharan via Jarek Jarcec Cecho)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 76ddc82fb829995f2c88976d4153de132edf0a20 in flume's branch refs/heads/trunk from Jarek Jarcec Cecho
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=76ddc82 ]

          FLUME-2505: Test added in FLUME-2502 is flaky

          (Hari Shreedharan via Jarek Jarcec Cecho)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 76ddc82fb829995f2c88976d4153de132edf0a20 in flume's branch refs/heads/trunk from Jarek Jarcec Cecho [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=76ddc82 ] FLUME-2505 : Test added in FLUME-2502 is flaky (Hari Shreedharan via Jarek Jarcec Cecho)
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Flume-trunk-hbase-98 #36 (See https://builds.apache.org/job/Flume-trunk-hbase-98/36/)
          FLUME-2502. Improve Spool Directory Source's performance by not listing files each time. (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=f979b2683fc48d85806ae7593ee0e393bd812260)

          • flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java
          • flume-ng-core/src/test/java/org/apache/flume/client/avro/TestReliableSpoolingFileEventReader.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Flume-trunk-hbase-98 #36 (See https://builds.apache.org/job/Flume-trunk-hbase-98/36/ ) FLUME-2502 . Improve Spool Directory Source's performance by not listing files each time. (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=f979b2683fc48d85806ae7593ee0e393bd812260 ) flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java flume-ng-core/src/test/java/org/apache/flume/client/avro/TestReliableSpoolingFileEventReader.java
          Hide
          hudson Hudson added a comment -

          UNSTABLE: Integrated in flume-trunk #677 (See https://builds.apache.org/job/flume-trunk/677/)
          FLUME-2502. Improve Spool Directory Source's performance by not listing files each time. (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=f979b2683fc48d85806ae7593ee0e393bd812260)

          • flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java
          • flume-ng-core/src/test/java/org/apache/flume/client/avro/TestReliableSpoolingFileEventReader.java
          Show
          hudson Hudson added a comment - UNSTABLE: Integrated in flume-trunk #677 (See https://builds.apache.org/job/flume-trunk/677/ ) FLUME-2502 . Improve Spool Directory Source's performance by not listing files each time. (hshreedharan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=f979b2683fc48d85806ae7593ee0e393bd812260 ) flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java flume-ng-core/src/test/java/org/apache/flume/client/avro/TestReliableSpoolingFileEventReader.java
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Committed! Thanks Prateek!

          Show
          hshreedharan Hari Shreedharan added a comment - Committed! Thanks Prateek!
          Hide
          hshreedharan Hari Shreedharan added a comment -

          The final patch that I committed - the last patch + unit test I added.

          Show
          hshreedharan Hari Shreedharan added a comment - The final patch that I committed - the last patch + unit test I added.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 25f331d0e1ede0408412eea014b17efa6d7bd24f in flume's branch refs/heads/flume-1.6 from Hari Shreedharan
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=25f331d ]

          FLUME-2502. Improve Spool Directory Source's performance by not listing files each time.

          (Prateek Rungta via Hari)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 25f331d0e1ede0408412eea014b17efa6d7bd24f in flume's branch refs/heads/flume-1.6 from Hari Shreedharan [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=25f331d ] FLUME-2502 . Improve Spool Directory Source's performance by not listing files each time. (Prateek Rungta via Hari)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit f979b2683fc48d85806ae7593ee0e393bd812260 in flume's branch refs/heads/trunk from Hari Shreedharan
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=f979b26 ]

          FLUME-2502. Improve Spool Directory Source's performance by not listing files each time.

          (Prateek Rungta via Hari)

          Show
          jira-bot ASF subversion and git services added a comment - Commit f979b2683fc48d85806ae7593ee0e393bd812260 in flume's branch refs/heads/trunk from Hari Shreedharan [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=f979b26 ] FLUME-2502 . Improve Spool Directory Source's performance by not listing files each time. (Prateek Rungta via Hari)
          Hide
          hshreedharan Hari Shreedharan added a comment -

          +1. I am adding a unit test and committing this.

          Show
          hshreedharan Hari Shreedharan added a comment - +1. I am adding a unit test and committing this.
          Hide
          prateekrungta Prateek Rungta added a comment -

          Awesome, thanks again!

          On Wed, Oct 15, 2014 at 8:13 PM, Hari Shreedharan (JIRA) <jira@apache.org>

          Show
          prateekrungta Prateek Rungta added a comment - Awesome, thanks again! On Wed, Oct 15, 2014 at 8:13 PM, Hari Shreedharan (JIRA) <jira@apache.org>
          Hide
          hshreedharan Hari Shreedharan added a comment -

          This looks good. I will run tests and commit this later today.

          Show
          hshreedharan Hari Shreedharan added a comment - This looks good. I will run tests and commit this later today.
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Cool! Looking at it now.

          Show
          hshreedharan Hari Shreedharan added a comment - Cool! Looking at it now.
          Hide
          prateekrungta Prateek Rungta added a comment -

          Attached `FLUME-2502-3` which uses an Iterator instead of the list copy. Even the tests are like 5x faster than `FLUME-2502-2` version. Good call Hari Shreedharan!

          Show
          prateekrungta Prateek Rungta added a comment - Attached ` FLUME-2502 -3` which uses an Iterator instead of the list copy. Even the tests are like 5x faster than ` FLUME-2502 -2` version. Good call Hari Shreedharan !
          Hide
          hshreedharan Hari Shreedharan added a comment -

          I like the fact that you are removing the last element - that is pretty efficient, since that means you are not moving every element. Using an iterator won't allow you to remove the last one without traversing the entire list. I'd keep the current remove(size-1) code, but just do the copy if the list is not already not an ArrayList.

          Show
          hshreedharan Hari Shreedharan added a comment - I like the fact that you are removing the last element - that is pretty efficient, since that means you are not moving every element. Using an iterator won't allow you to remove the last one without traversing the entire list. I'd keep the current remove(size-1) code, but just do the copy if the list is not already not an ArrayList.
          Hide
          hshreedharan Hari Shreedharan added a comment -

          In java 6 and 7, Arrays.asList returns an ArrayList. You could simply check if it is an instance of ArrayList and cast it. Else do what is currently being done. It is a hack, but it can reduce one array copy - which can be really expensive if the number of new files is huge.

          Show
          hshreedharan Hari Shreedharan added a comment - In java 6 and 7, Arrays.asList returns an ArrayList. You could simply check if it is an instance of ArrayList and cast it. Else do what is currently being done. It is a hack, but it can reduce one array copy - which can be really expensive if the number of new files is huge.
          Hide
          prateekrungta Prateek Rungta added a comment -

          Added a new patch with updates per your comments.

          Re: the ArrayList comment – I needed the ArrayList (instead of a List) for the `remove` to be efficient so I copied the `List` to minimize code modifications. If you prefer, I can get rid of the copy by using an iterator to track the state.

          Show
          prateekrungta Prateek Rungta added a comment - Added a new patch with updates per your comments. Re: the ArrayList comment – I needed the ArrayList (instead of a List) for the `remove` to be efficient so I copied the `List` to minimize code modifications. If you prefer, I can get rid of the copy by using an iterator to track the state.
          Hide
          hshreedharan Hari Shreedharan added a comment - - edited

          Patch looks good in general. Minor comments, mostly formatting:

          • No spaces after < and before > for generic declarations.
          • The { should be on the same line as ) for if conditionals.
          • No spaces after ( and before ) in candidateFiles.remove( candidateFiles.size() - 1 );
          • In the if condition, the RANDOM check should be first, since that means you can avoid the other two tests for non-random cases. In Random cases, all 3 checks are required anyway.
          • In this: new ArrayList< File > (Arrays.asList(spoolDirectory.listFiles(filter))); you don't need to create a new ArrayList - it creates a copy.
          Show
          hshreedharan Hari Shreedharan added a comment - - edited Patch looks good in general. Minor comments, mostly formatting: No spaces after < and before > for generic declarations. The { should be on the same line as ) for if conditionals. No spaces after ( and before ) in candidateFiles.remove( candidateFiles.size() - 1 ); In the if condition, the RANDOM check should be first, since that means you can avoid the other two tests for non-random cases. In Random cases, all 3 checks are required anyway. In this: new ArrayList< File > (Arrays.asList(spoolDirectory.listFiles(filter))); you don't need to create a new ArrayList - it creates a copy.
          Hide
          prateekrungta Prateek Rungta added a comment -

          Adding new patch which consolidates "BATCH" and "RANDOM" consumption orders.

          Re your comments:

          • The new code uses the same filter as the existing code base for retrieving file listing so no issue there.
          • If the File does not exist after it has been listed from the directory:
            The code base to handle it is unchanged from what was done earlier. My understanding is the `openFile` method will handle the errors as you indicated - i.e. log an error and move on.
          Show
          prateekrungta Prateek Rungta added a comment - Adding new patch which consolidates "BATCH" and "RANDOM" consumption orders. Re your comments: The new code uses the same filter as the existing code base for retrieving file listing so no issue there. If the File does not exist after it has been listed from the directory: The code base to handle it is unchanged from what was done earlier. My understanding is the `openFile` method will handle the errors as you indicated - i.e. log an error and move on.
          Hide
          prateekrungta Prateek Rungta added a comment -

          Hari Shreedharan cool! I debated putting it in the same place as "RANDOM". The only reason I wanted to put it in a separate section was: it changes the amount of memory used by the Agent. I haven't profiled it to know how much memory it now uses, and I didn't want to affect existing users of "RANDOM". But I'm not a fan of the new consume order either.

          I'll defer to your judgement on which is the lesser of the two evils

          Show
          prateekrungta Prateek Rungta added a comment - Hari Shreedharan cool! I debated putting it in the same place as "RANDOM". The only reason I wanted to put it in a separate section was: it changes the amount of memory used by the Agent. I haven't profiled it to know how much memory it now uses, and I didn't want to affect existing users of "RANDOM". But I'm not a fan of the new consume order either. I'll defer to your judgement on which is the lesser of the two evils
          Hide
          prateekrungta Prateek Rungta added a comment -

          Hari Shreedharan I'm not familiar with that API, but it looks like exactly what's need in the long run. In the interim, can we apply this patch?

          Without applying the patch, the Spool Source is unusable for large directories. And this is exacerbated when using the BlobDeserializer.

          Show
          prateekrungta Prateek Rungta added a comment - Hari Shreedharan I'm not familiar with that API, but it looks like exactly what's need in the long run. In the interim, can we apply this patch? Without applying the patch, the Spool Source is unusable for large directories. And this is exacerbated when using the BlobDeserializer.
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Thanks for the patch! Instead of adding a separate config param for this - lets make this the default behavior for random. Just ensure that you have all edge cases covered like a listed file is not available anymore - log an error and move on to the next file. Also ensure that the original filter is respected.

          Show
          hshreedharan Hari Shreedharan added a comment - Thanks for the patch! Instead of adding a separate config param for this - lets make this the default behavior for random. Just ensure that you have all edge cases covered like a listed file is not available anymore - log an error and move on to the next file. Also ensure that the original filter is respected.
          Hide
          prateekrungta Prateek Rungta added a comment -

          Initial Attempt with "BATCH" Consume Order

          Show
          prateekrungta Prateek Rungta added a comment - Initial Attempt with "BATCH" Consume Order
          Hide
          hshreedharan Hari Shreedharan added a comment -

          Once we support Java 7+ only, we can use the WatchService to make sure we don't do a file listing each time, instead keep track of the files as they get added.

          Show
          hshreedharan Hari Shreedharan added a comment - Once we support Java 7+ only, we can use the WatchService to make sure we don't do a file listing each time, instead keep track of the files as they get added.
          Hide
          prateekrungta Prateek Rungta added a comment -

          I found that by saving the directory listing across runs, I was able to get a 10x performance boost in the rate of consumption vs the `RANDOM` consume order. Also, this rate of consumption was more stable as the directory size grows.

          I've attached a patch with the changes in a new consume order: "BATCH".

          Show
          prateekrungta Prateek Rungta added a comment - I found that by saving the directory listing across runs, I was able to get a 10x performance boost in the rate of consumption vs the `RANDOM` consume order. Also, this rate of consumption was more stable as the directory size grows. I've attached a patch with the changes in a new consume order: "BATCH".

            People

            • Assignee:
              prateekrungta Prateek Rungta
              Reporter:
              prateekrungta Prateek Rungta
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development