  1. Flume
  2. FLUME-2911

Add includePattern option in SpoolDirectorySource configuration

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: v1.6.0, v1.7.0
    • Fix Version/s: v1.7.0
    • Component/s: Sinks+Sources
    • Labels:
    • Release Note:
      Added includePattern option to spooling directory source.
    • Flags:
      Patch

      Description

      The current implementation of SpoolDirectorySource does not allow users to specify a regex pattern to select which files should be monitored. Instead, it only allows users to specify which files should not be monitored, via the ignorePattern parameter.

      I implemented the feature, allowing users to specify the include pattern as a1.sources.src-1.includePattern=^foo.*$ (includes all files whose names start with "foo").

      By default, the includePattern regex is set to ^.*$ (all files). Include and exclude patterns can be used at the same time, and the results are combined.
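
      The way the two options combine can be sketched in a few lines of Java. This is a minimal illustration, not the code from the patch: the class and method names are hypothetical, the use of a full-name match is an assumption, and "^$" as the ignorePattern default is also an assumption of this sketch. The rule that a file matched by both patterns is ignored follows the author's explanation later in this thread.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Flume: models how an include pattern
// and an ignore pattern could be combined when selecting file names.
class SpoolFilterSketch {
    // "^.*$" is the includePattern default stated in this issue.
    static final String DEFAULT_INCLUDE = "^.*$";
    // Assumption of this sketch: an ignore default that matches no real file name.
    static final String DEFAULT_IGNORE = "^$";

    private final Pattern include;
    private final Pattern ignore;

    SpoolFilterSketch(String includePattern, String ignorePattern) {
        this.include = Pattern.compile(includePattern);
        this.ignore = Pattern.compile(ignorePattern);
    }

    // A file is selected only if it matches the include pattern and does
    // not match the ignore pattern, so ignore wins when both match.
    boolean accepts(String fileName) {
        return include.matcher(fileName).matches()
            && !ignore.matcher(fileName).matches();
    }

    List<String> select(List<String> fileNames) {
        return fileNames.stream().filter(this::accepts).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SpoolFilterSketch filter = new SpoolFilterSketch("^foo.*$", "^.*\\.tmp$");
        // foo.tmp matches both patterns and is therefore skipped.
        System.out.println(filter.select(List.of("foo.log", "foo.tmp", "bar.log")));
    }
}
```

      With the defaults above, every non-empty file name is accepted, which corresponds to the behavior before the option was added.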

          Activity

          arota Andrea Rota added a comment -

          git diff > FLUME-2911.patch

          vmanzoni Vincenzo Manzoni added a comment -

          Very interesting feature, Andrea!

          I have applications that generate log files with different file name formats. I am interested in transferring just some of them. So far, I have had trouble defining negative regexes. This patch will make my life much easier. I hope it will be accepted.

          githubbot ASF GitHub Bot added a comment -

          GitHub user andrearota opened a pull request:

          https://github.com/apache/flume/pull/60

          FLUME-2911. Added include pattern option in SpoolDir source

          The current implementation of `SpoolDirectorySource` does not allow users to specify a regex pattern to select which files should be monitored. Instead, it only allows users to specify which files should not be monitored, via the `ignorePattern` parameter.

          We implemented the feature, allowing users to specify the include pattern as `a1.sources.src-1.includePattern=^foo.*$` (includes all files whose names start with "foo").

          By default, the `includePattern` regex is set to `^.*$` (all files).

          Include and exclude patterns can be used at the same time, and the results are combined.

          We also opened this JIRA issue: https://issues.apache.org/jira/browse/FLUME-2911

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tenaris/flume FLUME-2911

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flume/pull/60.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #60


          commit 39eb89d8d86abe9a4111e44e8fff6bf3bb80fa65
          Author: Andrea Rota <andrearota37354@gmail.com>
          Date: 2016-08-04T08:09:16Z

          FLUME-2911. Added include pattern option in SpoolDir source


          bessbd Bessenyei Balázs Donát added a comment -

          Hi Andrea Rota,

          Thank you for the ticket and the patch.

          Can you please provide an example use case for this issue? (A sample config or such would be appreciated when ignorePattern alone cannot solve the problem)

          Also, I couldn't find the precedence of the options in the documentation (i.e., what happens if a path is matched by both the ignorePattern and the includePattern?). Looking at https://github.com/apache/flume/pull/60/commits/39eb89d8d86abe9a4111e44e8fff6bf3bb80fa65#diff-3cf9e63bd6dd6833df378a0e3a5f8c53R255 makes it pretty obvious, but users won't necessarily spend the time searching for the relevant parts of the code.

          arota Andrea Rota added a comment -

          Hi Bessenyei Balázs Donát, sure!

          This is the specific case we dealt with: we had a folder (D:/flume/spooldir) where we didn't know in advance what kinds of files were written by several processes. We just wanted to transport the ones ending with the .READY extension.

          remote.sources.dirSource.type = spooldir
          remote.sources.dirSource.channels = fileChannel
          remote.sources.dirSource.spoolDir = D:/flume/spooldir
          remote.sources.dirSource.includePattern = ^.*\.READY$
          remote.sources.dirSource.fileHeader = true
          remote.sources.dirSource.deletePolicy = immediate

          Declaring this kind of condition with ignorePattern requires a negative regex, which is very tricky and needs to be updated when a new type of file appears in the folder.

          When both the ignorePattern and the includePattern match a file, the code stays on the safe side and ignores the file. Do you want me to edit the documentation?

          Cheers

          arota Andrea Rota added a comment -

          Hi Bessenyei Balázs Donát, I have updated the documentation on the PR. Let me know if everything is fine.

          bessbd Bessenyei Balázs Donát added a comment -

          Hi Andrea Rota,

          Thank you for the quick response and action.

          At first glance, the PR looks fine to me. However, I don't really understand the benefit of adding a new config option instead of using ignorePattern with negative regexes.
          Could you please shed some light on this?

          arota Andrea Rota added a comment -

          Hello Bessenyei Balázs Donát, of course I can.

          Assume you have a folder where many processes write files, and among these files there are some .log files you are interested in transmitting with Flume. The processes are not under your control, and they can pollute the folder with other file types, such as .tmp files, .txt files, .dat files and so on.

          Since you don't control these processes, you cannot tell in advance which kinds of files you want to ignore, but you know for sure what you want to keep. This is a real-world example, as we have processes written by third parties over which we have no control.

          If you configure Flume with ignorePattern = ^.*\.(tmp|txt|dat)$, you will transmit .log files, but you may also send any other garbage file that you did not consider while writing the regex. Instead, if you can use the proposed includePattern, you would just declare includePattern = ^.*\.log$.

          Of course you can negate the include pattern regex and use it as the ignore pattern, as explained in http://stackoverflow.com/questions/2637675/how-to-negate-the-whole-regex, but that negative lookahead is quite tricky, and applying a double negation (ignore + negative lookahead) sounds unnatural to me.
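
          The three approaches discussed above can be compared side by side. The following is a small sketch under stated assumptions: the helper names and the sample file names are invented for illustration, and full-name matching is assumed. It only shows how the regexes behave in java.util.regex, not how Flume applies them internally.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical comparison, not Flume code: which files survive under
// a deny-list, an allow-list, and a negated allow-list used as a deny-list.
class PatternComparisonSketch {
    // Keep every name that does NOT fully match the ignore regex.
    static List<String> keptByIgnore(String ignoreRegex, List<String> names) {
        Pattern p = Pattern.compile(ignoreRegex);
        return names.stream()
                .filter(n -> !p.matcher(n).matches())
                .collect(Collectors.toList());
    }

    // Keep every name that fully matches the include regex.
    static List<String> keptByInclude(String includeRegex, List<String> names) {
        Pattern p = Pattern.compile(includeRegex);
        return names.stream()
                .filter(n -> p.matcher(n).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample names are made up; core.bak stands for an unforeseen file type.
        List<String> names = List.of("app.log", "job.tmp", "notes.txt", "dump.dat", "core.bak");

        // Deny-list of known extensions: the unforeseen core.bak slips through.
        System.out.println(keptByIgnore("^.*\\.(tmp|txt|dat)$", names));

        // Allow-list: only .log files are kept, whatever else appears.
        System.out.println(keptByInclude("^.*\\.log$", names));

        // Negative lookahead as an ignore pattern: same result, harder to read.
        System.out.println(keptByIgnore("^(?!.*\\.log$).*$", names));
    }
}
```

          Note that an alternation of extensions needs parentheses. Square brackets define a character class that matches a single character, so a pattern such as \.[TMP|TXT|DAT]$ would not match a three-letter extension.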

          What do you think?

          bessbd Bessenyei Balázs Donát added a comment -

          Hello Andrea Rota,

          Thank you for the changes.

          If you could please provide one more test where there is a conflict between includePattern and ignorePattern (to show that the configuration option works as documented), it would be super useful.

          Otherwise the patch looks good to me, +1.

          Thank you

          Donat

          arota Andrea Rota added a comment -

          Hello Bessenyei Balázs Donát, in commit ce36ceca you will find two additional tests. Both use a ReliableSpoolingFileEventReader with ignorePattern and includePattern configured at the same time. The first tests the situation where the two options are not in conflict (i.e., they apply to different files); the second tests what happens when both apply to the same file.

          Let me know if you need more information.

          Andrea

          bessbd Bessenyei Balázs Donát added a comment -

          +1, LGTM

          I've checked, the tests in flume-ng-core all run successfully after applying this patch.

          Thank you, Andrea Rota

          arota Andrea Rota added a comment -

          Hello Bessenyei Balázs Donát, can you explain what the process is now, after the LGTM, to get the patch into the next release? Cheers

          denes Denes Arvay added a comment -

          Hi Andrea Rota, I've also commented on your pull request. Please correct those issues (nothing serious) and I can +1 it too, then it'll be committed by a Flume committer. Thanks.

          arota Andrea Rota added a comment -

          Hello Denes Arvay, issues corrected and pushed into the PR. Thank you for the suggestions.

          arota Andrea Rota added a comment -

          Hi Denes Arvay, do you have any news about the PR?

          mpercy Mike Percy added a comment -

          +1, I am about to commit this patch

          jira-bot ASF subversion and git services added a comment -

          Commit 7d5ceacac49f5d15bf8f75e0209592c5524a3dda in flume's branch refs/heads/trunk from Andrea Rota
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=7d5ceac ]

          FLUME-2911. Add include pattern option in SpoolDir source

          • Documented what happens when ignorePattern and includePattern both
            match for a given file.
          • Added two tests to simulate what happens when both ignorePattern and
            includePattern options are specified
          • Refactored of ReliableSpoolingFileEventReader test and fix of code
            style violations

          Closes #60

          Reviewers: Bessenyei Balázs Donát, Denes Arvay, Attila Simon

          (Andrea Rota via Mike Percy)

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flume/pull/60

          mpercy Mike Percy added a comment -

          Pushed to trunk. Thanks for the patch, Andrea!

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Flume-trunk-hbase-1 #208 (See https://builds.apache.org/job/Flume-trunk-hbase-1/208/)
          FLUME-2911. Add include pattern option in SpoolDir source (mpercy: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=7d5ceacac49f5d15bf8f75e0209592c5524a3dda)

          • (edit) flume-ng-doc/sphinx/FlumeUserGuide.rst
          • (edit) flume-ng-core/src/main/java/org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java
          • (edit) flume-ng-core/src/main/java/org/apache/flume/source/SpoolDirectorySource.java
          • (edit) flume-ng-core/src/test/java/org/apache/flume/client/avro/TestReliableSpoolingFileEventReader.java
          • (edit) flume-ng-core/src/main/java/org/apache/flume/source/SpoolDirectorySourceConfigurationConstants.java

            People

            • Assignee:
              arota Andrea Rota
              Reporter:
              arota Andrea Rota