Hadoop Common: HADOOP-2055

JobConf should have a setInputPathFilter method

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      all

    • Hadoop Flags:
      Reviewed
    • Release Note:
      This issue lets users specify which paths in the job input directory to exclude from processing (in addition to filenames that start with "_" or "."). It defines two new APIs: FileInputFormat.setInputPathFilter(JobConf, PathFilter) and FileInputFormat.getInputPathFilter(JobConf).

      Description

      It should be possible to set a PathFilter for the input to avoid taking certain files as input data within the input directories.

      1. patch2055.txt
        8 kB
        Alejandro Abdelnur

        Activity

        Owen O'Malley added a comment -

        This should be a static method on the FileInputFormat instead of JobConf, since it won't affect the framework, but only the FileInputFormat's behavior.

        Owen O'Malley added a comment -

        The method should probably also have a getter. Most such methods look like:

        public static void setInputPathFilter(JobConf job, PathFilter filter);
        public static PathFilter getInputPathFilter(JobConf job);
        
        Alejandro Abdelnur added a comment -

        A static method on FileInputFormat would make it difficult for an application that dispatches Hadoop jobs (e.g. a webapp) to set filters on a per-job basis.

        IMO, it should be configurable at job level.

        Doug Cutting added a comment -

        > IMO, it should be configurable at job level.

        Please look more closely at the static methods Owen suggested. The job is a parameter.
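
A minimal sketch of why static methods that take the job as a parameter are still per-job: the state lives in the job object, not in the class. `JobSettings` and `NameFilter` are hypothetical stand-ins for JobConf and PathFilter, and the key name is invented for this illustration; none of this is Hadoop code.

```java
import java.util.HashMap;
import java.util.Map;

class PerJobFilterSketch {
    /** Hypothetical stand-in for PathFilter: decides whether a file name is accepted. */
    interface NameFilter {
        boolean accept(String fileName);
    }

    /** Hypothetical stand-in for JobConf: holds the settings of exactly one job. */
    static class JobSettings {
        private final Map<String, Object> values = new HashMap<>();
    }

    // Hypothetical key, used only inside this sketch.
    private static final String FILTER_KEY = "sketch.input.path.filter";

    /** Static setter: the filter is stored in the job that is passed in. */
    static void setInputPathFilter(JobSettings job, NameFilter filter) {
        job.values.put(FILTER_KEY, filter);
    }

    /** Static getter: returns the filter of the given job, or null if none was set. */
    static NameFilter getInputPathFilter(JobSettings job) {
        return (NameFilter) job.values.get(FILTER_KEY);
    }

    public static void main(String[] args) {
        JobSettings jobA = new JobSettings();
        JobSettings jobB = new JobSettings();

        // Each job gets its own filter even though the methods are static.
        setInputPathFilter(jobA, name -> !name.endsWith(".crc"));
        setInputPathFilter(jobB, name -> name.startsWith("part-"));

        System.out.println(getInputPathFilter(jobA).accept("part-00000.crc")); // false
        System.out.println(getInputPathFilter(jobB).accept("part-00000.crc")); // true
    }
}
```

An application that dispatches many jobs would create one settings object per job and call the static setter on each, so filters never leak between jobs.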

        eric baldeschwieler added a comment -

        We support globbing in input paths now. Doesn't that address this need?

        E.g. *.foo

        Alejandro Abdelnur added a comment -

        Owen, Doug, got the static methods thing, that would work.

        Eric, using wildcards would not work, as they let you say what you want to include but not what you don't want to include.

        For example, I may have some files similar to the CRC files (tracking other types of information) that I would like to skip.
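
The difference between glob inclusion and filter exclusion can be sketched with plain Java, outside Hadoop. This is only an illustration under assumptions: it uses `java.nio.file.PathMatcher` globs rather than Hadoop's glob support, and the `.meta` side files are hypothetical examples of files to skip.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class GlobVersusFilterSketch {
    /** Returns the names for which the predicate is true, in their original order. */
    static List<String> select(List<String> names, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String name : names) {
            if (keep.test(name)) {
                out.add(name);
            }
        }
        return out;
    }

    /** Inclusion: keeps only names matching the glob pattern. */
    static Predicate<String> glob(String pattern) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return name -> matcher.matches(Paths.get(name));
    }

    /** Exclusion: keeps everything except names ending with the suffix. */
    static Predicate<String> excludeSuffix(String suffix) {
        return name -> !name.endsWith(suffix);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "part-00000", "part-00001", "part-00000.meta", "data.foo");

        // A glob names what to include, so everything else is dropped.
        System.out.println(select(names, glob("*.foo")));

        // An exclusion filter names what to skip and keeps the rest.
        System.out.println(select(names, excludeSuffix(".meta")));
    }
}
```

With a glob alone, skipping the `.meta` files would require enumerating every pattern that should be kept; the exclusion predicate states the rule directly.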

        Alejandro Abdelnur added a comment -

        I've figured out (IMO) a cleaner way of implementing this feature:

        Adding the following 2 instance methods to the JobConf:

        • void setInputPathFilter(Class<? extends PathFilter> pathFilter);
        • PathFilter getInputPathFilter();

        Modifying the FileInputFormat's listPaths() method to apply the hiddenFileFilter and (if set) the filter set in the jobconf.

        Globbing still works for regex-style inclusion, even if a path filter is set.

        Being able to specify a custom PathFilter makes it possible to create more complex filters, such as exclusion filters, and to make selections that cannot be expressed via regex.
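
The listing behavior described above (always apply the hidden-file filter, then the job's filter if one is set) could look roughly like the following. This is a sketch under assumptions, not the patch: file names are plain strings, `Predicate<String>` stands in for PathFilter, and `listPaths` here is a hypothetical helper rather than FileInputFormat's method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class ListPathsSketch {
    /** Rejects names starting with "_" or ".", mirroring the hidden-file rule. */
    static final Predicate<String> HIDDEN_FILE_FILTER =
            name -> !name.startsWith("_") && !name.startsWith(".");

    /**
     * Returns the names that pass the hidden-file filter and, when a user
     * filter is supplied (non-null), that filter as well.
     */
    static List<String> listPaths(List<String> names, Predicate<String> userFilter) {
        Predicate<String> effective = (userFilter == null)
                ? HIDDEN_FILE_FILTER
                : HIDDEN_FILE_FILTER.and(userFilter);
        List<String> accepted = new ArrayList<>();
        for (String name : names) {
            if (effective.test(name)) {
                accepted.add(name);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "_logs", ".part-00000.crc", "part-00000", "part-00000.meta");

        // No user filter: only the hidden-file rule applies.
        System.out.println(listPaths(names, null));

        // User filter set: both rules must accept a name.
        System.out.println(listPaths(names, name -> !name.endsWith(".meta")));
    }
}
```

Combining the two predicates with `and` means a user filter can only narrow the input further; it cannot bring hidden files back in.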

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12378554/patch2055.txt
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 3 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2048/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2048/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2048/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2048/console

        This message is automatically generated.

        Alejandro Abdelnur added a comment -

        Refactored the patch per Owen's suggestion, as the functionality is specific to file-based InputFormats.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12378623/patch2055.txt
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 3 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2061/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2061/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2061/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2061/console

        This message is automatically generated.

        Devaraj Das added a comment -

        I just committed this. Thanks, Alejandro!

        Hudson added a comment -

        Integrated in Hadoop-trunk #445 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/445/)

          People

          • Assignee:
            Alejandro Abdelnur
            Reporter:
            Alejandro Abdelnur
          • Votes:
            0
            Watchers:
            1
